The speech layer underneath your Salesforce voice agents decides whether your CRM fills with useful data or expensive noise. Get it wrong, and transcription errors in account numbers, product names, and customer requests cascade into bad case records and failed automations.
You'll often catch the damage too late, after agents are live and data quality has already degraded. Now that Agentforce Voice is GA and voice is a first-class CRM channel, the integration choices you make at the speech layer matter more than ever.
This article maps the architecture that search results don't cover. You'll learn the two paths available, what a production pipeline actually looks like, and how to choose an STT provider based on your domain complexity and call volume.
Key Takeaways
Here's what you need to know before architecting Salesforce voice agents for production:
- Agentforce Voice uses OpenAI Whisper for STT but publishes no WER or latency specs.
- The Salesforce Telephony Integration API is STT-agnostic. It ingests transcript text regardless of which provider produced it.
- Amazon Connect supports third-party STT provider swaps at the configuration layer, with no Lambda or flow changes.
- Flex Credits price voice actions, separate from telephony licensing.
Why Salesforce Voice Agents Rise or Fall on the Speech Layer
Your STT provider determines whether Salesforce voice agents create useful CRM records or pollute your data with transcription errors. Every misspelled product name, garbled account number, or missed intent becomes a corrupted Salesforce object.
How Transcription Errors Cascade into CRM Data
A voice agent transcribing "policy number BRX-4492" as "policy number BRS-4492" creates a lookup failure. The case record ties to the wrong account or creates a duplicate.
Downstream automations, assignment rules, and reporting inherit that initial error. Multiply this by thousands of daily calls, and your CRM quietly becomes unreliable. You usually notice right around the time someone runs the quarterly report.
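One way to catch this class of error before it reaches a lookup is to compare the transcribed identifier against known values and flag near misses for confirmation. The sketch below is a minimal illustration, not a Salesforce feature: the identifier format, the `check_policy_number` helper, and the set of known policies are all hypothetical.

```python
import re
from difflib import get_close_matches

# Hypothetical identifier format: three letters, a hyphen, four digits.
POLICY_PATTERN = re.compile(r"\b[A-Z]{3}-\d{4}\b")


def check_policy_number(transcript, known_policies):
    """Classify the first policy number found in a transcript.

    Returns (status, value), where status is "exact", "near_miss",
    or "not_found".
    """
    match = POLICY_PATTERN.search(transcript.upper())
    if match is None:
        return ("not_found", None)
    heard = match.group(0)
    if heard in known_policies:
        return ("exact", heard)
    # An unknown ID that closely resembles a known one is more likely
    # a transcription error than a genuinely new policy.
    candidates = get_close_matches(heard, sorted(known_policies), n=1, cutoff=0.8)
    if candidates:
        return ("near_miss", candidates[0])
    return ("not_found", heard)


known = {"BRX-4492", "QRT-1001"}
print(check_policy_number("policy number BRS-4492", known))
```

A near miss is a prompt to re-confirm with the caller, not a license to auto-correct. This check also can't help when the misheard value happens to be another valid policy number, which is why upstream accuracy still matters.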
Where Bundled STT Falls Short in Enterprise Workflows
Salesforce's STT service is built on OpenAI Whisper models adapted for streaming use. Its internal benchmarks run against the LibriSpeech ASR Corpus, which consists of clean read speech recorded in controlled conditions.
No WER figures or latency thresholds from these benchmarks are publicly disclosed. LibriSpeech results tend to look more favorable than the unscripted telephony audio you'll actually run in production. Background noise, varied accents, spontaneous speech, and domain-specific vocabulary all degrade accuracy.
The Cost of Rework When Voice Data Enters Salesforce Dirty
Background noise, strong accents, and technical jargon affect transcription accuracy in Service Cloud Voice deployments. Call transfers can also produce duplicate or missing transcript segments. Each of these issues triggers manual review, case correction, and re-routing. You're paying agent time to clean up what the voice agent got wrong, the kind of work nobody budgeted for.
Two Paths to Voice Agents in Salesforce
You can build voice agents natively with Agentforce Voice. You can also integrate external voice AI APIs through Salesforce's telephony and middleware layers. Each path trades off control, accuracy, and complexity differently.
Native Agentforce Voice: What It Handles Today
Agentforce Voice, GA since the Winter '26 release cycle, handles autonomous inbound and outbound calls over PSTN and SIP trunking. It supports customer interruption mid-sentence, automatic conversation logging, and context transfer on escalation to human agents.
The Agentforce Mobile SDK, GA in Summer '26, supports iOS apps with injectable STT and TTS parameters through Swift optionals. Telephony partners include Amazon Connect, Five9, NICE, Vonage, and Genesys via SIP trunking.
The voice channel's supported language list isn't publicly enumerated in official Salesforce documentation as of 2026. Chat and text channels support English plus six additional languages at GA, with nine more in beta. You should verify the voice-specific language matrix directly with your Salesforce account team before you commit.
External Voice AI APIs with Salesforce as the CRM Layer
The Salesforce Telephony Integration API ingests pre-processed transcript text through REST endpoints. Those endpoints write to VoiceCall and ConversationEntry objects. It doesn't care which STT engine produced that text. This means you can use a dedicated provider at the telephony layer and still land clean transcripts in Salesforce.
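To make the STT-agnostic idea concrete, here is a minimal sketch of a normalization layer that converts two different provider result shapes into one common segment before anything is sent to Salesforce. Both provider shapes and every payload field name below are hypothetical placeholders, not the Telephony Integration API's actual schema. Check the API reference for the real request format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptSegment:
    call_id: str
    speaker: str  # "customer" or "agent"
    text: str
    start_ms: int
    end_ms: int


def from_provider_a(call_id, result):
    # Hypothetical shape: speaker label plus start/end times in seconds.
    return TranscriptSegment(
        call_id,
        result["speaker"],
        result["transcript"].strip(),
        round(result["start"] * 1000),
        round(result["end"] * 1000),
    )


def from_provider_b(call_id, result):
    # Hypothetical shape: audio channel index plus offset/duration in ms.
    speaker = "customer" if result["channel"] == 0 else "agent"
    start = result["offset_ms"]
    return TranscriptSegment(
        call_id, speaker, result["text"].strip(), start, start + result["duration_ms"]
    )


def to_ingest_payload(segment):
    # Placeholder field names: map these to the real request schema
    # in the Telephony Integration API reference before use.
    return {
        "callId": segment.call_id,
        "sender": segment.speaker,
        "content": segment.text,
        "startTimeMs": segment.start_ms,
        "endTimeMs": segment.end_ms,
    }
```

Because everything downstream sees only `TranscriptSegment`, swapping providers means writing one new adapter function rather than touching the code that writes to Salesforce.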
Amazon Connect's third-party STT configuration explicitly supports provider substitution at the locale and bot level. No changes to contact flows or Lambda functions are required. For BYOT deployments through partners like Twilio or Five9, the partner controls STT engine selection. It then pushes transcript data into Salesforce through the same API endpoints.
When to Choose Each Approach
Native Agentforce Voice works well when your call volume is moderate, your domain vocabulary is general, and you want the fastest time to deployment.
You should choose the external API path when your workflows involve specialized terminology, such as medical codes, financial product names, or alphanumeric identifiers. It's also the better fit when you need published accuracy benchmarks or your telephony partner already supports custom STT configuration.
What a Voice-to-CRM Pipeline Looks Like in Production
A production pipeline for Salesforce voice agents streams audio through STT, routes the transcript for intent resolution, and writes structured data to Salesforce objects in real time. You'll feel the trade-offs most in latency, transcript quality, and handoff reliability.
The Streaming Architecture from Call to CRM Record
In an Amazon Connect deployment with Service Cloud Voice, the pipeline flows through two Lambda stages documented in the Salesforce Voice Developer Guide. First, the kvsConsumerTrigger Lambda function is invoked within the Amazon Connect contact flow to set up real-time call transcription.
It then calls the invokeTelephonyIntegrationApi Lambda, which generates a JWT token for authenticating into Salesforce's Telephony Integration API. That same Lambda starts Amazon real-time transcription by hooking the audio stream to the STT service.
Transcripts are sent to the rep console in real time. They are persisted to VoiceCall and ConversationEntry objects asynchronously.
For BYOT deployments, the partner's connector replaces the Lambda layer. It still writes to the same Salesforce objects through the Service Cloud Connector API. Regardless of which pattern you use, the lightning-service-cloud-voice-toolkit-api Lightning web component surfaces STT output in the agent console from API version 52.0 onward.
Where Latency and Accuracy Trade-offs Hit Hardest
According to the Salesforce engineering team behind its Whisper-based STT service, 500ms is the delivery goal for captions to remain effective. Any external STT API in a custom middleware pattern must meet or beat this threshold with network round-trip time included.
Real-time agent coaching, next-best-action recommendations, and live supervisor monitoring all depend on transcript data arriving within this window. Batch accuracy matters less than streaming accuracy here. Your STT provider must perform well on partial utterances, not just completed sentences.
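A quick way to reason about the 500ms goal is to treat it as a budget and subtract each hop. The sketch below is a rough illustration: the function names and the sample latency numbers are made up, and the percentile uses a simple nearest-rank method. Measure your own pipeline under load rather than relying on these figures.

```python
CAPTION_BUDGET_MS = 500  # delivery goal cited in this section


def remaining_budget_ms(stt_latency_ms, network_rtt_ms, middleware_ms=0,
                        budget_ms=CAPTION_BUDGET_MS):
    """Milliseconds left in the budget; a negative result means it is exceeded."""
    return budget_ms - (stt_latency_ms + network_rtt_ms + middleware_ms)


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


# Illustrative measurements only (ms); replace with your own load-test data.
stt_samples = [210, 230, 250, 240, 480, 220, 260, 235, 245, 255]
print(remaining_budget_ms(p95(stt_samples), network_rtt_ms=80, middleware_ms=40))
```

Budgeting against a high percentile rather than the mean matters here, because the slow tail is what reps and supervisors actually notice during a live call.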
Handling Handoffs Between Voice Agents and Human Agents
Agentforce Voice transfers case history, purchase data, and the full conversation transcript when escalating to a human agent. Supervisors can monitor live transcripts through the Command Center. The quality of that handoff depends on transcript accuracy. If the AI agent misheard the customer's issue, the human agent inherits a misleading case summary.
Choosing an STT and TTS Provider for Salesforce Workflows
Your STT and TTS provider choice affects transcription accuracy, voice agent latency, cost per call minute, and how cleanly data flows into Salesforce fields. You should evaluate providers on your own call audio, not on polished demo conditions.
Accuracy and Domain-Specific Terminology
General-purpose STT models struggle with industry vocabulary. Elerian AI, described in a Deepgram customer case study, builds NLU-driven digital agents for contact centers in South Africa. The case study says general ASR models delivered roughly 70% accuracy across their caller base.
After integrating Deepgram's trained speech models alongside their own NLU layer, they reported over 90% accuracy on domain-specific entity recognition. That is a customer-reported hybrid-stack outcome, not a standalone Nova-3 product spec. Deepgram's Nova-3 model delivers a confirmed 5.26% WER.
Keyterm Prompting lets you specify up to 100 domain-specific terms at inference time, with no retraining required. If your Salesforce workflows handle structured identifiers like policy numbers or order codes, this difference determines whether automated lookups succeed or fail.
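As a sketch of what inference-time keyterms might look like in practice, the helper below deduplicates a term list, enforces the 100-term cap mentioned above, and builds a query string. The `keyterm` parameter name and the `nova-3` model string are assumptions here. Confirm both, along with the endpoint you attach this to, against Deepgram's current API documentation.

```python
from urllib.parse import urlencode

MAX_KEYTERMS = 100  # cap stated in this article


def build_listen_query(keyterms, model="nova-3"):
    """Build a query string with one `keyterm` entry per unique term.

    Parameter names are assumptions; verify against the provider's docs.
    """
    cleaned = []
    for term in keyterms:
        term = term.strip()
        if term and term not in cleaned:
            cleaned.append(term)
    if len(cleaned) > MAX_KEYTERMS:
        raise ValueError(f"{len(cleaned)} keyterms exceeds the cap of {MAX_KEYTERMS}")
    return urlencode([("model", model)] + [("keyterm", t) for t in cleaned])


print(build_listen_query(["BRX-4492", "Flex Credits", "BRX-4492"]))
```

Keeping the term list in configuration rather than in code lets you update product names and identifier prefixes without a redeploy.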
Latency Requirements for Real-Time Voice Agents
Voice agents need streaming transcription that keeps pace with natural conversation. A response that arrives even 200ms late creates an unnatural pause. When you evaluate providers, test streaming latency under load, not just single-call benchmarks.
Cost Modeling for Voice Minutes at Enterprise Scale
Salesforce bills voice actions through Flex Credits. You also pay for a Service Cloud Voice telephony platform license. If you're on Amazon Connect, AWS charges separately for voice usage. Your STT provider adds another cost layer. Model total cost per call across every layer before you commit.
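A per-call cost model can be as simple as summing each layer for one completed call. The sketch below assumes hypothetical input names and placeholder rates; none of the numbers reflect actual Salesforce, AWS, or STT vendor pricing, so substitute the figures from your own contracts.

```python
from dataclasses import dataclass


@dataclass
class CallCostInputs:
    call_minutes: float
    voice_actions: int            # billable voice actions in the call
    credits_per_action: float     # Flex Credits consumed per action
    price_per_credit: float       # contract price per Flex Credit
    telephony_per_minute: float   # e.g. telephony usage charge
    stt_per_minute: float         # STT provider charge
    license_per_month: float      # telephony platform license cost
    calls_per_month: int


def cost_per_call(c):
    """Break one call's cost down by layer and return the total."""
    layers = {
        "flex_credits": c.voice_actions * c.credits_per_action * c.price_per_credit,
        "telephony": c.call_minutes * c.telephony_per_minute,
        "stt": c.call_minutes * c.stt_per_minute,
        "license_share": c.license_per_month / c.calls_per_month,
    }
    layers["total"] = sum(layers.values())
    return layers


# Placeholder rates for illustration only.
example = CallCostInputs(
    call_minutes=4.0, voice_actions=2, credits_per_action=10, price_per_credit=0.25,
    telephony_per_minute=0.5, stt_per_minute=0.25,
    license_per_month=1000.0, calls_per_month=500,
)
print(cost_per_call(example))
```

Returning the per-layer breakdown alongside the total makes it easier to see which layer dominates as call length or volume changes.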
How to Start Building Voice Agents for Salesforce Today
Start with a single use case. Validate STT accuracy against your domain terminology. Then expand voice agent coverage as CRM data quality proves out.
Pick a Use Case with Measurable CRM Impact
Choose a workflow where transcription accuracy directly affects a measurable Salesforce metric: case deflection rate, average handle time, or self-service containment. Account verification calls work well because the success criteria are binary. Either the system correctly transcribed the account number, or it didn't.
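Because the success criterion is binary, you can score it with an exact-match rate over pairs of expected and transcribed identifiers. This is a minimal sketch; the normalization rule (ignore case, spaces, and punctuation) is an assumption you should adjust to your own ID formats.

```python
def normalize_id(value):
    """Uppercase and keep only letters and digits, so 'brx 4492' == 'BRX-4492'."""
    return "".join(ch for ch in value.upper() if ch.isalnum())


def exact_match_rate(pairs):
    """Fraction of (expected, transcribed) pairs that match after normalization."""
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(1 for expected, heard in pairs
               if normalize_id(expected) == normalize_id(heard))
    return hits / len(pairs)


pairs = [
    ("BRX-4492", "brx 4492"),   # formatting differs, identifier is correct
    ("BRX-4492", "BRS-4492"),   # one wrong character, lookup would fail
    ("QRT-1001", "QRT-1001"),
    ("ABC-0001", "ABC-0001"),
]
print(exact_match_rate(pairs))
```

Unlike WER, this metric gives no partial credit: one wrong character counts as a full miss, which mirrors how an account lookup behaves.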
Set Up Your STT Provider and Telephony Layer
If you're running Amazon Connect with Service Cloud Voice, configure your preferred STT provider at the Amazon Connect locale level. No Lambda or flow changes are needed. If you're using a BYOT partner, confirm whether that partner supports custom STT configuration.
For mobile deployments on iOS, the Agentforce Mobile SDK accepts custom implementations of the AgentforceSpeechTranscriber protocol through the speechRecognizer parameter. Deepgram's Voice Agent API combines STT, TTS, and LLM orchestration. That can simplify the middleware layer if you're building a custom integration.
Measure What Matters Before Scaling
Run a proof-of-concept against your own call recordings with ground-truth transcripts. Compare WER across providers using real telephony audio, not clean benchmark datasets. Track how transcription accuracy maps to CRM data quality metrics: field-level accuracy in VoiceCall objects, automation success rates, and case reclassification frequency.
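If you don't already have an evaluation harness, WER is straightforward to compute yourself: it's the word-level edit distance between the ground-truth transcript and the provider's output, divided by the number of reference words. The sketch below uses a standard dynamic-programming edit distance with lowercase, whitespace-based tokenization, which is a simplifying assumption. Production comparisons usually also normalize punctuation and numerals.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("policy number brx 4492", "policy number brs 4492"))
```

Run the same function over every provider's output for the same recordings so the comparison uses identical tokenization and identical audio.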
Deepgram's Audio Intelligence features, including sentiment analysis, topic detection, and intent recognition, can help you evaluate transcript quality at scale before expanding to additional use cases.
Ready to test STT accuracy against your Salesforce workflows? Grab $200 in free credits and confirm the current offer at signup.
FAQ
Can You Use a Third-Party STT Provider with Salesforce Service Cloud Voice?
Yes. The swap happens at the telephony layer, not inside Salesforce. In practice, that means your migration work usually centers on Amazon Connect or your BYOT partner, while Salesforce still receives transcript text through the same API objects.
Does Agentforce Voice Support Languages Beyond English?
Public Salesforce documentation doesn't clearly enumerate the voice language list as of 2026. Before procurement, you should validate the voice-specific language matrix for your exact deployment path, especially if your rollout depends on one language for self-service containment.
How Does Salesforce Pricing Work for Voice Agent Conversations?
Voice interactions use Flex Credits, not a flat per-conversation rate. You'll also need telephony licensing, and Amazon Connect adds its own usage charges. The practical step is to model one completed call across every layer before you approve a rollout.
What Compliance Certifications Matter for Voice Agents in Salesforce?
Salesforce lists several Agentforce attestations, but you should still check scope. The useful procurement question is whether the controls you need apply to the voice workflow you plan to deploy, not just to the broader platform.
How Do You Measure Voice Agent ROI in Salesforce?
Track case deflection rate, field-level accuracy in VoiceCall objects, and average handle time on escalated calls. The key is to connect transcript quality to CRM outcomes, so you can tell whether better speech recognition is reducing manual cleanup work or just moving it around.









