Speech to Text API for next-level apps

Convert speech to text with unmatched accuracy, ultra-low latency, and enterprise scalability. Deepgram’s speech-to-text API powers everything from transcription and analytics to real-time, human-like voice agents.

Speak your mind and we'll turn it into text. No typos, no autocorrect drama.

Ready to build?

Start free with $200 in credits | Show me the docs

Trusted by the world’s top enterprises and startups

MEET FLUX

The conversational Speech to Text model

Flux is the first speech-to-text model designed for conversation, not just transcription. With built-in turn detection, ultra-low latency, and natural interruption handling, Flux enables real-time, human-like voice agents.

Integrated turn detection for natural flow
Sub-300ms end-of-turn latency
Conversational cues for agents to act on
Nova-3 level transcription accuracy

Learn more about Flux

Model Overview

Deepgram models power everything from real-time conversations to domain-specific transcription, with options for speed, accuracy, and full customization.

Flux

Conversational speech recognition for real-time voice agents with built-in turn detection, natural interruption handling, and ultra-low latency.

Learn More

Nova-3

High-performance speech-to-text for production transcription with top accuracy, multilingual support, and noise robustness.

Learn More

Industry-tuned

Specialized speech-to-text models optimized for the vocabulary and structure of domains like healthcare, legal, and finance.

Learn More

Custom

Custom speech-to-text models trained on proprietary or novel datasets for maximum accuracy in edge-case scenarios.

Learn More

Built for the real world

Deepgram models maintain high transcription accuracy even with background noise, accents, or overlapping speech, making them ideal for real conversations.

Learn More

Speech to Text in 36+ languages

Build global applications with Deepgram’s speech-to-text API, which supports transcription in over 36 languages and dialects for real-time and recorded audio.

Explore the Languages

Ultra-low latency for real-time apps

Deepgram delivers transcripts in under 300 milliseconds, enabling voice agents and conversational AI to respond instantly and naturally.

Learn More

Discover Speech to Text capabilities

Deepgram’s speech-to-text features give developers everything they need to produce accurate, readable, and secure transcripts out of the box.

View all features

Keyterm prompting

Improve recognition of critical words or phrases with up to 90% higher keyword recall rate (KRR).
Learn more →

Filler words

Transcribe filler words such as “uh” and “um” to capture a more natural, verbatim transcript.
Learn more →

Smart formatting

Enhance readability with automatic punctuation, capitalization, and paragraphing.
Learn more →

Diarization

Detect speaker changes and label who said what in multi-speaker audio.
Learn more →

Numerals

Convert spelled-out numbers into digits (e.g., “one hundred” → “100”) for consistency.
Learn more →

Redaction

Automatically remove sensitive or personal information from transcripts.
Learn more →
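
As a rough sketch of how these features might be switched on per request, the Python snippet below builds a query string for a transcription request. The endpoint URL and the parameter names (`smart_format`, `diarize`, `numerals`, `filler_words`, `redact`, `keyterm`) are assumptions not stated on this page; check them against the Deepgram API reference before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint; not stated on this page.
BASE_URL = "https://api.deepgram.com/v1/listen"


def build_listen_url(model="nova-3", keyterms=(), **features):
    """Build a request URL with feature flags encoded as query parameters.

    Boolean flags are rendered as "true"/"false"; each key term is sent as
    a repeated `keyterm` parameter. All parameter names are assumptions.
    """
    params = [("model", model)]
    for name, value in sorted(features.items()):
        if isinstance(value, bool):
            value = "true" if value else "false"
        params.append((name, str(value)))
    params.extend(("keyterm", term) for term in keyterms)
    return f"{BASE_URL}?{urlencode(params)}"


url = build_listen_url(
    smart_format=True,
    diarize=True,
    numerals=True,
    filler_words=True,
    redact="pii",
    keyterms=["Deepgram", "Nova-3"],
)
print(url)
```

Keeping the flags in one helper makes it easy to see which features a given request has enabled.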

Power real-world solutions with Speech to Text

Deepgram’s speech-to-text API enables accurate and scalable transcription across industries, including customer support, healthcare, media, and conversational AI.

Contact Centers

Accurate speech-to-text for call transcription, real-time analytics, and improved customer support. Flexible deployment and custom models scale across industries.

Learn More

Medical Transcription

Healthcare-ready speech-to-text that captures medical terms and specialized keywords at scale. Ensure compliance with HIPAA and industry standards while reducing documentation time. Real-time transcription supports faster clinical workflows and improves patient care outcomes.

Up to 40x faster transcription of pre-recorded audio than alternatives.

Learn More

Conversational AI

Real-time speech-to-text with ultra-low latency and turn detection for human-like voice agents. Built for understanding complex conversations.

High-accuracy transcription
Custom vocabulary injection
Speaker diarization
Topic & language detection
End-of-thought detection

Learn More

Speech Analytics

Convert audio into text to analyze conversations, detect intent, and generate actionable insights.

Learn More

Media Transcription

Fast, affordable transcription for podcasts, videos, and broadcasts with accurate captions and summaries.

Rich content captioning
SEO and audience expansion
Content moderation & analytics
Searchability & user experience
Streamline workflows

Learn More

FAQs

Q: What is speech to text and how does it work?

A: Speech to text (STT), also called automatic speech recognition (ASR), converts spoken audio into written text. A trained model analyzes the audio signal and predicts the most likely sequence of words. STT powers use cases like transcription, analytics, accessibility, and conversational AI.

Q: What is a speech-to-text API?

A: A speech-to-text API is a developer interface that makes speech recognition accessible in apps and services. With Deepgram’s API, you can stream audio for real-time transcription or submit recordings for batch processing at scale.
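
To make the batch path concrete, here is a minimal sketch that prepares (but does not send) a request asking the API to transcribe a hosted recording. The endpoint, the `Token` authorization scheme, the JSON body shape, and the placeholder key and audio URL are all assumptions not taken from this page.

```python
import json
import urllib.request

# Assumed endpoint; not stated on this page.
LISTEN_URL = "https://api.deepgram.com/v1/listen?model=nova-3"


def build_batch_request(api_key, audio_url):
    """Prepare a POST request that points the API at a hosted audio file."""
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        LISTEN_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only.
request = build_batch_request("YOUR_API_KEY", "https://example.com/call.wav")
print(request.get_method(), request.full_url)
# Sending it would look like: urllib.request.urlopen(request)
```

The real-time streaming path is not shown here.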

Q: Does Deepgram support multichannel audio transcription?

A: Yes. Deepgram can process multichannel audio to separate speakers or combine channels for clarity. Nova-3 is especially strong for meetings and call transcription.
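
A small sketch of how per-channel results might be read back. The response shape used here (`results.channels[i].alternatives[0].transcript`) and the sample payload are assumptions for illustration, not taken from this page.

```python
def transcripts_by_channel(response):
    """Return one transcript string per audio channel, in channel order."""
    channels = response.get("results", {}).get("channels", [])
    transcripts = []
    for channel in channels:
        alternatives = channel.get("alternatives", [])
        # Use the first alternative; fall back to an empty string if none.
        transcripts.append(alternatives[0]["transcript"] if alternatives else "")
    return transcripts


# Hypothetical two-channel call: agent on channel 0, customer on channel 1.
sample_response = {
    "results": {
        "channels": [
            {"alternatives": [{"transcript": "Thanks for calling, how can I help?"}]},
            {"alternatives": [{"transcript": "I'd like to update my address."}]},
        ]
    }
}

for index, text in enumerate(transcripts_by_channel(sample_response)):
    print(f"channel {index}: {text}")
```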

Q: What features does Deepgram support for transcription outputs?

A: Deepgram transcripts include smart formatting for punctuation, capitalization, and paragraphing, along with speaker diarization to identify who is talking. You can also enable numeral conversion so numbers are written as digits, transcribe filler words like “uh” and “um,” and apply keyterm prompting to improve recognition of specialized terms. Optional redaction removes personal information directly in the transcript. For a full list of transcription features, see our documentation.
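
For example, diarized output could be turned into a speaker-labeled transcript along these lines. The word-level shape (a list of dicts with `word` and `speaker` keys) is an assumption about the response format, and the sample words are made up.

```python
from itertools import groupby


def speaker_turns(words):
    """Group consecutive words by speaker into (speaker, text) turns."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["word"] for w in group)))
    return turns


# Hypothetical diarized words.
sample_words = [
    {"word": "hi", "speaker": 0},
    {"word": "there", "speaker": 0},
    {"word": "hello", "speaker": 1},
    {"word": "um", "speaker": 1},
    {"word": "thanks", "speaker": 0},
]

for speaker, text in speaker_turns(sample_words):
    print(f"Speaker {speaker}: {text}")
```

`groupby` only merges adjacent items, so a speaker who talks again later starts a new turn rather than being folded into an earlier one.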

Q: What are the key differences between Nova-3 and Flux?

A: Nova-3 is optimized for transcription at scale with best-in-class accuracy, multilingual support, and robustness in noisy environments. Flux is optimized for real-time conversation with built-in turn detection, natural interruption handling, and turn-complete transcripts.

Q: How accurate are Deepgram’s speech-to-text models?

A: Nova-3 delivers industry-leading accuracy with more than 50% lower word error rate (WER) compared to competitors in both streaming and batch transcription. Flux offers the same transcription accuracy but is optimized for real-time conversation with turn detection and low latency. For detailed benchmarks and comparisons, see our Nova-3 launch blog.
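
For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a generic implementation, not Deepgram's benchmarking code. Its text normalization (lowercasing and whitespace splitting) and the sample sentences are assumptions for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "please transfer one hundred dollars to savings"
baseline = "please transfer one hundred dollar to saving account"
improved = "please transfer one hundred dollars to saving"

print(word_error_rate(reference, baseline))
print(word_error_rate(reference, improved))
```

In this made-up example the first hypothesis has three word errors against seven reference words and the second has one, so the second has a relative WER reduction of about 67%.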

Q: What models does Deepgram offer?

A: Deepgram’s Speech-to-Text API includes multiple models to fit your needs: Flux for real-time conversation, Nova-3 for best-in-class accuracy, Industry-Tuned models for specialized domains, Custom models trained on proprietary datasets, and Nova-2 for cost-efficient transcription. See our Models Overview for details.

Q: How do I get started with Deepgram’s Speech-to-Text API?

A: Sign up for a free Deepgram account to access your API key. You can test models instantly in the Playground or jump into building with our starter apps on GitHub.

Q: How much does Deepgram speech-to-text cost?

A: Pricing depends on the model: Nova-3 for highest accuracy, Flux for conversational AI, Nova-2 for cost efficiency, and Industry-Tuned or Custom models for specialized domains. Visit our pricing page or start free in the Playground.

Q: Which model should I use for general transcription tasks?

A: Use Nova-3 for production transcription across meetings, media, and analytics. Industry-Tuned or Custom models may improve accuracy further for specialized domains.

Q: Which model is best for voice agents, chatbots, or contact centers?

A: Choose Flux for real-time conversational AI. It enables voice agents to respond naturally with turn detection, interruption handling, and low-latency events.
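
The guidance in the last few answers can be condensed into a small lookup. This is only an illustrative sketch: the use-case labels are hypothetical, and the returned values are the model names as written on this page, not API identifiers.

```python
# Use-case labels are hypothetical; model names are as written on this page.
RECOMMENDED_MODEL = {
    "voice_agent": "Flux",
    "chatbot": "Flux",
    "contact_center": "Flux",
    "meetings": "Nova-3",
    "media": "Nova-3",
    "analytics": "Nova-3",
    "cost_sensitive": "Nova-2",
}


def recommend_model(use_case, specialized_domain=False):
    """Return the model this page suggests for a use case."""
    if specialized_domain:
        # Per the FAQ, these may improve accuracy for specialized domains.
        return "Industry-Tuned or Custom"
    # General transcription defaults to Nova-3.
    return RECOMMENDED_MODEL.get(use_case, "Nova-3")


print(recommend_model("voice_agent"))
print(recommend_model("meetings"))
```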

Trusted by startups and enterprises

Discover the power of our product through real stories.

Ready to get started?

Start building voice-first applications today with Deepgram’s speech-to-text API. It is fast, accurate, scalable, and easy to integrate.