Article·AI Engineering & Research·Aug 8, 2025
10 min read

Understanding and Reducing Latency in Speech-to-Text APIs

By Stephen Oladele

There are several key factors to consider when selecting a speech-to-text (STT) API. You have to check for accuracy, speed, cost, and speed.

Yes, you read speed twice, because that’s how important it is in any STT API. A good rule of thumb: the lower the latency, the better the user experience, and the better the user experience, the higher the adoption of your product.

Latency is one of those rare cases where less is more. But keeping it low is easier said than done. A typical STT API is full of moving parts: transcription delays, network overhead, and API response times, any one of which can become a bottleneck.

Of course, there’s often a tradeoff. Most STT providers try to reduce latency by sacrificing accuracy. It’s the classic compromise: choose between speed and accuracy.

In this article, we’ll break down the major sources of latency in STT APIs and explore practical strategies to minimize them. You’ll also see how Nova-3 tackles the speed versus accuracy dilemma head-on.

Why Is Latency the Speech-to-Text (STT) KPI Nobody Can Ignore?

To create a fantastic user experience (UX) in a voice application, you need to reduce latency. Latency is simply the amount of time it takes for an action to occur. In the case of an STT model or application, it refers to how long it takes for the model to transcribe the user’s speech into text.

Latency becomes especially critical in real-time transcription scenarios, where users expect to see their words appear on screen as they speak, or during live captioning. The goal is to minimize the delay between spoken words and the transcribed text.

If latency is high, it will almost certainly affect the perceived performance of the STT system. While other metrics like accuracy and cost are important, latency is the most noticeable and the least forgivable.

For example, a model with a high Word Error Rate (WER) might occasionally misinterpret words during live transcription, but many users will miss the mistake. What they won’t miss is a long delay. Slow transcription breaks the flow, disrupts the experience, and draws attention to itself.

That’s why latency remains one of the most important key performance metrics when evaluating or selecting a speech-to-text model.

What are the sources of latency in STT APIs?

The answer depends on several factors, including the technology stack, model architecture, network infrastructure, and all the steps that happen between when a user speaks and when the transcription is returned. 

But here are major sources where latency tends to creep in:

  1. Network: If you're using an STT model over a network, latency can occur during the transmission of audio data to the server and the return of the transcribed text to the user. This is especially noticeable in cloud-based systems.

  2. Encoding: When the user's speech is captured in raw form, it must first be encoded into a format suitable for transmission (e.g., compressed audio). This process takes time and can introduce a delay.

  3. Transcription latency: This is the core processing time, how long it takes the model to transcribe a chunk of audio into text.

  4. Transcription formatting: After the model generates the raw text, it may need to be formatted (adding punctuation, casing, or timestamps) before the client gets the transcript. This step can introduce small but noticeable delays.

  5. Buffer size: In real-time transcription, the app may stream the audio in chunks rather than all at once. The size of each chunk (or buffer) directly affects latency. Larger buffers mean fewer interruptions but slower updates; smaller buffers provide faster feedback but may sacrifice context.

All of these factors combine to form the total latency. The saying "a chain is only as strong as its weakest link" applies perfectly to latency in STT systems. 

If even one source introduces a significant delay, it doesn't matter how fast the others are; the overall system will still feel slow.

Latency Funnel Breakdown: Where Do The Milliseconds Hide?

An STT API is made up of several components, each with its own runtime. The goal is to minimize the runtime of each stage, ideally bringing it down to just a few milliseconds, because in real-time applications, every millisecond counts.

You can think of the STT pipeline as a funnel: speech goes in and text comes out. Between those two points, the audio passes through a sequence of stages, each of which can add to the overall latency.

Below is a breakdown of the pipeline along with common sources of latency and ways to reduce them:

1️⃣ Input 

The input stage captures speech from a microphone. This might be managed by the operating system or, in web applications, by the browser.

What can cause latency:

  • Poor microphone hardware or drivers

  • High CPU load affecting audio capture

  • Delay in chunking audio due to large buffer size

How to reduce it:

  • Use smaller audio chunks (e.g., 200–250 ms) for faster feedback (see the sketch after this list)

  • Optimize device-level audio capture

  • Minimize OS or browser overhead by using low-latency audio APIs like Web Audio or WebRTC
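
As a sketch of the first point, here is one way to capture microphone audio in roughly 250 ms chunks, preconfigured to 16 kHz mono so no resampling is needed in the pre-processing stage. (The sounddevice library, the chunk size, and the queue hand-off are illustrative choices, not requirements.)

    # Capture 16 kHz mono, 16-bit PCM audio in ~250 ms chunks.
    # (sounddevice is one option: pip install sounddevice)
    import queue

    import sounddevice as sd

    SAMPLE_RATE = 16_000                 # match the STT model's expected input
    CHUNK_MS = 250                       # smaller chunks -> faster feedback
    FRAMES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000

    audio_chunks: "queue.Queue[bytes]" = queue.Queue()

    def on_audio(indata, frames, time_info, status) -> None:
        # Called by the audio driver for each chunk; keep it fast and hand
        # the bytes off to the encoding/transport stage via a queue.
        audio_chunks.put(bytes(indata))

    with sd.RawInputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        dtype="int16",
        blocksize=FRAMES_PER_CHUNK,
        callback=on_audio,
    ):
        while True:
            chunk = audio_chunks.get()   # ~250 ms of audio, ready to stream
            # ... send `chunk` to the encoder / WebSocket here ...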

2️⃣ Encoding/Pre-processing 

After capture, the system encodes raw audio into a transmission-friendly format such as Opus or FLAC.

What can cause latency:

  • Slow or inefficient encoding libraries

  • Encoding large chunks at once instead of streaming

  • Extra processing like noise suppression or resampling

How to reduce it:

  • Use fast, lightweight codecs optimized for real-time (e.g., Opus)

  • Stream audio in parallel with encoding instead of waiting for the full chunk

  • Preconfigure the microphone input to match the target model specs (e.g., 16 kHz mono) to avoid resampling

3️⃣ Transport

The system transmits encoded audio to the STT model over the network.

What can cause latency:

  • High network round-trip time (RTT)

  • Unstable or slow user internet connection

  • Using polling HTTP instead of a real-time protocol

How to reduce it:

  • Use WebSockets or gRPC streaming for real-time audio transmission

  • Deploy the STT service in a region close to the user

  • Ensure users are on stable, low-latency networks where possible

4️⃣ Inference

The STT model receives the audio and transcribes it into text.

What can cause latency:

  • Large, complex models with high computational load

  • CPU-only inference on the server

  • Cold starts in serverless or autoscaled deployments

How to reduce it:

  • Use optimized models built for real-time performance

  • Run inference on GPUs or specialized accelerators

  • Keep servers warm to avoid cold starts, or use model streaming where transcription starts before full audio is received

5️⃣ Post-processing

The system formats the raw transcription, adding punctuation, casing, speaker labels (diarization), and more.

What can cause latency:

  • Complex formatting pipelines

  • Language model passes for punctuation restoration or diarization

  • Processing full segments instead of streaming words

How to reduce it:

  • Use lightweight or integrated postprocessing tools

  • Perform postprocessing incrementally or in a streaming fashion

  • Offload formatting tasks to the client if feasible

6️⃣ Output

The client device receives the post-processed transcript.

What can cause latency:

  • Delays in pushing the response from server to client

  • Packaging too much data into one response

  • Delayed rendering in the front-end

How to reduce it:

  • Stream partial transcriptions as they are generated (see the sketch after this list)

  • Use lightweight data structures for responses

  • Optimize UI to render text as it arrives, not wait for full sentences
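
As a sketch of the first point, here is a minimal way a client might render Deepgram’s streaming responses incrementally, printing interim hypotheses as they arrive and marking text once it is finalized. (It assumes interim_results=true was set on the connection and follows the response shape of Deepgram’s /v1/listen streaming messages.)

    import json

    def handle_message(raw: str) -> None:
        """Print interim hypotheses immediately and mark finalized text."""
        response = json.loads(raw)
        alternatives = response.get("channel", {}).get("alternatives", [{}])
        transcript = alternatives[0].get("transcript", "")
        if not transcript:
            return
        if response.get("is_final"):
            print(f"[final]   {transcript}")
        else:
            print(f"[interim] {transcript}")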

How Do You Measure Latency in a Speech-to-Text (STT) API?

Measuring latency in an STT API is not always straightforward. This is because latency can stem from multiple components, each influenced by factors like network speed, server response time, and model inference time. These components can vary from one interaction to the next.

When measuring latency in STT systems, it's useful to break it down into two parts:

  • Non-transcription latency: This includes all the time spent outside the actual STT conversion, for example, the time it takes for the request to travel to the server and for the server to respond.

  • Transcription latency: This is the time it takes the STT model to actually process the audio and generate the transcription.

Measuring non-transcription latency

Non-transcription latency can be measured using a simple curl command to check how long it takes to establish a connection with the STT API (in this case, Deepgram's server):
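
    curl -s -o /dev/null -w "time_total: %{time_total}\n" https://api.deepgram.com

(The -w flag prints curl’s built-in timing variables; here we only report time_total, the full request round trip.)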

This returns a result like:
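
    time_total: 0.016930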

This means it took 16.9 milliseconds (0.016930 s × 1000 ms/s) for the request to make the round trip to Deepgram’s server. (Keep in mind that this value may vary depending on your network speed, your CPU workload, and your proximity to Deepgram’s servers.)

Measuring total latency

To measure total latency, that is, how long it takes from sending the audio to receiving a transcription, we simulate a real-time stream using a Python script. The script streams a WAV audio file to Deepgram and records how long it takes to receive transcriptions in response.

You can use this sample audio file for testing:

➡️ Sample audio

You can download the script here:

📄 Measure STT latency script

Or you can work from the minimal sketch below and save it in a file called latency.py. (The sketch assumes Deepgram’s /v1/listen streaming WebSocket endpoint, a 16-bit PCM WAV input, and an API key exported as DEEPGRAM_API_KEY; adapt it as needed.)
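
    # latency.py - a minimal sketch for measuring end-to-end STT latency
    # against Deepgram's streaming WebSocket API.
    # Assumptions: the /v1/listen endpoint, a 16-bit PCM (linear16) WAV file,
    # and an API key in the DEEPGRAM_API_KEY environment variable.

    import asyncio
    import json
    import os
    import sys
    import time
    import wave

    import websockets

    CHUNK_MS = 250  # stream the file in 250 ms chunks to simulate live speech


    async def measure(path: str) -> None:
        api_key = os.environ["DEEPGRAM_API_KEY"]

        # Read the raw PCM frames and note the format so Deepgram can decode them.
        with wave.open(path, "rb") as wav:
            sample_rate = wav.getframerate()
            channels = wav.getnchannels()
            frames_per_chunk = int(sample_rate * CHUNK_MS / 1000)
            chunks = []
            while True:
                data = wav.readframes(frames_per_chunk)
                if not data:
                    break
                chunks.append(data)

        url = (
            "wss://api.deepgram.com/v1/listen"
            f"?encoding=linear16&sample_rate={sample_rate}&channels={channels}"
        )

        send_times = []  # wall-clock time each audio chunk was sent
        latencies = []   # audio sent -> transcript received

        # Older releases of the websockets library call this parameter
        # `extra_headers` instead of `additional_headers`.
        async with websockets.connect(
            url, additional_headers={"Authorization": f"Token {api_key}"}
        ) as ws:

            async def sender() -> None:
                for chunk in chunks:
                    send_times.append(time.time())
                    await ws.send(chunk)
                    await asyncio.sleep(CHUNK_MS / 1000)  # pace like a live mic
                # Tell Deepgram the audio is finished so it flushes final results.
                await ws.send(json.dumps({"type": "CloseStream"}))

            async def receiver() -> None:
                async for message in ws:
                    response = json.loads(message)
                    alternatives = response.get("channel", {}).get("alternatives", [{}])
                    transcript = alternatives[0].get("transcript", "")
                    if transcript and send_times:
                        latency = time.time() - send_times[-1]
                        latencies.append(latency)
                        print(f"{latency:.3f}s  {transcript}")

            await asyncio.gather(sender(), receiver())

        if latencies:
            print(f"Average total latency: {sum(latencies) / len(latencies):.3f} seconds")


    if __name__ == "__main__":
        if len(sys.argv) != 2:
            sys.exit("usage: python latency.py path/to/audio.wav")
        asyncio.run(measure(sys.argv[1]))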

The script only has one dependency, and that is the websockets library. You can install it via:
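
    pip install websockets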

Then you can run the script using a command like:
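
    DEEPGRAM_API_KEY=YOUR_API_KEY python latency.py path/to/sample.wav

(Adjust the invocation to match the downloaded script if you use that instead of the sketch above.)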

You should get an output like:
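
    Average total latency: 0.674 seconds

(With the sketch above, a line is also printed for each transcript as it arrives; the exact numbers will vary with your network, hardware, and audio.)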

Calculating transcription latency

We can now compute the transcription latency using the formula:

Transcription latency = Total latency - Non-transcription latency

From the example:

  • Total latency = 0.674 seconds

  • Non-transcription latency = 0.111 seconds

So:

Transcription latency = 0.674 s - 0.111 s = 0.563 s

This means the STT model takes approximately 563 milliseconds to generate a transcription on average. This shows that the model’s speed plays the most significant role in overall latency. That’s why you need a fast model, and that is where Nova-3 comes in.

🚀 Nova-3 streaming latency ≃ 300 ms.  Try it Free →

How Nova-3 Slashes Inference Latency

Nova-3 is the latest in Deepgram’s line of speech-to-text models. It combines low WER, real-time multilingual transcription, and cost efficiency, all while maintaining the blazing speed that made Nova-2 stand out.

When Nova-2 launched, it outpaced the competition with unmatched speed. Nova-3 builds on that foundation, introducing significant accuracy gains while retaining real-time performance. It is also the first model in its class to support multilingual transcription in real time, setting a new industry standard.

In benchmark tests, Nova-3 achieved a median WER of 6.84% on real-time audio streams from diverse, real-world datasets, a 54.2% improvement over the next-best alternative at 14.92%. By comparison, Nova-2 had just an 11% lead over similar models. This leap means Nova-3 doesn’t just improve accuracy, it redefines it.

With its sharp drop in inference latency and boost in transcription quality, Nova-3 delivers the best balance between speed and accuracy. Whether you're building virtual assistants, voice analytics tools, or real-time support systems, it gives you the edge.

Next Steps: Getting Started with Building Livestreaming Transcription Applications

Building a livestreaming transcription application can often feel overwhelming. From setting up microphone access on desktop environments to configuring WebSocket connections, there are quite a few moving parts. 

That's precisely why we developed a basic starter kit, designed to assist you in quickly establishing and developing transcription apps without any complications.

Once cloned, all you need to do is install portaudio and get your Deepgram API key. Then install the Python dependencies with:
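
Assuming the kit ships a standard requirements file:

    pip install -r requirements.txt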

The starter kit supports multiple input options, so you can stream audio in real time from:

  • A local file

  • Your microphone

  • A remote audio stream

Stream from a file

To stream a local audio file using the test_suite.py script:
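
    python test_suite.py -k YOUR_DEEPGRAM_API_KEY -i path/to/your_audio.wav

(The file path here is a placeholder, and passing a file path to -i is an assumption based on the mic example below; check the kit’s README for the exact flags.)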

This simulates real-time streaming by sending the file in chunks, as if it were being spoken live. However, for true real-time testing, it's best to use your microphone.

Stream from your microphone

To stream directly from your mic:

    python test_suite.py -k YOUR_DEEPGRAM_API_KEY -i mic

Make sure you have portaudio installed for microphone access.

Stream from a remote resource

You can also stream from a live audio source over the internet:
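
    python test_suite.py -k YOUR_DEEPGRAM_API_KEY -i http://stream.live.vc.bbcmedia.co.uk/bbc_world_service

(The URL shown is an illustrative public BBC live stream; substitute any live-streamed audio URL you have access to.)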

This will transcribe the BBC World News in real time as it plays. It works with any live-streamed audio source.

Conclusion: Understanding and Reducing Latency in Speech-to-Text APIs

Speed is one of the most important metrics when it comes to real-time transcription; it cannot be overlooked. The goal is simple: keep your latency as low as possible. 

While some parts of the latency stack might be out of your control, like the workload on your CPU when capturing audio or fluctuations in your network speed, there’s one thing you can control: the model you choose.

Choose a model that doesn’t force you to trade off speed for accuracy. Choose a model that gives you both. That model is Nova-3.

Have questions or need help selecting a model for your specific use case? Our product experts can guide you directly.

👉 Talk to a Product Expert

Unlock language AI at scale with an API call.

Get conversational intelligence with transcription and understanding on the world's best speech AI platform.