Article·AI Engineering & Research·Aug 8, 2025
10 min read

Understanding and Reducing Latency in Speech-to-Text APIs

By Stephen Oladele

There are several key factors to consider when selecting a speech-to-text (STT) API. You have to check for accuracy, speed, cost, and speed.

Yes, you read speed twice, because that’s how important it is in any STT API. A good rule of thumb: the lower the latency, the better the user experience, and the better the user experience, the higher the adoption of your product.

Latency is one of those rare cases where less is more. But keeping it low is easier said than done. A typical STT API is full of moving parts: transcription delays, network overhead, and API response times, any one of which can become a bottleneck.

Of course, there’s often a tradeoff. Most STT providers try to reduce latency by sacrificing accuracy. It’s the classic compromise: choose between speed and accuracy.

In this article, we’ll break down the major sources of latency in STT APIs and explore practical strategies to minimize them. You’ll also see how Nova-3 tackles the speed versus accuracy dilemma head-on.

Why Is Latency the Speech-to-Text (STT) KPI Nobody Can Ignore?

To create a fantastic user experience (UX) in a voice application, you need to reduce latency. Latency is simply the amount of time it takes for an action to occur. In the case of an STT model or application, it refers to how long it takes for the model to transcribe the user’s speech into text.

Latency becomes especially critical in real-time transcription scenarios, where users expect to see their words appear on screen as they speak, or during live captioning. The goal is to minimize the delay between spoken words and the transcribed text.

If latency is high, it will almost certainly affect the perceived performance of the STT system. While other metrics like accuracy and cost are important, latency is the most noticeable and the least forgivable.

For example, a model with a high Word Error Rate (WER) might occasionally misinterpret words during live transcription, but many users will miss the mistake. What they won’t miss is a long delay. Slow transcription breaks the flow, disrupts the experience, and draws attention to itself.

That’s why latency remains one of the most important key performance metrics when evaluating or selecting a speech-to-text model.

What are the sources of latency in STT APIs?

The answer depends on several factors, including the technology stack, model architecture, network infrastructure, and all the steps that happen between when a user speaks and when the transcription is returned. 

But here are major sources where latency tends to creep in:

  1. Network: If you're using an STT model over a network, latency can occur during the transmission of audio data to the server and the return of the transcribed text to the user. This is especially noticeable in cloud-based systems.

  2. Encoding: When the user's speech is captured in raw form, it must first be encoded into a format suitable for transmission (e.g., compressed audio). This process takes time and can introduce a delay.

  3. Transcription latency: This is the core processing time, how long it takes the model to transcribe a chunk of audio into text.

  4. Transcription formatting: After the model generates the raw text, it may need to be formatted (adding punctuation, casing, or timestamps) before the client gets the transcript. This step can introduce small but noticeable delays.

  5. Buffer size: In real-time transcription, the app may stream the audio in chunks rather than all at once. The size of each chunk (or buffer) directly affects latency. Larger buffers mean fewer interruptions but slower updates; smaller buffers provide faster feedback but may sacrifice context.

All of these factors combine to form the total latency. The saying "a chain is only as strong as its weakest link" applies perfectly to latency in STT systems. 

If even one source introduces a significant delay, it doesn't matter how fast the others are; the overall system will still feel slow.

Latency Funnel Breakdown: Where Do The Milliseconds Hide?

An STT API is made up of several components, each with its own runtime. The goal is to minimize the runtime of each stage, ideally bringing it down to just a few milliseconds, because in real-time applications, every millisecond counts.

You can think of the STT pipeline as a funnel: speech goes in and text comes out. Between those two points, the audio passes through a sequence of stages, each of which can add to the overall latency.

Below is a breakdown of the pipeline along with common sources of latency and ways to reduce them:

1️⃣ Input 

The input stage captures speech from a microphone. This might be managed by the operating system or, in web applications, by the browser.

What can cause latency:

  • Poor microphone hardware or drivers

  • High CPU load affecting audio capture

  • Delay in chunking audio due to large buffer size

How to reduce it:

  • Use smaller audio chunks (e.g., 200–250 ms) for faster feedback (see the sketch after this list)

  • Optimize device-level audio capture

  • Minimize OS or browser overhead by using low-latency audio APIs like Web Audio or WebRTC
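
As a sketch of the first point, here is one way to capture microphone audio in roughly 250 ms chunks, preconfigured to 16 kHz mono so no resampling is needed in the pre-processing stage. (The sounddevice library, the chunk size, and the queue hand-off are illustrative choices, not requirements.)

    # Capture 16 kHz mono, 16-bit PCM audio in ~250 ms chunks.
    # (sounddevice is one option: pip install sounddevice)
    import queue

    import sounddevice as sd

    SAMPLE_RATE = 16_000                 # match the STT model's expected input
    CHUNK_MS = 250                       # smaller chunks -> faster feedback
    FRAMES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000

    audio_chunks: "queue.Queue[bytes]" = queue.Queue()

    def on_audio(indata, frames, time_info, status) -> None:
        # Called by the audio driver for each chunk; keep it fast and hand
        # the bytes off to the encoding/transport stage via a queue.
        audio_chunks.put(bytes(indata))

    with sd.RawInputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        dtype="int16",
        blocksize=FRAMES_PER_CHUNK,
        callback=on_audio,
    ):
        while True:
            chunk = audio_chunks.get()   # ~250 ms of audio, ready to stream
            # ... send `chunk` to the encoder / WebSocket here ...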

2️⃣ Encoding/Pre-processing 

After capture, the system encodes raw audio into a transmission-friendly format such as Opus or FLAC.

What can cause latency:

  • Slow or inefficient encoding libraries

  • Encoding large chunks at once instead of streaming

  • Extra processing like noise suppression or resampling

How to reduce it:

  • Use fast, lightweight codecs optimized for real-time (e.g., Opus)

  • Stream audio in parallel with encoding instead of waiting for the full chunk

  • Preconfigure the microphone input to match the target model specs (e.g., 16 kHz mono) to avoid resampling

3️⃣ Transport

The system transmits encoded audio to the STT model over the network.

What can cause latency:

  • High network round-trip time (RTT)

  • Unstable or slow user internet connection

  • Using polling HTTP instead of a real-time protocol

How to reduce it:

  • Use WebSockets or gRPC streaming for real-time audio transmission

  • Deploy the STT service in a region close to the user

  • Ensure users are on stable, low-latency networks where possible

4️⃣ Inference

The STT model receives the audio and transcribes it into text.

What can cause latency:

  • Large, complex models with high computational load

  • CPU-only inference on the server

  • Cold starts in serverless or autoscaled deployments

How to reduce it:

  • Use optimized models built for real-time performance

  • Run inference on GPUs or specialized accelerators

  • Keep servers warm to avoid cold starts, or use model streaming where transcription starts before full audio is received

5️⃣ Post-processing

The system formats the raw transcription, adding punctuation, casing, speaker labels (diarization), and more.

What can cause latency:

  • Complex formatting pipelines

  • Language model passes for punctuation restoration or diarization

  • Processing full segments instead of streaming words

How to reduce it:

  • Use lightweight or integrated postprocessing tools

  • Perform postprocessing incrementally or in a streaming fashion

  • Offload formatting tasks to the client if feasible

6️⃣ Output

The client device receives the post-processed transcript.

What can cause latency:

  • Delays in pushing the response from server to client

  • Packaging too much data into one response

  • Delayed rendering in the front-end

How to reduce it:

  • Stream partial transcriptions as they are generated (see the sketch after this list)

  • Use lightweight data structures for responses

  • Optimize UI to render text as it arrives, not wait for full sentences
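
As a sketch of the first point, here is a minimal way a client might render Deepgram’s streaming responses incrementally, printing interim hypotheses as they arrive and marking text once it is finalized. (It assumes interim_results=true was set on the connection and follows the response shape of Deepgram’s /v1/listen streaming messages.)

    import json

    def handle_message(raw: str) -> None:
        """Print interim hypotheses immediately and mark finalized text."""
        response = json.loads(raw)
        alternatives = response.get("channel", {}).get("alternatives", [{}])
        transcript = alternatives[0].get("transcript", "")
        if not transcript:
            return
        if response.get("is_final"):
            print(f"[final]   {transcript}")
        else:
            print(f"[interim] {transcript}")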

How Do You Measure Latency in a Speech-to-Text (STT) API?

Measuring latency in an STT API is not always straightforward. This is because latency can stem from multiple components, each influenced by factors like network speed, server response time, and model inference time. These components can vary from one interaction to the next.

When measuring latency in STT systems, it's useful to break it down into two parts:

  • Non-transcription latency: This includes all the time spent outside the actual STT conversion, for example, the time it takes for the request to travel to the server and for the server to respond.

  • Transcription latency: This is the time it takes the STT model to actually process the audio and generate the transcription.

Measuring non-transcription latency

Non-transcription latency can be measured using a simple curl command to check how long it takes to establish a connection with the STT API (in this case, Deepgram's server):
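
    curl -s -o /dev/null -w "time_total: %{time_total}\n" https://api.deepgram.com

(The -w flag prints curl’s built-in timing variables; here we only report time_total, the full request round trip.)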

This returns a result like:
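
    time_total: 0.016930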

This means it took 16.9 milliseconds (0.016930 s × 1000 ms/s) for the request to make the round trip to Deepgram’s server. (Keep in mind that this value may vary depending on your network speed, your CPU workload, and your proximity to Deepgram’s servers.)

Measuring total latency

To measure total latency, that is, how long it takes from sending the audio to receiving a transcription, we simulate a real-time stream using a Python script. The script streams a WAV audio file to Deepgram and records how long it takes to receive transcriptions in response.

You can use this sample audio file for testing:

➡️ Sample audio

You can download the script here:

📄 Measure STT latency script

Or you can work from the minimal sketch below and save it in a file called latency.py. (The sketch assumes Deepgram’s /v1/listen streaming WebSocket endpoint, a 16-bit PCM WAV input, and an API key exported as DEEPGRAM_API_KEY; adapt it as needed.)
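
    # latency.py - a minimal sketch for measuring end-to-end STT latency
    # against Deepgram's streaming WebSocket API.
    # Assumptions: the /v1/listen endpoint, a 16-bit PCM (linear16) WAV file,
    # and an API key in the DEEPGRAM_API_KEY environment variable.

    import asyncio
    import json
    import os
    import sys
    import time
    import wave

    import websockets

    CHUNK_MS = 250  # stream the file in 250 ms chunks to simulate live speech


    async def measure(path: str) -> None:
        api_key = os.environ["DEEPGRAM_API_KEY"]

        # Read the raw PCM frames and note the format so Deepgram can decode them.
        with wave.open(path, "rb") as wav:
            sample_rate = wav.getframerate()
            channels = wav.getnchannels()
            frames_per_chunk = int(sample_rate * CHUNK_MS / 1000)
            chunks = []
            while True:
                data = wav.readframes(frames_per_chunk)
                if not data:
                    break
                chunks.append(data)

        url = (
            "wss://api.deepgram.com/v1/listen"
            f"?encoding=linear16&sample_rate={sample_rate}&channels={channels}"
        )

        send_times = []  # wall-clock time each audio chunk was sent
        latencies = []   # audio sent -> transcript received

        # Older releases of the websockets library call this parameter
        # `extra_headers` instead of `additional_headers`.
        async with websockets.connect(
            url, additional_headers={"Authorization": f"Token {api_key}"}
        ) as ws:

            async def sender() -> None:
                for chunk in chunks:
                    send_times.append(time.time())
                    await ws.send(chunk)
                    await asyncio.sleep(CHUNK_MS / 1000)  # pace like a live mic
                # Tell Deepgram the audio is finished so it flushes final results.
                await ws.send(json.dumps({"type": "CloseStream"}))

            async def receiver() -> None:
                async for message in ws:
                    response = json.loads(message)
                    alternatives = response.get("channel", {}).get("alternatives", [{}])
                    transcript = alternatives[0].get("transcript", "")
                    if transcript and send_times:
                        latency = time.time() - send_times[-1]
                        latencies.append(latency)
                        print(f"{latency:.3f}s  {transcript}")

            await asyncio.gather(sender(), receiver())

        if latencies:
            print(f"Average total latency: {sum(latencies) / len(latencies):.3f} seconds")


    if __name__ == "__main__":
        if len(sys.argv) != 2:
            sys.exit("usage: python latency.py path/to/audio.wav")
        asyncio.run(measure(sys.argv[1]))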

The script only has one dependency, and that is the websockets library. You can install it via:
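
    pip install websockets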

Then you can run the script using a command like:
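
    DEEPGRAM_API_KEY=YOUR_API_KEY python latency.py path/to/sample.wav

(Adjust the invocation to match the downloaded script if you use that instead of the sketch above.)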

You should get an output like:
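
    Average total latency: 0.674 seconds

(With the sketch above, a line is also printed for each transcript as it arrives; the exact numbers will vary with your network, hardware, and audio.)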

Calculating transcription latency

We can now compute the transcription latency using the formula:

Transcription latency = Total latency - Non-transcription latency

From the example:

  • Total latency = 0.674 seconds

  • Non-transcription latency = 0.111 seconds

So:

Transcription latency = 0.674 s - 0.111 s = 0.563 s

This means the STT model takes approximately 563 milliseconds to generate a transcription on average. This shows that the model’s speed plays the most significant role in overall latency. That’s why you need a fast model, and that is where Nova-3 comes in.

🚀 Nova-3 streaming latency ≃ 300 ms.  Try it Free →

How Nova-3 Slashes Inference Latency

Nova-3 is the latest in Deepgram’s line of speech-to-text models. It combines low WER, real-time multilingual transcription, and cost efficiency, all while maintaining the blazing speed that made Nova-2 stand out.

When Nova-2 launched, it outpaced the competition with unmatched speed. Nova-3 builds on that foundation, introducing significant accuracy gains while retaining real-time performance. It is also the first model in its class to support multilingual transcription in real time, setting a new industry standard.

In benchmark tests, Nova-3 achieved a median WER of 6.84% on real-time audio streams from diverse, real-world datasets, a 54.2% improvement over the next-best alternative at 14.92%. By comparison, Nova-2 had just an 11% lead over similar models. This leap means Nova-3 doesn’t just improve accuracy, it redefines it.

With its sharp drop in inference latency and boost in transcription quality, Nova-3 delivers the best balance between speed and accuracy. Whether you're building virtual assistants, voice analytics tools, or real-time support systems, it gives you the edge.

Next Steps: Getting Started with Building Livestreaming Transcription Applications

Building a livestreaming transcription application can often feel overwhelming. From setting up microphone access on desktop environments to configuring WebSocket connections, there are quite a few moving parts. 

That's precisely why we developed a basic starter kit, designed to assist you in quickly establishing and developing transcription apps without any complications.

Once cloned, all you need to do is install portaudio and get your Deepgram API key. Then install the Python dependencies with:
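
Assuming the kit ships a standard requirements file:

    pip install -r requirements.txt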

The starter kit supports multiple input options, so you can stream audio in real time from:

  • A local file

  • Your microphone

  • A remote audio stream

Stream from a file

To stream a local audio file using the test_suite.py script:
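
    python test_suite.py -k YOUR_DEEPGRAM_API_KEY -i path/to/your_audio.wav

(The file path here is a placeholder, and passing a file path to -i is an assumption based on the mic example below; check the kit’s README for the exact flags.)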

This simulates real-time streaming by sending the file in chunks, as if it were being spoken live. However, for true real-time testing, it's best to use your microphone.

Stream from your microphone

To stream directly from your mic:

    python test_suite.py -k YOUR_DEEPGRAM_API_KEY -i mic

Make sure you have portaudio installed for microphone access.

Stream from a remote resource

You can also stream from a live audio source over the internet:
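
    python test_suite.py -k YOUR_DEEPGRAM_API_KEY -i http://stream.live.vc.bbcmedia.co.uk/bbc_world_service

(The URL shown is an illustrative public BBC live stream; substitute any live-streamed audio URL you have access to.)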

This will transcribe the BBC World News in real time as it plays. It works with any live-streamed audio source.

Conclusion: Understanding and Reducing Latency in Speech-to-Text APIs

Speed is one of the most important metrics when it comes to real-time transcription; it cannot be overlooked. The goal is simple: keep your latency as low as possible. 

While some parts of the latency stack might be out of your control, like the workload on your CPU when capturing audio or fluctuations in your network speed, there’s one thing you can control: the model you choose.

Choose a model that doesn’t force you to trade off speed for accuracy. Choose a model that gives you both. That model is Nova-3.

Have questions or need help selecting a model for your specific use case? Our product experts can guide you directly.

👉 Talk to a Product Expert

Unlock language AI at scale with an API call.

Get conversational intelligence with transcription and understanding on the world's best speech AI platform.