
Deepgram vs OpenAI vs Google STT: Accuracy, Latency, & Price Compared

Which speech-to-text AI reigns supreme: Deepgram, Google, or OpenAI? Check out this article to see what the data say!
By Brad Nikkel, AI Content Fellow

Start building an application that harnesses speech-to-text (STT), and you'll likely find yourself weighing the pros and cons of Deepgram’s, OpenAI’s, and Google’s STT models. This decision could be the crux of your app because it impacts your development, your users' experience, and, ultimately, your application's success. What’s the “best” STT API? Frustrating as it is, the answer is, “It depends.” Deepgram’s, OpenAI’s, and Google’s models each come with trade-offs whose weight depends on your specific app’s requirements. Still, to get an overall sense of how each performs relative to the others, it’s helpful to look at providers’ claimed performance metrics and third-party benchmarks, which is what we’ll do in this article.

TL;DR

  • Accuracy: OpenAI claims their models lead in accuracy tests, particularly for certain languages. Deepgram claims their models handle specialized terminology and challenging audio conditions well. Accuracy results, however, vary significantly in different benchmarks.

  • Speed: Deepgram dominates speed benchmarks, especially at streaming. Stream processing trades some accuracy for speed for all providers.

  • Price: Deepgram is the cheapest at $4.30 per 1000 minutes (Nova-3). OpenAI is considerably pricier at $6.00 per 1000 minutes (Whisper Large v2 and GPT-4o Transcribe), and Google is far costlier at $16.00 per 1000 minutes (Chirp 2). Deepgram’s and Google’s enterprise volume discounts can significantly reduce prices.

The Main STT Metrics That Matter

Great STT is about more than accurately mapping audio to words. The key metrics for picking an ideal STT model for your specific application are

  • Transcription quality: STT accuracy metrics measure how well audio clips match their transcriptions, which helps you estimate how usable transcribed text might be for downstream processes. Accuracy is often measured through Word Error Rate (WER) and sometimes character error rate (suited for multilingual datasets). 

  • Speed: Two key parts of STT speed are latency (delay until you receive first results) and throughput (how quickly large volumes of audio get processed). Other STT speed metrics, like time-to-first token or tokens processed per second, can also be useful. Real-time applications, like subtitle apps, often demand low latency, while other applications, say podcast transcribers, might allow for batched transcription that prioritizes throughput.

  • Cost: STT is typically priced per minute of audio transcribed. Some providers offer volume discounts (e.g., Deepgram and Google).

Of course, benchmarks don't provide a complete picture, so developers should also take other factors into account. For example, the presence or absence of the following can qualify or disqualify specific models as candidates for your application:

  • Model-Specific Features: Specialized capabilities like speaker identification (diarization), language detection, custom vocabulary support, etc.

  • Business factors: Cost, deployment options, integration complexity (e.g., API design, SDKs, etc.), and compliance and privacy features (e.g., SOC 2, HIPAA, GDPR) all affect the total investment.

Some metrics are interdependent variables; boost one, and you blunt another. Accuracy and speed are a strong example of this. So view things as trade-offs as we investigate the performance of Deepgram’s, OpenAI’s, and Google’s models across different STT metrics. If you’re already familiar with WER and latency (and their limitations), feel free to skip ahead to the benchmarks. Otherwise, let’s look closer at how STT accuracy is typically measured and this method’s limitations.

Word Error Rate as a Measure of Accuracy

The industry usually measures STT accuracy with Word Error Rate (WER), which calculates the difference between the transcribed text and a reference transcript like this:

WER = (substitutions + insertions + deletions) / total reference words

Lower WER indicates higher accuracy. A 10% WER, for example, means one in ten words is incorrect (substituted), added (inserted), or missing (deleted). While useful, WER can be misleading because some WER calculations count formatting differences as errors while others normalize them away, and even among normalized WERs, the normalization rules differ.
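To make the formula concrete, here’s a minimal Python sketch that scores a hypothesis against a reference transcript with a word-level edit distance. It’s an illustrative implementation (with no text normalization), not any vendor’s scoring code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / total reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance (Levenshtein) over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions turn the first i reference words into nothing
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions turn nothing into the first j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[-1][-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167: one deletion over six reference words
```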

In a study of multiple STT providers, Kuhn et al. (2024) found that applying extensive text normalization significantly reduced the average WER across all the vendors that they tested, bringing WERs from 21.83% (unnormalized) to 10.16% (normalized). Voice note app developer Voicewriter.io, for example, found that relying solely on raw WER overlooked punctuation, capitalization, number formatting (e.g., "two" vs. "2"), and minor spelling variations (e.g., "color" vs. "colour"), all important nuances for their app. To address this, they made their own test, which we’ll look at later. Generally, developers should decide between using

  • Formatted WER: Calculates errors including differences in punctuation, capitalization, etc. Consider using formatted WER for applications where a text’s immediate presentation matters for the user (e.g., a note-taking app). Also, consider, though, that LLMs can clean up poorly formatted transcriptions rather cheaply and accurately.

  • Unformatted (Normalized) WER: Calculates errors after standardizing the text, typically by removing punctuation, lowercasing everything, converting numerals to words (or vice versa), normalizing regional spellings, etc. This measure focuses on core word recognition.

It helps to see an example of this. Here’s a quick one:

  • Reference Text: "Okay, let's book flight one-oh-one for $350.50 departing Tues., Aug 20th."

  • Transcribed Text: "ok, lets book flight 101 for three hundred fifty dollars and 50 cents departing tuesday august twentieth"

For the above example, formatted WER would be higher than raw WER because the reference and transcription differ extensively in capitalization, punctuation, number/currency formatting, and abbreviations. Formatted WER would thus convey this transcription's poor ability to capture structure (in addition to meaning). Unformatted WER, after normalizing both strings, would only penalize the remaining word-level differences, yielding a lower score that measures underlying word recognition.
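To see how much normalization moves the number, here’s a rough sketch that reuses the word_error_rate function above and applies a toy normalizer (lowercasing and punctuation stripping only) before scoring. Production normalizers also expand numerals, currencies, dates, and abbreviations, so treat this as an illustration of the mechanism rather than a faithful scoring pipeline:

```python
import re

def normalize(text: str) -> str:
    """Toy normalizer: lowercase and strip punctuation.
    Full normalizers also expand numerals, currencies, dates, and abbreviations."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

reference = "Okay, let's book flight one-oh-one for $350.50 departing Tues., Aug 20th."
hypothesis = ("ok, lets book flight 101 for three hundred fifty dollars "
              "and 50 cents departing tuesday august twentieth")

# Formatted WER counts every case, punctuation, and formatting difference as an error.
print("formatted:  ", word_error_rate(reference, hypothesis))
# Unformatted WER scores the normalized strings, so case and punctuation no longer count
# (a full normalizer would also reconcile the numbers and dates, lowering it further).
print("unformatted:", word_error_rate(normalize(reference), normalize(hypothesis)))
```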

Not considering the type of WER used can devolve into a "comparing apples to oranges" scenario. For instance, Artificial Analysis specifies that their STT benchmarks use unformatted WER, focusing on raw word accuracy, a helpful detail. OpenAI, on the other hand, recently benchmarked leading STT models using WER and the FLEURS dataset, claiming that their own GPT-4o Transcribe tended to earn the lowest WER. But OpenAI hasn’t yet specified their normalization method nor whether they used formatted or unformatted WER, creating some confusion for those attempting to replicate their results and making it tough for us to compare their results with other benchmarks. Voicewriter.io took the ideal approach, measuring formatted and unformatted WER. 

STT Benchmark Limitations

Differing WER measures are one of several reasons that benchmarking STT models is tricky and imperfect. Other reasons include

  1. Vendors using closed-weight models that are frequently updated

  2. Many popular test sets using short audio clips, which creates benchmarks that don’t judge STT models' performance on longer audio clips or on formatting

  3. Some test sets using uncharacteristically clean audio or outdated linguistic styles that can skew results

  4. Benchmark results varying significantly by dataset, language, and transcription mode

Voicewriter.io, for example, found that gauging robust performance for their app’s use case required testing on diverse conditions, including audio with ample background noise, non-native accents, and domain-specific jargon. And Kuhn et al.’s (2024) multi-model study found that each model’s performance varied significantly across different audio clips (e.g., seven different vendors held first place on at least one of the 30 clips tested).

The takeaway? Never trust a lone benchmark. Instead, consider many benchmarks and test all candidate STT models on audio representative of your specific use case.

Latency: Sometimes Fast STT Matters

Your app’s use case will also determine how important speed is. Measuring STT speed involves several factors, but, typically, the most impactful is whether you process your audio files all at once and then wait for the results (batch processing) or process them in near real time (streaming).

Batch versus Stream Processing

These two processing modes trade accuracy against speed (a short code sketch contrasting them follows the list below).

  • Batch processing: An entire audio file (or set of audio files) is processed as a complete unit, which is usually more accurate because it gives models more context but is also slower because it requires waiting for the full file (or set of files) to be processed before returning the transcriptions. Generating transcripts for podcasts is a use case that can often wait for batch processing.

  • Stream processing: Audio is processed in small chunks, which typically yields lower accuracy (relative to batch processing) but transcribes in near real-time. Captioning live videos is a use case that typically requires stream processing.
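To illustrate how the two modes differ in application code, here’s a hedged Python sketch. The FakeSTTClient and its methods are hypothetical stand-ins, not any vendor’s SDK; Deepgram, OpenAI, and Google each expose their own batch and streaming interfaces, so check their docs for real method names and parameters:

```python
from typing import Iterator

class FakeSTTClient:
    """Hypothetical client for illustration only -- not a real provider SDK."""

    def transcribe_batch(self, audio_path: str) -> str:
        """Send the whole file, wait, and return the full transcript.
        More context per prediction -> usually higher accuracy, higher latency."""
        return f"(full transcript of {audio_path})"  # placeholder result

    def transcribe_stream(self, chunks: Iterator[bytes]) -> Iterator[str]:
        """Send small chunks as they arrive and yield partial transcripts.
        Less context per prediction -> usually lower accuracy, near real-time output."""
        for i, _chunk in enumerate(chunks):
            yield f"(partial transcript after chunk {i})"  # placeholder partials

client = FakeSTTClient()

# Batch: fine for a podcast transcript produced after the episode is recorded.
print(client.transcribe_batch("episode_042.wav"))

# Streaming: needed for live captions, where words must appear as they're spoken.
fake_mic = (b"\x00" * 3200 for _ in range(3))  # stand-in for ~100 ms audio frames
for partial in client.transcribe_stream(fake_mic):
    print(partial)
```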

A Few More STT Latency Factors

Three more latency metrics that developers sometimes consider are

  1. Time to First Token (TTFT): How quickly the first word appears (an important UX factor for interactive applications).

  2. Final Result Delay: When the complete transcription becomes available (important for workflows downstream of the transcription)

  3. Processing Speed Factor: Often seconds of audio transcribed per second of waiting (affects throughput for apps processing large volumes of audio)

The "speed factor" metric reported by Artificial Analysis’ benchmarks, which you’ll see soon, measures this last point — audio duration divided by processing time. For them, a “speed factor” of 100x, for example, means 10 minutes (600 seconds) of audio is processed in just 6 seconds.
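If you want to measure these yourself, here’s a small, provider-agnostic sketch that wraps any iterator of streaming partial transcripts (like the hypothetical client above) and reports all three metrics. It’s illustrative rather than an official benchmarking harness:

```python
import time

def measure_stream_latency(partials, audio_duration_s: float):
    """Consume an iterator of streaming partial transcripts and report the
    three speed metrics above."""
    start = time.monotonic()
    ttft = None
    for _partial in partials:
        if ttft is None:
            ttft = time.monotonic() - start        # 1. time to first token
    final_delay = time.monotonic() - start          # 2. final result delay
    speed_factor = audio_duration_s / final_delay   # 3. audio seconds per processing second
    return ttft, final_delay, speed_factor

# Example: 600 s of audio whose stream finishes in ~6 s yields a ~100x speed factor.
```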

STT Speed versus Accuracy Tradeoffs

Fast STT is often at odds with better STT accuracy or formatting. 

  1. Speed versus Accuracy: Kuhn et al.’s (2024) study of several STT providers found that streaming transcription APIs typically suffer significantly higher WERs than batch processing from the same provider and model (9.37% WER for batch versus 10.9% for streaming). Batch processing’s advantage likely stems from streaming mode predicting with less context than batch processing affords. Independent tests, including Voicewriter.io’s, also found that streamed transcriptions tend to produce more formatting and sentence-fragmentation errors than batch processing.

  2. Diarization versus Speed: Diarization (identifying different speakers in a conversation) often slows processing speed. Deepgram's internal tests, for example, clocked Nova-3 as processing audio up to 40 times faster than competitors when diarization was enabled.

Head-to-Head: Benchmarking Deepgram vs. OpenAI vs. Google

Keep the above caveats in mind as we now look at some benchmarks. We will alternate between discussing independent benchmarks and provider benchmarks, highlighting the strengths and shortcomings of each as we proceed. First up is a regularly updated, third-party STT benchmark.

Artificial Analysis’ STT Benchmarks (Using Raw WER, Batch Processing, Common Voice Test Set, and Short Audio Clips)

The independent benchmark organization Artificial Analysis routinely tests 27 STT models (at the time of this article) on thousands of short clips from the public Mozilla Common Voice v16.1 dataset using batch processing and raw WER (normalized text via lowercasing and punctuation removal), which you’ll recall measures core word recognition by ignoring format. 

We’ll later see a few benchmarks that developed test sets with longer average audio clips (often greater than a minute) because their creators found many existing benchmarks’ test clips unrealistically short for their use cases (e.g., the Common Voice test set that Artificial Analysis used comprises mostly clips under 10 seconds long). When we filter Artificial Analysis’ results to show only Deepgram’s, OpenAI’s, and Google’s STT models, we get the following:

You can see that Artificial Analysis ranked Deepgram’s Nova-3 as the fastest (and cheapest), OpenAI’s GPT-4o Transcribe as the most accurate (though significantly costlier and slower than Nova-3), and Google’s Chirp 2 as the most expensive and second most accurate. Below are the individual metrics.

If your app tends to use audio clips longer than 10 seconds or if your app requires streaming STT, don’t put much stock in Artificial Analysis’ benchmarks without first developing your own test set with longer audio clips (ideally around your average length that your app processes) and testing your audio with streaming STT models.

OpenAI’s Benchmarks (Likely Using Raw WER, Batch Processing, Multilingual FLEURS Test Set, and Short Audio Clips)

Many older benchmarks compare Deepgram vs. Whisper, but you probably noticed a newer OpenAI model above. That’s because OpenAI recently released GPT-4o Transcribe along with a blog post comparing it against OpenAI’s older Whisper Large v2 and competitors’ models on the 102-language FLEURS dataset, whose standard scoring pipeline runs reference and hypothesis texts through a “Whisper normalizer” that strips punctuation, lowercases words, and normalizes numbers (i.e., raw WER). OpenAI’s initial write-up doesn’t specify whether they borrowed FLEURS’ batch processing and normalization method, but if they did, they would have used raw WER and tested on ~283 hours of audio (an average of ~2.8 hours of evaluation audio per language).

Like Common Voice, FLEURS test clips are short, never exceeding 30 seconds, with most languages’ average clip duration well under 15 seconds (e.g., English at 9.79 seconds, Russian at 11.46, and Mandarin Chinese at 10.37). So if your app tends to ingest audio clips longer than 10 seconds, run your own tests on longer clips.

OpenAI found that GPT-4o Transcribe consistently scored lower WER than Deepgram’s Nova-3 and Google’s Gemini 2.0 Flash across most languages.

Voicewriter.io’s Custom Benchmark (Using Batch & Stream Processing, Raw & Formatted WERs, and Custom, Long Audio Clips)

Frustrated by some popular, broad benchmarks’ shortcomings for their needs, Voicewriter.io designed their own STT benchmark suited to their note-taking app by

  1. Developing a custom dataset composed of longer clips (1-2 minutes) of varied conditions (clean TED talks, noisy TED talks, accented Wikipedia readings, and highly technical specialist text-to-speech audio), since many existing benchmarks used unrealistically short audio clips (e.g., the Common Voice set that Artificial Analysis used and the FLEURS set that OpenAI used) or audio clips so ubiquitous that models likely unintentionally trained on them.

  2. Testing both raw WER (normalized) and formatted WER (unnormalized), because formatted transcriptions mattered for their app (some developers prefer treating raw transcription as one step and formatting as a second step, often by feeding the raw transcription to an LLM).

  3. Testing batch and streaming performance.

Voicewriter.io’s overall batched raw and formatted WERs with prices are below:

Voicewriter.io’s Mean Raw and Formatted WER (lower = better) for Batch-Processed Overall Audio (clean, noisy, accented, and specialist audio)

Note how, for all the models that Voicewriter.io tested on batch-processed audio, the raw WERs were consistently lower than the formatted WERs, confirming the pattern others have noted. Below, in the streaming results, you'll notice another pattern.

Voicewriter.io’s Mean Raw and Formatted WER (lower = better) for Streaming-Processed Overall Audio (clean, noisy, accented, and specialist)

Voicewriter.io’s tests also confirm others’ findings that streaming transcription tends to worsen WER compared to batch processing.

Their tests allowed Voicewriter.io to assess models’ varying performance on formatted versus raw WER, stream versus batch processing, and clean versus noisy versus accented versus specialist (highly technical) speech. Using batch processing and formatted WER across the overall audio set (clean, noisy, accented, and specialist speech), OpenAI's GPT-4o Transcribe led, followed by Google's Gemini models, with Deepgram further down.

Voicewriter.io’s Mean Raw WER (lower = better) for Specialist (Highly Technical) Audio

Voicewriter.io’s category breakdowns reveal models’ varying strengths in different areas. For instance, for specialist audio and raw WER, Deepgram achieved the lowest WER (5.8%), ahead of Gemini Pro (5.9%) and GPT-4o (6.7%), suggesting Deepgram has an advantage with highly technical language where formatting isn't critical. Voicewriter.io also found Deepgram had the lowest WER in streaming mode. Voicewriter.io’s tests offer a practical approach that developers could repurpose for their own apps. Keep in mind, though, that Voicewriter.io’s total test set was only around 30 minutes of audio.

Deepgram’s Internal Benchmarks

Deepgram's internal benchmarks for Nova-3 used an approach focused on enterprise use cases, testing Nova-3 on what appears to be a significantly larger and more diverse dataset than the other test sets we’ve discussed: 81.69 hours (4,901.4 minutes) across 2,703 files, including audio from nine challenging domains (e.g., air traffic control, medical, finance, drive-thru, etc.). Like Voicewriter.io’s, Deepgram’s test audio segments were longer than those in some benchmarks (i.e., Common Voice and FLEURS) and thus more realistic, averaging 1.81 minutes per clip. Deepgram’s tests found that Nova-3 achieved a 6.84% median WER for streaming and 5.26% for batch processing.

Deepgram’s Internal Test of Streaming Word Error Rate (WER) (Note: OpenAI’s Whisper Large v2 doesn’t have stream processing and OpenAI’s GPT-4o Transcribe was not yet released.)

Deepgram’s Internal Test of Batched Word Error Rate (WER) (Note: OpenAI’s GPT-4o Transcribe was not yet released)

STT Benchmark Takeaways

The variation you observed across the above benchmarks vividly illustrates that

  • Test Datasets Matter: Performance measurements heavily depend on the type of audio used (short vs. long clips, clean vs. noisy, general vs. specialist, accents vs. no accents, etc.).

  • Metrics Matter: Raw (normalized) WER focuses on core word accuracy, while formatted (unnormalized) WER reflects readability and presentation. Expect that the choice of which one you test will significantly change your results.

  • Provider Claims Often Differ From Third-Party Tests: Internal benchmarks (like those shared in Deepgram's Nova-3 announcement or in OpenAI's Transcribe announcement) often use different types and amounts of test data and different test methods than independent benchmarkers, so consider both and each actor’s interests.

  • Internal Testing is Vital: Agreement between different entities' benchmarks is shaky, meaning there’s no single "best" STT API. Test candidate STT APIs with audio representative of your specific use case and evaluate based on the metrics most important to you (e.g., raw accuracy, formatting, speed, cost, diarization, technical vocabulary, etc.).

Domain-Specific Performance

Different domains’ average audio can vary considerably, meaning a model’s performance can shift dramatically depending on

  • Industry-specific terminology (e.g., medical and legal transcription often requires specialized and updatable vocabulary, a feature that Deepgram provides)

  • Audio quality (e.g., average background noise, microphone distance, overlapping speech, etc. vary by domain)

  • Speaker variety (Average accents, dialects, speech impediments, non-native speakers, etc. can also vary by domain)

Because of this, providers often gear their benchmarks to specific domains. Deepgram’s and Google's specialized models (e.g., medical) provide domain-specific advantages. Independent benchmarks sometimes focus on these. For example, AIMultiple ran a small healthcare-focused STT test using medical terminology and patient interaction audio and found that Deepgram’s Nova-2 best handled their specific tests. Similarly, Voicewriter.io’s benchmark found Deepgram’s Nova-3 earned the lowest raw WER on highly technical speech. Before crafting your own domain-specific tests, check for existing benchmarks for audio related to your domain and either use their results as a guide or adapt their test to your data.

Cost Considerations

The last and easiest metric to evaluate is price. You don’t need a benchmark for this, of course, but it’s helpful to compare what speed or accuracy you get for what you pay, something that Artificial Analysis helpfully plotted.

Speed versus Price

Deepgram's Nova-3 topped Artificial Analysis' speed-versus-price plot (~160 seconds of audio transcribed per second of processing at $4.30 per 1000 minutes of audio transcribed). OpenAI's Whisper Large v2 and GPT-4o Transcribe were considerably costlier and slower (~35 and ~40 seconds of audio transcribed per second of processing, respectively, at $6 per 1000 minutes), and Google's Chirp 2 was much costlier than Deepgram or OpenAI (~70 seconds of audio transcribed per second of processing at $16 per 1000 minutes).

Accuracy versus Price

Artificial Analysis found Google’s Chirp 2 to have the worst accuracy for the price (9.8% WER on Common Voice v16.1 at $16 per 1000 minutes of audio transcribed). GPT-4o Transcribe earned 8.9% WER at $6 per 1000 minutes, and Nova-3 scored 12.8% WER at $4.30 per 1000 minutes.

Volume Discounts

Usage volume can affect costs. Deepgram and Google both offer tiered pricing, giving customers who purchase above specific thresholds cheaper per-minute rates. Google, for example, charges customers who process at least 2 million minutes of audio per month $4.00 per 1000 minutes (compared to their standard $16.00 per 1000 minutes). Deepgram's higher-usage pricing is $3.60 per 1000 minutes (compared to their standard $4.30 per 1000 minutes).
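Here’s a quick back-of-the-envelope comparison using the list and volume rates cited above; always confirm current pricing and discount thresholds directly with each provider:

```python
def monthly_cost(minutes: float, rate_per_1000_min: float) -> float:
    """Simple linear cost: minutes transcribed per month at a per-1000-minute rate."""
    return minutes / 1000 * rate_per_1000_min

minutes = 2_000_000  # e.g., 2M minutes/month, the Google discount threshold cited above
print(f"Deepgram Nova-3 at volume ($3.60/1000 min): ${monthly_cost(minutes, 3.60):>9,.0f}")
print(f"OpenAI ($6.00/1000 min):                    ${monthly_cost(minutes, 6.00):>9,.0f}")
print(f"Google Chirp 2 at volume ($4.00/1000 min):  ${monthly_cost(minutes, 4.00):>9,.0f}")
# -> roughly $7,200 vs. $12,000 vs. $8,000 per month at this volume.
```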

Hidden Costs

Beyond per-minute pricing, don’t forget to consider

  • Developer Time: Integration complexity and maintenance costs (more complex APIs require more developer hours)

  • Infrastructure: Self-hosting incurs hardware and energy costs (particularly relevant for on-premises deployments)

  • Error Correction: Human-in-the-loop machine transcription incurs labor costs for fixing transcription errors (e.g., if you transcribe thousands of hours of audio per month, paying twice as much per minute for a 3% WER improvement might be justified if it saves enough human review time and labor costs; see the break-even sketch below).
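To make that last point concrete, here’s a rough break-even sketch. The words-per-minute and review-cost figures are illustrative assumptions, not measured values, so plug in your own numbers:

```python
# Illustrative assumptions -- replace with your own measured numbers.
MINUTES_PER_MONTH = 100_000
WORDS_PER_MINUTE = 150          # rough speaking rate
REVIEW_COST_PER_ERROR = 0.02    # dollars of human time to find and fix one bad word

def total_monthly_cost(rate_per_1000_min: float, wer: float) -> float:
    """API spend plus the human cost of correcting the errors the model leaves behind."""
    api_cost = MINUTES_PER_MONTH / 1000 * rate_per_1000_min
    error_count = MINUTES_PER_MONTH * WORDS_PER_MINUTE * wer
    return api_cost + error_count * REVIEW_COST_PER_ERROR

print(total_monthly_cost(4.30, 0.12))  # cheaper API, 12% WER  -> ~$36,430
print(total_monthly_cost(8.60, 0.09))  # 2x the price, 9% WER  -> ~$27,860
# Under these assumptions, the pricier-but-more-accurate API is cheaper overall.
```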

Choose the Best STT for Your Application

How do you actually translate all these metrics into a decision? Weight metrics based on their importance to your app’s specific requirements. You might value streaming over batch processing, Estonian over English, cost over accuracy, or structured accuracy over raw accuracy. Or it might be something else entirely that tips the scales. 
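One lightweight way to formalize that weighting is a simple decision matrix. The weights, providers, and scores below are placeholders; replace them with your own priorities and measured results:

```python
# Placeholder weights (summing to 1) and 0-10 scores -- fill in your own.
weights = {"accuracy": 0.4, "streaming_latency": 0.3, "price": 0.2, "features": 0.1}

scores = {
    "provider_a": {"accuracy": 8, "streaming_latency": 9, "price": 9, "features": 8},
    "provider_b": {"accuracy": 9, "streaming_latency": 6, "price": 7, "features": 7},
    "provider_c": {"accuracy": 8, "streaming_latency": 7, "price": 4, "features": 8},
}

for provider, s in scores.items():
    weighted = sum(weights[metric] * s[metric] for metric in weights)
    print(f"{provider}: {weighted:.2f}")  # highest weighted score wins for YOUR priorities
```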

General Best-Fit Scenarios

There’s no way around testing for yourself, but here are some rough, vendor-specific rules of thumb. For voice bots, live captioning, and other real-time apps, Deepgram's combination of streaming speed, accuracy, robustness, and cost stands out. For batch transcription (e.g., of podcasts or meetings), OpenAI's newer GPT-4o Transcribe claims strong accuracy at a reasonable price, particularly for certain languages. Google's Chirp 2 might work well for developers already tightly integrated into Google Cloud (though using Deepgram’s or OpenAI’s STT APIs is simple enough to negate some of Google’s ecosystem advantage) but comes with high costs.

Hybrid Approaches

You might find that using multiple providers offers the best results. Your app, for example, might benefit from harnessing

  • Different providers for different languages (especially considering often uneven performance across languages shown in benchmarks like FLEURS)

  • A/B testing to optimize for specific audio categories (e.g., Deepgram for long, noisy, or medical-related audio; OpenAI’s GPT-4o Transcribe for short, clean, general audio)

This approach adds complexity but can leverage each provider’s strengths, as the routing sketch below illustrates.
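Here’s a hedged sketch of what that routing could look like; the provider functions and thresholds are hypothetical placeholders, and your own A/B results should drive the branching:

```python
# Hypothetical stand-ins for real SDK calls to your chosen providers.
def transcribe_with_provider_a(audio_path: str) -> str: return "..."
def transcribe_with_provider_b(audio_path: str) -> str: return "..."
def transcribe_with_provider_c(audio_path: str) -> str: return "..."

def route_transcription(audio_path: str, language: str, duration_s: float, noisy: bool) -> str:
    """Send each clip to whichever provider tested best for its category."""
    if noisy or duration_s > 120:
        return transcribe_with_provider_a(audio_path)  # your long/noisy-audio winner
    if language != "en":
        return transcribe_with_provider_b(audio_path)  # your best scorer for this language
    return transcribe_with_provider_c(audio_path)      # short, clean, general audio

print(route_transcription("clinic_note.wav", language="en", duration_s=300, noisy=True))
```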

What to Evaluate in an STT Model

Different STT providers focus on advancing different aspects of their models. Benchmarks can help you gauge different models' general relative strengths and weaknesses, but they might not apply to your specific average audio, so test Deepgram’s, OpenAI’s, and Google’s STT models on audio representative of your specific use case. When doing this, consider

  • Acoustic challenges of your audio: Different models handle background noise, accents, and audio quality differently.

  • Latency requirements: Do you need real-time results (streaming), or is batch processing acceptable?

  • Acceptable accuracy threshold: Figure out an acceptable WER for your application 

  • Budget constraints: Consider per-minute costs and expected total volume

  • Deployment requirements: Do you have specific cloud, on-premise, or data residency needs?

  • Special features needed: Do you require diarization (speaker identification), real-time redaction, or multilingual support?

Never accept vendors’ claims (including ours) at face value. At a minimum, systematically evaluate candidates by

  1. Selecting representative audio samples reflecting your application’s use case

  2. Processing each sample through every candidate API

  3. Comparing each model’s accuracy, speed, and formatting quality (weighing each metric by how important it is for your app)

  4. Estimating the full cost per model (API costs + average human review or correction needed); a skeleton harness for these steps follows below
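Here’s a skeleton of those four steps in Python. The sample clips and provider wrappers are hypothetical placeholders, and word_error_rate is the toy scorer sketched earlier; extend it with latency timing and cost estimates as needed:

```python
# Hypothetical test set -- use clips that mirror YOUR lengths, accents, noise, and jargon.
samples = [
    {"audio": "support_call_01.wav", "reference": "thanks for calling how can i help you today"},
    {"audio": "clinic_note_07.wav", "reference": "patient reports mild dyspnea on exertion"},
]

# Hypothetical provider wrappers -- replace each lambda with a real SDK/API call.
providers = {
    "provider_a": lambda path: "placeholder transcript",
    "provider_b": lambda path: "placeholder transcript",
}

for name, transcribe in providers.items():
    wers = [word_error_rate(s["reference"], transcribe(s["audio"])) for s in samples]
    print(f"{name}: mean WER {sum(wers) / len(wers):.1%} over {len(samples)} clips")
```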

It’s not easy to do, and it’s something you’ll need to routinely revisit as providers release new models, but this type of comprehensive evaluation is worth the effort because it ensures you’ll pick the STT API that best serves your specific app, delivering a balance of accuracy, speed, and cost for your users. Best of luck on your testing and decision, and reach out to Deepgram if you have any questions about how we benchmarked Nova-3 against Google’s, OpenAI’s, and others’ STT models.
