Dynamic Range Compression for Voice AI

Dynamic range compression is one of the most misunderstood preprocessing steps in voice AI. Applied correctly, it can improve transcription accuracy in narrow scenarios. Applied incorrectly, it degrades the signal your ASR model needs. Most online resources treat DRC as pure audio engineering theory or a consumer tool setting. They rarely connect it to production word error rate outcomes.

This article gives you a production decision framework. You'll learn when DRC helps ASR accuracy, when it hurts, what parameter ranges to use, and how modern models already handle level variation internally. The short version: you probably don't need DRC, and the widely cited statistics claiming otherwise are fabricated.

Key Takeaways

Key research findings on DRC for voice AI:

No major ASR provider recommends applying external DRC.
The commonly cited "18–23% WER reduction" figures are fabricated. They trace to a LoRA paper unrelated to audio preprocessing.
Aggressive compression strips prosodic cues that diarization and sentiment models rely on.
If you have a documented level-variation problem, use conservative settings: 1.5:1–2:1 ratio, under 6 dB gain reduction.
Always A/B test DRC against unprocessed audio on your specific ASR provider before deploying.

What DRC Does in a Voice AI Pipeline

Most pipelines don't need DRC. Use it only when you've verified a level-variation problem that your model or provider can't already absorb.

How DRC Works at the Signal Level

A compressor monitors your audio's amplitude in real time. When the signal crosses a set threshold, the compressor reduces the gain by a defined ratio. With a 2:1 ratio and a -12 dBFS threshold, every 2 dB of input level above the threshold produces only 1 dB of output level above it, so a peak at -8 dBFS comes out at -10 dBFS. Attack time controls how fast compression kicks in once the signal crosses the threshold. Release time controls how quickly the gain recovers after the signal falls back below it.
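
The threshold-and-ratio arithmetic can be sketched in a few lines. This is a minimal illustration of the static (steady-state) curve of a hard-knee downward compressor only. It ignores attack, release, and knee shaping, and the function names are made up for this example.

```python
def compressed_level_db(input_db: float, threshold_db: float = -12.0,
                        ratio: float = 2.0) -> float:
    """Steady-state output level of a hard-knee downward compressor."""
    if input_db <= threshold_db:
        return input_db  # below the threshold the signal passes unchanged
    # Above the threshold, the overshoot is divided by the ratio.
    return threshold_db + (input_db - threshold_db) / ratio


def gain_reduction_db(input_db: float, threshold_db: float = -12.0,
                      ratio: float = 2.0) -> float:
    """How many dB the compressor removes at this input level."""
    return input_db - compressed_level_db(input_db, threshold_db, ratio)


for level in (-20.0, -12.0, -8.0, 0.0):
    print(level, compressed_level_db(level), gain_reduction_db(level))
```

With the defaults, an input at -8 dBFS maps to -10 dBFS (2 dB of gain reduction), and a full-scale peak at 0 dBFS maps to -6 dBFS (6 dB of gain reduction).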

Here's what most voice AI tutorials miss. Amplitude normalization scales the whole waveform by one constant gain to hit a target peak level. DRC is a nonlinear, time-varying gain function. They solve different problems: normalization leaves the gap between loud and quiet passages untouched, while compression narrows it. When ASR research references "normalization," it typically means feature-space techniques like cepstral mean normalization, not threshold-based compression.
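
To make the contrast concrete, here is a small sketch of peak normalization on a synthetic signal. The two sine segments stand in for a quiet and a loud talker, and the -1 dBFS target and helper names are assumptions for illustration. Because one constant gain is applied to every sample, the level gap between the segments is the same before and after.

```python
import math


def peak_normalize(samples, target_peak_dbfs=-1.0):
    """Scale every sample by one constant gain so the peak hits the target."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)  # digital silence: nothing to scale
    gain = 10.0 ** (target_peak_dbfs / 20.0) / peak
    return [s * gain for s in samples]


def rms_dbfs(samples):
    """RMS level in dB relative to full scale (assumes non-silent input)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


SR = 16000  # assumed sample rate
quiet = [0.01 * math.sin(2 * math.pi * 200 * n / SR) for n in range(1600)]
loud = [0.5 * math.sin(2 * math.pi * 200 * n / SR) for n in range(1600)]

gap_before = rms_dbfs(loud) - rms_dbfs(quiet)
normalized = peak_normalize(quiet + loud)
gap_after = rms_dbfs(normalized[1600:]) - rms_dbfs(normalized[:1600])
print(round(gap_before, 2), round(gap_after, 2))
```

Both gaps come out at roughly 34 dB. A compressor with enough gain reduction would shrink that number, which is exactly the behavior that separates DRC from normalization.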

Where DRC Sits in a Production Audio Pipeline

In a typical voice pipeline, audio flows from capture through preprocessing. Then it goes to your speech-to-text API. If you use DRC at all, it sits between echo cancellation and the API call. Order matters, and if you've ever debugged echo artifacts, you know why. Echo cancellation should always come first, because an adaptive canceller models a stable path from speaker to microphone, and a compressor placed ahead of it changes the gain from moment to moment. Platform-level AEC also has direct access to both microphone and speaker output for proper time alignment.

For real-time pipelines, audio preprocessing typically consumes 25–50 ms. That includes echo cancellation plus any additional processing. Adding look-ahead DRC increases that budget by the lookahead window duration. It adds latency millisecond for millisecond.
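
The latency arithmetic is simple enough to sketch. The 16 kHz sample rate and the 5 ms lookahead below are illustrative assumptions, and the function names are hypothetical.

```python
def lookahead_delay_samples(lookahead_ms: float, sample_rate_hz: int) -> int:
    """Samples a look-ahead compressor must buffer before emitting output."""
    return round(lookahead_ms * sample_rate_hz / 1000.0)


def stage_latency_ms(base_preprocessing_ms: float,
                     lookahead_ms: float = 0.0) -> float:
    """Preprocessing latency once the lookahead window is added on top."""
    return base_preprocessing_ms + lookahead_ms


# Both ends of the 25-50 ms preprocessing range with a 5 ms lookahead.
for base_ms in (25.0, 50.0):
    print(base_ms, stage_latency_ms(base_ms, lookahead_ms=5.0))

print(lookahead_delay_samples(5.0, 16000))
```

At 16 kHz, a 5 ms lookahead means holding back 80 samples, which moves a 25 ms stage to 30 ms and a 50 ms stage to 55 ms.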

Client-Side vs. Server-Side Normalization

If your provider already handles gain internally, external DRC is usually unnecessary. It can also create a harmful second processing pass.

You can apply DRC before sending audio to your API or rely on your provider's internal processing. If your provider already applies internal normalization, adding external DRC creates a double-compression scenario. The external compressor reduces amplitude contrast. Then the internal normalizer applies a second pass. This compounds signal degradation.

When DRC Improves ASR Accuracy and When It Hurts

Default to unaltered audio. Reach for DRC only when loudness variation is the specific failure mode you've identified.

High-Noise Environments Where DRC Helps

DRC has conditional value in one narrow scenario: multi-speaker audio with significant loudness variation between speakers. A close-talker and a person farther from the microphone create a significant level gap. Quiet speech can fall below the model's effective recognition floor.

Healthcare ambient capture is one documented example. In these cases, gentle compression can bring the quieter speaker's level into a more usable range. But even here, background noise rather than level variation is usually the primary degrader, so noise suppression remains the higher-priority fix.

Scenarios Where DRC Degrades Accuracy

A 2025 study testing four ASR systems across 40 noise configurations found that audio preprocessing degraded transcription accuracy in every single test. DRC has its own version of this problem: once makeup gain restores the overall level, quiet passages come up, and background noise rides along with them.

Multiple researchers have argued that preprocessing shouldn't be applied by default, because it can degrade transcription in some conditions. The reason is that modern models extract patterns from the acoustic signal, including components that compression alters or removes.

The commonly cited claim that DRC produces an 18–23% WER reduction is fabricated. Those percentages trace to a LoRA parameter-efficiency paper. It measures model efficiency, not audio-level preprocessing. No Stanford or Google study testing DRC as a preprocessing step for ASR has been identified.

How Modern ASR Models Handle Dynamic Range Internally

Modern ASR models are already trained for noisy, variable audio. That built-in tolerance is one reason external DRC usually adds risk, not value.

Modern neural ASR architectures like Whisper and Deepgram's Nova-3 are trained on diverse, noisy, real-world audio. This training gives them built-in tolerance for acoustic variability, which means external preprocessing may distort a signal the model could already handle.

DRC Settings for Production Voice Applications

If DRC helps your audio at all, keep it gentle. Treat speech-oriented settings as upper bounds, not defaults.

Threshold, Ratio, and Attack/Release for Speech

If your A/B testing confirms DRC improves accuracy on your audio, broadcast standards provide conservative upper bounds. The ITU-R BS.2054-4 profile specifies a 2:1 ratio with onset at −12 dBFS.

Conservative speech settings based on standards-body guidance:

Ratio: 1.5:1–2:1 (never exceed 2:1 for ASR preprocessing)
Threshold: -12 dBFS for peak control
Attack: Fast, under 25 ms
Release: 300–500 ms
Maximum gain reduction: Under 6 dB (beyond this, audible artifacts like pumping and breathing become noticeable in broadcast practice)
True peak ceiling: -1 dBTP

These settings were developed for broadcast delivery, not ASR input. Treat them as upper bounds, not targets.
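
One way to keep a pipeline inside these bounds is to encode them as a checked configuration. The sketch below is an assumption-laden illustration: the class name is hypothetical, the 10 ms attack and 400 ms release defaults are simply values picked from inside the ranges listed above, and the gain-reduction check uses the static hard-knee formula rather than a measurement.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpeechCompressorSettings:  # hypothetical name, not a library class
    threshold_dbfs: float = -12.0
    ratio: float = 2.0
    attack_ms: float = 10.0    # assumed value inside the "under 25 ms" bound
    release_ms: float = 400.0  # assumed value inside the 300-500 ms range

    def gain_reduction_db(self, peak_input_dbfs: float) -> float:
        """Static gain reduction at the loudest level expected in the audio."""
        overshoot = max(0.0, peak_input_dbfs - self.threshold_dbfs)
        return overshoot * (1.0 - 1.0 / self.ratio)

    def violations(self, peak_input_dbfs: float) -> List[str]:
        """List which of the conservative speech bounds these settings break."""
        problems = []
        if not 1.5 <= self.ratio <= 2.0:
            problems.append("ratio outside 1.5:1 to 2:1")
        if self.attack_ms >= 25.0:
            problems.append("attack not under 25 ms")
        if not 300.0 <= self.release_ms <= 500.0:
            problems.append("release outside 300 to 500 ms")
        if self.gain_reduction_db(peak_input_dbfs) >= 6.0:
            problems.append("gain reduction not under 6 dB at this peak level")
        return problems


print(SpeechCompressorSettings().violations(peak_input_dbfs=-3.0))
print(SpeechCompressorSettings(ratio=4.0).violations(peak_input_dbfs=-3.0))
```

Note that with a 2:1 ratio and a -12 dBFS threshold, a full-scale 0 dBFS peak produces exactly 6 dB of gain reduction, so the "under 6 dB" bound only holds if your measured peaks stay below full scale.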

Multiband Compression for Mixed-Frequency Environments

Single-band compression applies uniform gain reduction, so a low-frequency HVAC hum can trigger gain cuts on speech frequencies. Multiband compression splits the signal into frequency bands and compresses each independently.

The ITU-R BS.2054-2 system requirements recommend several broadcast processing components. These include AGC, a multiband compressor, adjustable attack and release timing, and a limiter. For voice AI, multiband processing adds complexity and latency. Only consider it if single-band compression triggers on non-speech energy sources in your audio profile.
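
A quick way to check whether non-speech energy is what trips a single-band compressor is to split the audio into bands and compare levels against the threshold. The sketch below uses a one-pole low-pass and its complement as a crude two-band split. Production multiband compressors typically use steeper crossover filters, and the 60 Hz hum, the 1 kHz tone standing in for speech energy, and the 300 Hz cutoff are all assumptions for illustration.

```python
import math


def split_bands(samples, cutoff_hz, sample_rate_hz):
    """Split into complementary low/high bands with a one-pole low-pass."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    low, high, state = [], [], 0.0
    for x in samples:
        state += alpha * (x - state)  # one-pole low-pass update
        low.append(state)
        high.append(x - state)        # remainder, so low + high == input
    return low, high


def peak_dbfs(samples):
    peak = max(abs(s) for s in samples)
    return 20.0 * math.log10(peak) if peak > 0.0 else float("-inf")


SR = 16000
hum = [0.4 * math.sin(2 * math.pi * 60 * n / SR) for n in range(SR)]
voice = [0.05 * math.sin(2 * math.pi * 1000 * n / SR) for n in range(SR)]
mix = [h + v for h, v in zip(hum, voice)]

low, high = split_bands(mix, cutoff_hz=300.0, sample_rate_hz=SR)
THRESHOLD_DBFS = -12.0
print("broadband crosses threshold:", peak_dbfs(mix) > THRESHOLD_DBFS)
print("high band crosses threshold:", peak_dbfs(high) > THRESHOLD_DBFS)
```

In this synthetic case the broadband signal crosses the threshold because of the hum, while the band that carries the speech-like tone stays below it. If your own audio shows the same pattern, a high-pass filter ahead of a single-band compressor may be a simpler fix than full multiband processing.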

Testing DRC Impact on Your Specific ASR Provider

The only reliable way to know whether DRC helps your pipeline: run controlled A/B tests on your actual production audio.

Send identical audio samples through your ASR provider with and without DRC applied. Compare WER on both sets. If the numbers don't move, you have your answer.
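
For the comparison itself, WER is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. Here is a minimal sketch, assuming whitespace tokenization and lowercasing only. Real evaluations usually also normalize punctuation and numerals, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "the patient has a fever"
print(word_error_rate(reference, "the patient has fever"))    # without DRC
print(word_error_rate(reference, "the patient has a fever"))  # with DRC
```

Run it over the same set of files for both conditions and compare the averages. A difference that shows up on only a handful of utterances is not a reason to add a processing stage.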

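To produce the compressed variant for this test in Python, Spotify's open-source pedalboard library provides a Compressor class with traditional threshold and ratio parameters. The file names below are placeholders.

from pedalboard import Pedalboard, Compressor
from pedalboard.io import AudioFile

# Gentle single-band compressor: 2:1 above -20 dBFS.
board = Pedalboard([Compressor(threshold_db=-20, ratio=2.0)])

# Read the whole file as float32 samples shaped (channels, frames).
with AudioFile("input.wav") as f:
    audio = f.read(f.frames)
    sample_rate = f.samplerate

compressed = board(audio, sample_rate)

# Write the processed copy so both versions can go through the same ASR request.
with AudioFile("input_drc.wav", "w", sample_rate, compressed.shape[0]) as f:
    f.write(compressed)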

DRC Across Voice AI Workloads

DRC isn't a one-size-fits-all preprocessing step. Your codec, capture setup, and latency budget determine whether it's redundant, risky, or occasionally useful.

Contact Center Call Audio

Most contact center audio passes through G.711 telephony codecs before reaching your ASR API. G.711 applies logarithmic companding, encoding linear PCM into logarithmic codewords. This companding is, by definition, per-sample dynamic range management. Adding DRC on top creates redundant processing.
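
The companding curve can be sketched with the continuous μ-law formula. G.711 itself uses an 8-bit piecewise-linear approximation of this curve (and an A-law variant in many regions), so treat this as an illustration of the shape rather than a bit-exact codec.

```python
import math

MU = 255.0  # mu-law parameter used by the North American/Japanese variant


def mu_law_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] onto the logarithmic companded scale."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse curve, applied by the decoder to recover the linear sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


for sample in (0.01, 0.1, 1.0):
    companded = mu_law_compress(sample)
    print(sample, round(companded, 3), round(mu_law_expand(companded), 6))
```

A sample at 1% of full scale lands at roughly 23% of the companded range, which shows how much of the code space is devoted to quiet signals. The decoder applies the inverse curve, so measure levels on the decoded audio your ASR provider actually receives before deciding whether any further gain processing is warranted.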

G.729's vocoder architecture also makes post-codec DRC far less relevant. The codec models vocal tract parameters, not amplitude waveforms. Contact center platforms like Five9 route audio through telephony infrastructure before the signal reaches your transcription pipeline.

The one valid contact center use case is pre-codec gain normalization to handle agent-to-agent loudness variation at the source. This addresses level differences before the codec chain, not after.

Clinical Voice Documentation

Clinical audio splits into two categories. Close-proximity dictation produces high-SNR audio where DRC adds little value. Ambient room capture introduces the level-variation problem DRC can address, but background noise is the primary accuracy degrader. Noise suppression is the higher-priority fix.

If you do apply DRC to ambient clinical audio, keep compression conservative. Aggressive compression raises background noise amplitude proportionally. That can worsen the SNR that drives transcription errors.
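
The effect on the noise floor is easy to quantify with the static compressor curve. In this sketch the speech peak (-6 dBFS) and room noise (-45 dBFS) are hypothetical values, makeup gain is assumed to restore the original speech peak, and the "gap" is the distance between speech peaks and the noise heard in pauses, not a full SNR measurement.

```python
def static_level_db(level_db: float, threshold_db: float, ratio: float) -> float:
    """Hard-knee downward compression of a steady level, no makeup gain."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio


def peak_to_noise_gap_db(speech_peak_db: float, noise_db: float,
                         threshold_db: float, ratio: float) -> float:
    """Gap between speech peaks and pause noise after compression + makeup."""
    compressed_peak = static_level_db(speech_peak_db, threshold_db, ratio)
    makeup_db = speech_peak_db - compressed_peak  # restore the original peak
    noise_out = static_level_db(noise_db, threshold_db, ratio) + makeup_db
    return speech_peak_db - noise_out


SPEECH_PEAK, NOISE = -6.0, -45.0  # hypothetical ambient-capture levels
print("unprocessed gap:", SPEECH_PEAK - NOISE)
print("gentle 2:1 at -12 dBFS:", peak_to_noise_gap_db(SPEECH_PEAK, NOISE, -12.0, 2.0))
print("aggressive 8:1 at -30 dBFS:", peak_to_noise_gap_db(SPEECH_PEAK, NOISE, -30.0, 8.0))
```

Under these assumptions the gentle setting narrows the gap from 39 dB to 36 dB, while the aggressive setting narrows it to 18 dB.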

Real-Time Voice Agents

For voice agent pipelines, latency is the binding constraint. Zero-lookahead DRC adds near-zero latency and can normalize inter-speaker level variation. Look-ahead DRC adds mandatory delay equal to the lookahead window. Against a typical preprocessing budget of 25–50 ms, even a minimal 5 ms lookahead adds 5 ms of delay on that stage. Attack time shapes how quickly gain reduction ramps in, but it does not buffer audio, so it does not add to that delay.

Deepgram's documentation recommends using platform-native echo cancellation as a first step. It also recommends the provider's built-in speech detection rather than external VAD. For pure transcription use cases, the guidance is to send unaltered audio.

Start With Your Audio, Not Your Settings

Start by measuring the problem, not tuning knobs. If you can't show a level-variation issue in your own corpus, skip DRC.

Profile Before You Compress

Measure your audio's actual dynamic range before touching compression settings. Calculate RMS levels, peak-to-average ratios, and noise floor measurements. If your quiet speech is within 20 dB of your loud speech and above the noise floor, you likely don't need DRC at all.
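
A first-pass profile needs nothing beyond per-frame RMS levels. The sketch below is one possible approach, not a standard: the 20 ms frame size and the use of the 10th and 95th percentile frames as stand-ins for noise floor and loud speech are assumptions, and the synthetic test signal exists only to show the output shape. The input must contain at least one full frame.

```python
import math

SILENCE_DBFS = -120.0  # reporting floor for all-zero audio (assumption)


def to_db(power: float) -> float:
    """Convert mean-square power to dB relative to full scale."""
    return 10.0 * math.log10(power) if power > 0.0 else SILENCE_DBFS


def profile_levels(samples, sample_rate_hz, frame_ms=20.0):
    """Peak, RMS, peak-to-average ratio, and percentile frame levels."""
    frame_len = int(sample_rate_hz * frame_ms / 1000.0)
    frame_levels = sorted(
        to_db(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    )
    peak = max(abs(s) for s in samples)
    peak_dbfs = to_db(peak * peak)
    rms_dbfs = to_db(sum(s * s for s in samples) / len(samples))
    return {
        "peak_dbfs": peak_dbfs,
        "rms_dbfs": rms_dbfs,
        "peak_to_average_db": peak_dbfs - rms_dbfs,
        # Quietest frames approximate the noise floor, loudest the close talker.
        "noise_floor_dbfs": frame_levels[int(0.10 * (len(frame_levels) - 1))],
        "loud_frames_dbfs": frame_levels[int(0.95 * (len(frame_levels) - 1))],
    }


SR = 16000
room = [0.001 * math.sin(2 * math.pi * 200 * n / SR) for n in range(SR)]
talker = [0.5 * math.sin(2 * math.pi * 200 * n / SR) for n in range(SR)]
report = profile_levels(room + talker, SR)
for name, value in report.items():
    print(name, round(value, 2))
```

To apply the 20 dB rule of thumb you would compare frame levels for the quiet speaker against those for the loud speaker, which requires speech-only frames (for example from your provider's word timestamps) rather than the whole-file percentiles used here.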

Deepgram's Audio Intelligence features can help identify patterns in your audio corpus. That includes sentiment analysis and topic detection. Low-confidence segments correlated with quiet speakers may indicate a level-variation problem worth addressing. Low-confidence segments correlated with noise indicate a different problem entirely.

Test With Your ASR Provider

Run your profiled audio through your ASR provider with and without DRC. Compare WER, diarization accuracy, and sentiment detection results. If DRC doesn't measurably improve your target metric, skip it. Your ASR model was trained to handle the variation you're trying to fix.

Try it yourself—grab $200 in free credits and get started with Nova-3 on your actual production audio.

FAQ

What Is DRC in the Context of Voice AI?

In voice AI, DRC reduces the amplitude gap between loud and quiet parts of an audio signal before it reaches your speech-to-text model. Use it only when level variation is the actual problem.

Does Deepgram's Speech-to-Text API Require DRC Preprocessing?

No. Deepgram recommends sending unaltered audio for transcription. If your input already has stable levels and acceptable noise, extra compression usually adds risk.

Can DRC Improve Transcription Accuracy in Noisy Call Centers?

Usually no. Pre-codec gain normalization may help with loudness variation, but extra compression after the codec stage is usually redundant.

What Settings Work Best for Real-Time Voice Agents?

Use zero-lookahead compression if you need it at all. Keep ratio and gain reduction conservative, then measure transcription quality and preprocessing latency.

How Does DRC Interact With Speaker Diarization?

DRC flattens amplitude patterns that carry prosodic information like intensity variation and stress. That can make speaker separation and downstream sentiment analysis less reliable, especially when compression is aggressive.
