Environment-Aware TTS Just Sounds Better: The Key to Natural-Sounding AI
Synthesizing perfect-sounding speech is great, but it's only one part of producing natural-sounding Text-to-Speech (TTS). If we want truly natural results, non-speech "environment" noise matters at least as much as getting the voice right. Why is that?
Imagine you're crafting an indie game. Strapped for cash, you scrounged some open-source background audio and hired machines as voice actors (meaning you bought some TTS API credits and synthesized each character's dialogue).
You hope to produce such engrossing dialogue that players forget they're playing a game, meaning your characters' voices need to actually sound like they are in different places as the settings change.
You're nearly set. In isolation, the background audio clips sound like an alley, hot air balloon, lecture hall, and cave, as they should. And the voices also sound natural—in isolation. But you’re facing some serious problems.
When Synthesized Speech Sounds Too Clean
When you splice your machine-generated voices into these scenes' corresponding background audio, the voices sound too sterile, as if they were recorded in a studio and then jammed on top of the environment sounds; the voice and the environment just don't gel. On top of that, you're constantly tweaking the scripts, and, to avoid regenerating entire speech segments, you tried splicing the short edited voice segments into their respective larger speech segments. It's awkwardly obvious where you did so; the edited segment doesn't blend with the broader one, despite coming from the same machine "voice actors."
For example, one of your pre-generated speech segments featured a lone explorer pepping herself up before descending into a deep cave. When you spliced her speech in, her voice carried no dripping and no echoes. Another clip featured three interlocutors in a hot air balloon high above the earth, yet when you spliced it in, nobody sounded like they were shouting over the gas burner's muffled roar and the ambient breeze.
What you need is a generated speech segment with similar background sounds as the original "cave" audio and another that blends in well with the original hot air balloon audio—otherwise, the spliced speech clips sound oddly clean, stranded in the middle of lively environment noise. You thought you were being clever and frugal by harnessing TTS—now you're pulling your hair out.
Environment-aware TTS Can Help
Thankfully, researchers are making headway in "environment-aware TTS," in other words, synthesizing voices in a way that incorporates background noise.
This capability would be useful when using TTS to create films or video games; environment-oblivious TTS doesn't hack it for either of these applications. But the utility of environment-aware TTS isn't limited to creative endeavors. Nearly every TTS application could benefit from sounding more natural.
If you phone your bank, for example, and get stuck talking to a machine, it's nice if that machine's voice sounds like a human that's on a phone (rather than in a cave or hot air balloon). Same deal for machines taking your drive-through order, ebook readers navigating different characters and settings, and many more applications. Even though we know when we're listening to a machine voice, we still want it to sound like that voice originates from the environment we expect it to be in. Before we see how TTS models can pull this off (at least partly), we should loosely clarify what we mean by "environment noise."
What Exactly is Environment Noise?
Many different sounds might be considered environment noise, but the easiest way to carve this out is to say there are speech sounds, and then there's everything else that accompanies them. That non-speech, "everything else" category is environment noise.
Environment noise might originate from organisms (e.g., people, birds, crickets, or frogs) or things (e.g., traffic on nearby roads or an airplane passing overhead). But environmental noise is more complex than just external sources.
Reverberations and echoes play an important role too. So too do surfaces, which absorb and reflect sound to varying degrees. This can all get quite granular; holding everything else constant, a conversation in the same carpeted room would sound different if the room had a wood floor.
Room dimensions matter. Speaker positions matter. Furniture (and other objects' positions) matter. Many things can affect how sound waves propagate indoors and outdoors, and environment-aware TTS needs to somehow capture this.
How do Environment-Aware TTS Models Work?
We might guess that environment-aware TTS models need to disentangle speech from background noise (similar to the way we typically hear speech as distinct from background noises). That’s the right intuition.
To separate environment noise from speech, current approaches largely sort all audio into speech and non-speech buckets (like we did above), training models on these two categories so they learn the differences between environment noise and speech. These models also need to recombine speech and environment sound during synthesis, so the final output is a voice that sounds like it's in a specific environment.
Assuming machines can separate speech and environment noise, they might then "acoustically match" a speech clip to its coinciding environment sounds (e.g., we could generate a short clip of speech and splice it into a longer segment of someone whispering inside a cave, and it would sound like that same person whispering inside the same cave).
Now that we understand the broad approach (decomposing audio into speech and non-speech sounds, then recombining the two during synthesis), let's look closely at two recent approaches.
Decomposing Speech and Environment Noise
To create an environment-aware TTS model that could capture the "sound" of a room, Chinese University of Hong Kong researchers Tan et al. created an architecture with the following three main components:
speaker embedding extractor
environment noise embedding extractor
TTS module
You can already see they decomposed this problem into speaker and non-speaker noises, but let's look closer at how each of these works.
Speaker Embedding Extractor
Based on Wan et al.'s Speaker Verification system, Tan et al.'s "speaker embedding extractor" transformed speech utterances into numerical representations such that embeddings from the same speaker sat close together in the vector space while embeddings from different speakers sat far apart.
To train their speaker embedding extractor, Tan et al. used VoxCeleb1, a voice dataset containing 1,251 different celebrity speakers, including utterances from the same speaker in different environmental conditions—an important data feature for their model to learn differences between speaker and environmental noise.
They then cut 80-frame-length chunks from random places within each speaker's utterances and fed these to a model consisting of three 256-dimension Long Short-Term Memory (LSTM) layers, designed for sequential data (like language), followed by one linear layer. The resulting outputs formed a "speaker similarity matrix," which compares each utterance embedding against each speaker's centroid (the average of that speaker's utterance embeddings in the vector space).
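To make this concrete, here's a minimal sketch of that kind of speaker encoder, assuming a GE2E-style setup along the lines of Wan et al.; the layer sizes follow the description above, while the input feature dimension, function names, and the simplified centroid comparison are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Three 256-dimension LSTM layers followed by one linear layer, as described above."""
    def __init__(self, n_mels=40, hidden=256, emb_dim=256):  # n_mels is an assumption
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mel_chunks):                    # (batch, 80 frames, n_mels)
        _, (h, _) = self.lstm(mel_chunks)             # final hidden state of the last layer
        return F.normalize(self.proj(h[-1]), dim=-1)  # unit-length speaker embeddings


def similarity_matrix(embeddings):
    """embeddings: (n_speakers, n_utterances, emb_dim).
    Compare every utterance embedding against every speaker's centroid
    (the mean of that speaker's utterance embeddings)."""
    centroids = F.normalize(embeddings.mean(dim=1), dim=-1)    # (n_speakers, emb_dim)
    return torch.einsum("sue,ce->suc", embeddings, centroids)  # cosine similarities
```

(In the full GE2E loss, an utterance is excluded from its own speaker's centroid before comparison; the sketch skips that detail for brevity.)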
Environment Embedding Extractor
Next, Tan et al. needed a way to extract and embed environment noise. To build this, Tan et al. used Room Impulse Responses (RIRs), measurements of how sound waves reflect, echo, and get absorbed when moving through different spaces. For a nice demonstration of how RIRs affect sound, watch the first few minutes of this video:
Video credit: Free To Use Sound (Marcel Gnauk records RIRs from a few different environments and demonstrates how they can be used to produce more natural sounding voice recordings).
Using 2,325 different measured RIRs from the ReverbDB dataset to represent different sound environments, they convolved (combined) RIRs with clean speech utterances from the LibriTTS dataset, making the speech sound more reverberant.
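Convolving an RIR with dry speech is nearly a one-liner in practice. Here's a minimal sketch; the file names are hypothetical, and both files are assumed to share a sample rate:

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

# Hypothetical file names -- substitute a clean utterance (e.g., from LibriTTS)
# and a measured RIR (e.g., from ReverbDB).
speech, sr = sf.read("clean_utterance.wav")
rir, _ = sf.read("measured_rir.wav")

# Convolving the dry speech with the RIR "places" the voice in that room.
reverberant = fftconvolve(speech, rir, mode="full")[: len(speech)]

# Normalize to avoid clipping, then save.
reverberant /= np.max(np.abs(reverberant)) + 1e-9
sf.write("reverberant_utterance.wav", reverberant, sr)
```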
Tan et al. ensured that each set of the same environment's audio contained numerous speakers (so their model wouldn't simply memorize a speaker by associating that speaker with a single environment or vice versa). Like their speaker embedding extractor, Tan et al. designed their environment embedding extractor to make similar environments' embeddings close to each other in the vector space and disparate environments' embeddings far from one another. Also like the speaker embeddings, each environment embedding was compared with the centroid of every other environment embedding to create an environment embedding similarity matrix.
Text-to-Speech Module
The last piece of Tan et al.'s architecture, a TTS module, involves the following steps:
Given a text, a "duration predictor" estimates that text's corresponding phonemes' durations (i.e., how long each constituent sound ought to be)
a length regulator and encoder are both fed the text and its duration prediction
the speaker and environment embeddings are concatenated to the encoder's output (a minimal sketch of this conditioning step follows this list)
a preliminary neural network (Prenet) does, as it sounds, some preliminary processing before sending that data to a decoder to generate a mel-spectrogram (a visual representation of sound across time), which is then transformed into a speech waveform by a HiFi-GAN vocoder.
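Here's the promised sketch of that conditioning step (step 3), assuming the speaker and environment embeddings are simply broadcast across time and concatenated onto each encoder frame; the function name and dimensions are illustrative:

```python
import torch

def condition_encoder_output(encoder_out, speaker_emb, env_emb):
    """encoder_out: (batch, time, enc_dim); speaker_emb, env_emb: (batch, emb_dim).
    Broadcast both embeddings across time and concatenate them onto every frame,
    so the decoder sees 'who is speaking' and 'where' at each step."""
    batch, time, _ = encoder_out.shape
    spk = speaker_emb.unsqueeze(1).expand(batch, time, -1)
    env = env_emb.unsqueeze(1).expand(batch, time, -1)
    return torch.cat([encoder_out, spk, env], dim=-1)

# Example: 256-dim encoder frames plus 256-dim speaker and environment embeddings.
out = condition_encoder_output(torch.randn(2, 120, 256),
                               torch.randn(2, 256),
                               torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 120, 768])
```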
Here's what the whole system looks like:
The Experiment
To train and test the TTS module, Tan et al. used different speaker and environment sound data than they'd trained their embedding extractors on, and this time the speaker and environment sounds were completely entangled: each of the 108 randomly selected speakers was paired with randomly selected, unique environment (RIR) noise. Tan et al. selected 100 utterances from each speaker-environment pair; one speaker-environment pair kept "clean" environment sound, while the remaining pairs had the speaker and RIR environment sounds convolved together. They set aside 5% of these speaker-environment pairs for validation and trained their TTS module on the rest.
As a baseline system to test their proposed system against, Tan et al. used the same components (speaker and environment embeddings and a TTS module), but trained their baseline system to classify speakers and environments without disentangling them; they trained their proposed system to untangle speaker and environment noise.
The Results
After testing their proposed system, Tan et al. found that their speaker embedding extractor parsed out who was speaking, regardless of the accompanying environmental noise. Likewise, their environment embedding extractor identified environment sound, regardless of the speakers. How'd they verify this?
To test their system, Tan et al. extracted speaker and environment embeddings from two original speech utterances. From these embeddings, Tan et al. synthesized new speech utterances.
They then inputted these synthesized utterances into their proposed system (designed to extract a speaker embedding and an environment sound embedding from the synthesized utterance).
t-SNE Tests
Next, from these derived speaker and environment embeddings and their labels, Tan et al. made t-SNE (t-distributed stochastic neighbor embedding) visualizations. t-SNE is an algorithm that maps high-dimensional data (like sound or image data) down to a 2-dimensional representation; in a t-SNE visualization, similar data points end up grouped close together and dissimilar ones far apart.
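If you want to produce this kind of plot for your own embeddings, a minimal sketch with scikit-learn might look like this (the embeddings and labels below are random stand-ins):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Random stand-ins for the extracted embeddings and their speaker/environment labels.
embeddings = np.random.randn(500, 256)
labels = np.random.randint(0, 10, size=500)

# Project the 256-dim embeddings down to 2-D so clusters can be inspected by eye.
coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of extracted embeddings (colored by label)")
plt.show()
```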
These t-SNE visualizations suggest that the speaker extractor extracts speaker sounds without learning the environment sounds, and, conversely, the environment extractor extracts environment sounds without learning speaker sounds. Take a look at them below:
Measuring Generated Speech Quality
As a system-level objective test, Tan et al. used Mel-Cepstrum Distortion to measure how close their synthesized speech and environment noise sounded to actual human speech and environment noise. They used three scenarios for this:
one where the system had previously seen a voice and environment noise paired together
one where it saw the voice and environment noise but never together
one where the voice and environment noise were novel to their system
The results showed that Tan et al.'s proposed system wasn't as good as their baseline system at synthesizing voices it had already "heard" during training. Where Tan et al.'s proposed system excelled was in mixing voices and environments it had seen before in new ways, even creating speech for completely new voices and environments it hadn't trained on.
Judging Generated Speech Quality
For a subjective test, Tan et al. asked people to listen to the machine-generated speech and rate how similar it sounded to real speech. People thought the proposed system produced more realistic speech sounds than the baseline system, particularly when mixing known voices and environments in new ways or when creating speech for new voices and places. Together, Tan et al.'s tests suggest their system can make new, environment-aware, realistic-sounding speech even in situations it hasn't directly learned from before. You can judge for yourself by listening to samples of their synthesized speech for each of the above three scenarios. All around, they mimic the reference voices and the reference environment sounds quite well.
VoiceLDM
Tan et al.'s approach goes a long way toward making synthetic voices that sound like they fit within specific rooms, but we might want to generate speech within more dynamic real-world environments (like caves and hot air balloons), and we might want more control over the generation process. VoiceLDM offers both of these things: more control and more dynamic background noise.
Korea Advanced Institute of Science and Technology researchers Lee et al. created the more steerable VoiceLDM to take in two prompts:
a prompt for the content (whatever you want the TTS model to "say")
a "description" prompt informing the model what kind of environment or background noise to place the speech within
What VoiceLDM Can Do
Inspired by AudioLDM, Lee et al. structured VoiceLDM as a Text-to-Audio (TTA) model that incorporates latent diffusion conditioned on a "content" prompt (TTS is a subset of TTA, which includes applications beyond TTS like text-to-music).
The results? VoiceLDM synthesizes audio that incorporates environment noise and speech. You can listen to some VoiceLDM-synthesized TTS samples here. When VoiceLDM is given a content and description prompt, the environment noise sometimes sounds scratchy. The speech also sometimes sounds distorted, but, overall, it's not bad. If you only give VoiceLDM a content prompt, it just synthesizes a "clean" voice without background audio (i.e., TTS). Conversely, if you only give VoiceLDM a description prompt, it only generates environment sound (i.e., TTA).
This means you can ask VoiceLDM to generate speech with specific speech sounds (e.g., whispering, singing, yelling, etc.) in different environmental settings (e.g., in the rain, at the beach, in a cathedral, in a cave, in a hot air balloon, etc.) or just the speech or just the environment sounds (e.g., birds chirping, a violin playing, a crowd of people, etc.), making for a very diverse tool.
Many recent TTA models can generate background noise well, and many TTS models produce clean-sounding speech well, but when you try synthesizing both of these together, either the speech or the background noises often sound garbled. Lee et al. see VoiceLDM as an innovative intersection between TTS and TTA. VoiceLDM or a similar model’s capabilities could prove clutch for an indie game maker, film maker, or anyone utilizing TTS in scenarios where background noise is important.
VoiceLDM's Components
Let's look at VoiceLDM's high-level inputs and outputs to understand how it works. First, the inputs. The following two natural language prompts are fed into VoiceLDM:
text_cont: the content of some speech (what words the machine should "speak")
text_desc: a description of the environment noise (e.g., shouting over a passing freight train, whispering in a library, a cafe, etc.).
These two inputs go through several modules before VoiceLDM outputs an audio clip conditioned on both inputs (or just one, if the other is blank). What’s going on "under the hood"? We'll go through each step (and when we do, you'll notice some components similar to Tan et al.'s) but, first, you may want to glance at the below diagram for an overview.
Mapping Text to Audio
In the upper left portion of the diagram, you'll see the first module is CLAP (Contrastive Language-Audio Pre-training), a model trained to match audio-text pairs by bringing similar pairs close together and pushing dissimilar pairs far apart in the embedding space (the same concept Tan et al.'s speech and environment embedding extractors employed).
CLAP can match text-to-audio or audio-to-text. In VoiceLDM's case, CLAP does the former; a description of environment sounds, text_desc, gets passed to CLAP, which then outputs c_desc, a 512-dimension vector representing the description (of the environment sound) condition.
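To get a feel for this step, here's a rough sketch using the CLAP implementation that ships with Hugging Face's transformers library; the checkpoint name is illustrative, and this is not necessarily the exact model Lee et al. used:

```python
import torch
from transformers import AutoTokenizer, ClapTextModelWithProjection

# Illustrative checkpoint; any CLAP checkpoint with a text projection head behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused")
model = ClapTextModelWithProjection.from_pretrained("laion/clap-htsat-unfused")

text_desc = "whispering inside a large, echoing cave with water dripping"
inputs = tokenizer([text_desc], return_tensors="pt", padding=True)

with torch.no_grad():
    c_desc = model(**inputs).text_embeds  # a 512-dimension description embedding

print(c_desc.shape)  # torch.Size([1, 512])
```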
Then the other prompt, text_cont, is passed through a pretrained SpeechT5 content encoder, which embeds the speech's content text into audio features. Those features go to a "differentiable durator" (similar in spirit to the duration predictor we saw in Tan et al.'s approach), which models how long each synthesized phoneme (i.e., a distinct, meaningful, often subword sound unit) should be given the embedded text. Modeling phoneme durations is a crucial step in generating natural-sounding text-to-speech; get them wrong, and synthesized voices sound robotic. The durator's output is c_cont, the content condition.
Diffusion-Based Audio Generation
Next, both conditions, c_cont and c_desc, along with a timestep embedding, are fed into a U-Net, a neural network originally designed for medical image segmentation that uses a large-to-small encoder and a symmetric small-to-large decoder (hence the 'U'). The U-Net outputs a predicted diffusion score to use when denoising signals. Most current diffusion models are applied to image generation, but they are increasingly being applied to other modalities, including audio (often voice and music).
Broadly, audio diffusion models learn from adding increasingly more random noise to audio data across time steps until the audio is entirely noise (akin to a drop of dye dispersing throughout a glass of water). Below is an illustration of this process applied to an image:
Then, during "denoising," this diffusion process is reversed, meaning that, starting from complete noise, the model iteratively removes noise until only clean audio remains.
From this "noising" and "denoising," diffusion models learn to predict audio by calibrating the difference between their noise prediction and the actual noise at each time step.
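Stripped of detail, that training objective is just "add noise at a random timestep, then predict it." Here's a minimal sketch assuming a standard DDPM-style noise-prediction loss; `model` stands in for the conditioned U-Net, and its call signature is hypothetical:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, latents, c_cont, c_desc, alphas_cumprod):
    """One DDPM-style step: corrupt the (latent) audio at a random timestep, then
    train the conditioned U-Net to predict exactly the noise that was added."""
    batch = latents.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (batch,), device=latents.device)
    noise = torch.randn_like(latents)

    # Forward ("noising") process: blend clean latents toward pure noise as t grows.
    a_bar = alphas_cumprod[t].view(batch, *([1] * (latents.dim() - 1)))
    noisy = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise

    # The U-Net predicts the added noise, given the timestep and both conditions.
    predicted_noise = model(noisy, t, c_cont, c_desc)
    return F.mse_loss(predicted_noise, noise)
```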
Producing the Final Audio
VoiceLDM's reverse-diffusion model outputs its predicted audio as a compressed latent representation and feeds it to a pretrained Variational Autoencoder (VAE), a model that learns to encode features of continuous data (audio in this case) and then, with a decoder, generate data that's similar to but distinct from the data it learned from. The VAE decodes the predicted latent into a mel-spectrogram (a visual representation of frequencies across time), which is then converted to high-quality audio by a HiFi-GAN vocoder.
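A rough sketch of that final decoding stage, with `vae` and `vocoder` standing in for the pretrained modules (both of which, as noted below, stay frozen during training):

```python
import torch

@torch.no_grad()
def latents_to_waveform(latents, vae, vocoder):
    """Hypothetical final stage: the denoised latent is decoded by the frozen VAE into a
    mel-spectrogram, which a frozen HiFi-GAN-style vocoder turns into a waveform."""
    mel = vae.decode(latents)   # (batch, n_mels, frames)
    return vocoder(mel)         # (batch, samples)
```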
So that they learn how to work together, the U-Net backbone, content encoder, and differentiable durator are all trained together. Throughout training, the pre-trained CLAP, VAE, and HiFi-GAN vocoder models remain frozen (i.e., their weights don't update).
Emphasize Speech or Environment
Because VoiceLDM has two conditions, you can weight them independently. If, for example, speech quality is more important for a specific audio generation, you can increase the content's weight and lower the sound description's weight. If you prefer better environmental noise, do the opposite. If you want balance, set the weights near 50/50. In a bit, you'll see from Lee et al.'s tests that how you set these weights significantly impacts the generated audio's quality.
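In the spirit of classifier-free guidance with two conditions, here's a sketch of how those two weights might combine the U-Net's conditional and unconditional predictions; the call signature is illustrative, not VoiceLDM's actual API:

```python
def guided_noise_prediction(model, noisy, t, c_cont, c_desc, w_cont=7.0, w_desc=7.0):
    """Dual-guidance sketch: start from the fully unconditional prediction, then push it
    toward the content condition and the description condition by their respective weights.
    (`None` stands in for a 'null'/empty condition embedding.)"""
    uncond = model(noisy, t, None, None)
    cont_only = model(noisy, t, c_cont, None)
    desc_only = model(noisy, t, None, c_desc)
    return uncond + w_cont * (cont_only - uncond) + w_desc * (desc_only - uncond)
```

Raising w_cont pushes the output toward intelligible speech; raising w_desc pushes it toward the described environment.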
Training VoiceLDM
For training data, Lee et al. split the English portions of AudioSet, CommonVoice 13.0, VoxCeleb1, and DEMAND (all public audio datasets) into speech or non-speech segments. Since some of this speech data wasn't transcribed, Lee et al. transcribed it with a medium Whisper model, using these transcriptions to classify the speech clips' language.
They then sent speech classified as "English" to a large Whisper model to get more accurate transcriptions. For some shorter audio clips (less than 10 seconds) with existing transcriptions, Lee et al. used the existing transcriptions.
This gave them 2.43 million speech and 824,000 non-speech segments, all standardized to 10 seconds long (by taking the first 10 seconds of long clips or padding shorter clips).
To make VoiceLDM more generalizable, Lee et al. randomly added non-speech audio to half of the CommonVoice speech segments and randomly altered the non-speech segment's signal-to-noise ratios (all other audio data already contained noise).
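Here's a minimal sketch of that augmentation step (mixing a clean clip with background noise at a randomly drawn signal-to-noise ratio); the SNR range is an assumption, not the paper's exact setting:

```python
import numpy as np

def mix_at_random_snr(speech, noise, snr_db_range=(0.0, 20.0), rng=None):
    """Add background noise to a clean speech clip at a randomly drawn signal-to-noise ratio.
    Both inputs are 1-D float arrays at the same sample rate; the noise is tiled or
    trimmed to match the speech length."""
    rng = rng or np.random.default_rng()
    noise = np.resize(noise, speech.shape)
    snr_db = rng.uniform(*snr_db_range)
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```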
With this data, Lee et al. trained two VoiceLDM models: one with 128 channel dimensions and one with 192 (amounting to roughly 280 million and 508 million parameters, respectively).
Testing VoiceLDM
Lee et al. tested VoiceLDM on three scenarios:
only speech generation (i.e., clean speech)
only environment noise generation (i.e., pure audio)
speech generation and environment noise together
Testing Simultaneous Speech and Environment Audio Generation
Lee et al. created an experimental setting, VoiceLDM-M_audio, by feeding the model ground-truth audio (instead of the text_desc they'd normally input). To quantitatively measure VoiceLDM's ability to generate speech and environment sound together, they weighted the content and environment sound descriptions equally (both at 7) and used the following metrics on the AudioCaps test set (a dataset with audio and human-written caption pairs):
Fréchet Audio Distance (FAD): measures how similar generated and real audio distributions are to each other, considering audio features and time (lower = better)
Kullback-Leibler (KL) divergence: also scores similarity between generated and real audio but does so by measuring the information lost when one distribution is used to approximate another (lower = better)
CLAP score: measures how well the audio matches the text descriptions (higher = better)
Word error rate (WER): measures how close synthesized speech is to a reference text by tallying up the minimum number of additions, deletions, or substitutions required to make the speech match the reference text (lower = better)
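Of these, WER is the most mechanical to compute; a minimal word-level edit-distance implementation looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Minimum word-level insertions + deletions + substitutions needed to turn the
    hypothesis into the reference, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words instead of characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn on the lights", "turn the light on"))  # 0.75
```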
For this test, Lee et al. compared VoiceLDM's performance with AudioLDM2, another model that also allows user-specified speech or audio descriptions (but not at the same time like VoiceLDM). Below are the results:
Across these tests, VoiceLDM-M and VoiceLDM-M_audio performed significantly better than AudioLDM2.
For a qualitative assessment, Lee et al. asked at least 10 listeners per audio generation to rate the following elements (on a scale of 1 to 5):
Overall Impression (OVL): how good a generated audio clip sounds in general
Relevance Between Audio and Condition (REL): how well the generated audio matches the environment noise and content descriptions
Mean Opinion Score (MOS): how natural and clear the audio generation sounds
For these qualitative tests, VoiceLDM-M_audio performed close to or better than ground truth.
Overall, these tests show that VoiceLDM generates audio that considers both inputs (speech and environment audio descriptions) simultaneously, whereas AudioLDM2 does one or the other well but fails to produce clear speech and environment sounds together.
Testing Speech Alone
To test VoiceLDM's speech generation in isolation, Lee et al. inputted "clean speech" for text_desc, lowered the environment audio description weight to 1, and upped the content weight to 9. With this setup, they compared VoiceLDM's performance on the CommonVoice test set against SpeechT5 and FastSpeech 2 (both TTS models) using WER and MOS.
Though they lagged in MOS scores, you can see all VoiceLDM models outperformed ground truth on WER scores, and VoiceLDM-M earned significantly better WER scores than FastSpeech 2 and SpeechT5.
Testing Environment Audio Alone
VoiceLDM's training sample consisted solely of audio clips containing human voices. Yet it can perform zero-shot TTA if you make the content description an empty string. To test how well VoiceLDM produces audio without voice, Lee et al. upped the environment noise description weight to 9, lowered the speech weight to 1, and compared VoiceLDM's performance on the AudioCaps test set against AudioLDM.
You can see that, in general, VoiceLDM doesn't come close to AudioLDM in pure TTA, nor does it approach ground truth on the CLAP score.
Bringing Conversations to Life
Environment-aware TTS is an important advancement, enabling synthesized speech to sound situated within realistic-sounding acoustic environments. By accounting for ambient noise, reverberation, and other auditory factors, TTS can sound much more natural. Technologies like Tan et al.’s dual embedding approach and VoiceLDM showcase this potential for synthesized speech that fits its surroundings. But the benefits don’t stop here.
Many current TTS models train primarily on clean speech recordings, which is problematic because:
Producing sterile speech recordings is costly; it typically requires a sound-proof, studio-like recording room and professional equipment.
We have oceans of unclean audio data (YouTube, TikTok, radio, podcasts, etc.). If we could utilize this sound data, whatever its imperfections and environment noise, we would have much more speech data (and more robust data) to train models on, leading to more generalizable TTS.
Thanks to research like Tan et al.'s and Lee et al.’s work, future TTS models will likely shed their dependency on clean speech data. This should translate to more robust models, which will help many TTS use cases, whether it’s phoning your bank or generating your own soundscapes and speech for an independent video game. It’ll be interesting to watch just how realistic TTS will end up sounding and how granular our control over the generation process can get as people experiment with different ‘environment-aware’ approaches.