In grad school at Stanford, there were quite a few opportunities to pursue independent projects. I’d always been interested in voice AI, especially voice cloning, so one project I took on was a “Child-to-Adult Voice Style Transfer” AI pipeline.

Author’s Upfront Note: Companies like ElevenLabs and Deepgram have far outperformed anything presented in this article. That is, these companies have mastered voice AI much better than the collegiate version of myself. Nevertheless, this article offers a look into the complexities of audio, machine learning, and machine learning research.

Voice style transfer is the process by which we turn one person’s voice into another’s. Here’s how that process works:

  1. I record myself saying a sentence like: “I want chocolate cake right now.”

  2. I input that recording into the voice style transfer AI.

  3. I choose someone else’s voice. Let’s say I choose Elvis Presley.

  4. Given my original audio and the target voice of Elvis, the AI should be able to transform my voice/words into Elvis saying the exact same thing.

Numerous state-of-the-art models boast the ability to produce .wav files that mimic a certain target speaker’s voice, even after hearing only one recording of that speaker. The ability to disguise one person’s voice as another person’s voice has applications in numerous domains, ranging from the entertainment industry to privacy and security. 

However, the current models share one common imperfection: None of them have ever been trained or tested on children’s voices.

In my project, we attempt to perform child-to-adult voice style transfer using current state-of-the-art models. We find that even the best voice style transfer pipelines have a difficult time handling child inputs, even though the same pipelines produce impressive adult-to-adult conversions.

More specifically, the voice cloning method retains the content of the original speech but displays lackluster style transfer. Meanwhile, the zero-shot and traditional many-to-many voice style transfer models provide incredible style conversion, but the content of the speech is lost.

Here are the details of this project:

The Ello Speech Dataset: Data that contains parallel child and adult speech

Our data was provided by Catalin Voss of Ello, a startup technology firm specializing in child literacy and educational products. The dataset consists of over 10 hours of page-by-page recordings of an adult voice actor and young children reading from a set of over 150 children’s books. The recordings are of high quality and contain minimal background noise.

The image above summarizes the organization of the data and provides high-level metrics.

The first noticeable difference between the adult speech samples and child samples is that the adult samples are all from the same individual, whereas the child samples are taken from a group of 13 separate kindergarten through 2nd grade students. Although we have multiple child speakers, we rarely have a book that was read by more than one child.

Again, see the image below for more details. It sorts the books by number of utterances and stratifies them by age (adult/child).

Given that the average number of words per utterance is higher for the adult speaker than the child speakers, it is likely that the omitted books are of a more advanced level with more words per page than would be appropriate for 5-7 year olds.
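For readers who want to reproduce a words-per-utterance comparison like the one above, here is a minimal sketch. The transcripts in it are made up for illustration (the real Ello transcripts are not reproduced here), and the function name is my own.

```python
from statistics import mean

# Hypothetical (speaker_group, transcript) pairs, invented for illustration.
utterances = [
    ("adult", "The little bear walked slowly into the dark forest"),
    ("adult", "He was looking for his lost red hat"),
    ("child", "The bear is big"),
    ("child", "I see a hat"),
]

def mean_words_per_utterance(rows, group):
    """Average whitespace-delimited word count for one speaker group."""
    counts = [len(text.split()) for g, text in rows if g == group]
    return mean(counts)

adult_avg = mean_words_per_utterance(utterances, "adult")
child_avg = mean_words_per_utterance(utterances, "child")
print(f"adult: {adult_avg:.1f} words/utterance, child: {child_avg:.1f} words/utterance")
```

Splitting on whitespace is a rough proxy for word count, but it is enough to surface the kind of adult/child gap described here.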

Our Approach: Three Voice Style Transfer Models

Approach 1: Voice Cloning Architecture

Our approach consists of three different models. The first model is a classic voice-cloning architecture. Namely, we input a transcript and a child speaker’s .wav file, and the model produces a transformed version of the audio. The image below illustrates the pipeline of this particular method. The decoder and vocoder of this model (Modules 2 and 3 in the image) have been pre-trained on adult speakers, allowing the model to conduct child-to-adult style transfer.

Approach 2: Few-shot AutoVC Architecture

In our second approach, we use the AutoVC model in two different ways. First, we conduct a zero-shot experiment. That is, we use AutoVC out of the box with no further modifications. We simply feed it two .wav files: the child audio we want to modify and the adult target audio.

The second experiment we conduct makes use of a traditional many-to-many voice conversion model. Again, we use AutoVC. However, instead of simply using the model as is, we add a finetuning step during training. Specifically, we finetune the content encoder and the decoder on the adult-and-child parallel dataset we have on hand.

We expected this model to perform better than the zero-shot model because, intuitively, we believe a voice style transfer architecture will naturally perform better on voices it has seen during training.

The AutoVC pipeline functions as follows: Two input audios are fed into the model. The first audio is the original speaker whose voice we want to modify. The second is our target speaker. The original speaker audio is fed into a “Content Encoder,” wherein the model extracts the features of what is said. 

At a high level, the content encoder listens for the original speaker’s words, not their voice, and embeds those words into a high-dimensional space. Meanwhile, the target speaker’s voice is fed into a Style Encoder. This style encoder listens for the cadence, accent, and fluency of the target speaker, among other features. Once again, these features are vectorized and placed into a high-dimensional embedding space.

Now that we’ve embedded the words of our original speaker (the child) alongside the speaking style of our target speaker (the adult), we are ready to decode the two embeddings. A decoder takes the two input embeddings and essentially fuses them together. 

The result is an embedding of an audio that contains the target speaker’s voice on the original speaker’s words. This embedding is fed into a vocoder that transforms this embedding into a spectrogram and corresponding audio. The image below illustrates this pipeline.
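To make the data flow of this pipeline concrete, here is a toy sketch in plain Python. It is not the real AutoVC code: every function is a hypothetical stand-in that operates on small lists of floats, and the actual model uses neural networks for each stage. The sketch only shows which input goes where.

```python
# Toy stand-ins for the AutoVC stages. These are NOT the real networks:
# each function is a hypothetical placeholder working on lists of floats.

def content_encoder(frames, stride=2):
    """Keep every `stride`-th frame, mimicking a bottleneck that squeezes
    the source utterance down to a compact code of what was said."""
    return frames[::stride]

def style_encoder(frames):
    """Average all frames into one fixed-size vector describing the speaker."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def decoder(content, style, stride=2):
    """Repeat each content code `stride` times to restore the frame rate,
    then append the style vector to every frame (list concatenation)."""
    upsampled = [frame for frame in content for _ in range(stride)]
    return [frame + style for frame in upsampled]

# Each "frame" is a tiny 2-dimensional feature vector (made-up numbers).
child_frames = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
adult_frames = [[1.0, 2.0], [3.0, 4.0]]

content = content_encoder(child_frames)  # what was said (original speaker)
style = style_encoder(adult_frames)      # who should say it (target speaker)
fused = decoder(content, style)          # what would be handed to the vocoder

print(len(fused), len(fused[0]))
```

Note that the style vector has the same size no matter how long the target recording is, which is what makes it possible to condition on a speaker after hearing only a single utterance.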

The results: Which AI performed best? It turns out… none of them 😅

Before delving into the results of child-to-adult voice style transfer, it is crucial to establish a baseline for the performance of these models. An example of the voice cloning results for adult-to-adult speech style transfer using AutoVC is presented in the image below, with associated mel-spectrograms. 

The audio file for the traditional approach can be found at this link, while the audio file for the zero-shot approach can be found at this link.

Upon listening to these examples, it becomes clear that AutoVC has no problem converting one adult voice to another, even across sexes and accents.

With these baselines in mind, let’s examine the results of these models on child-to-adult voice conversion. Note that the spectrograms of the audios of the target speaker and original speaker are presented in the image below alongside all generated speech style transfer spectrograms.
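For readers unfamiliar with spectrograms, here is a minimal sketch of how one is computed: slice the audio into overlapping frames, window each frame, and take the magnitude of its Fourier transform. This is a plain magnitude spectrogram built with a naive DFT from the standard library, not the mel-spectrogram code used in the project; a mel-spectrogram additionally maps the frequency bins onto the mel scale with a filterbank. The frame size, hop, and the synthetic 1000 Hz test tone are my own assumptions.

```python
import cmath
import math

def magnitude_spectrogram(signal, frame_size=256, hop=128):
    """Short-time Fourier magnitudes via a naive DFT.
    Returns one list of frame_size // 2 + 1 magnitudes per frame."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        for k in range(frame_size // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

# Synthetic test tone standing in for real speech (an assumption for the demo).
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(800)]

spec = magnitude_spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print("strongest frequency in first frame (Hz):", peak_bin * sample_rate / 256)
```

Each row of `spec` is one time slice and each column one frequency bin, which is exactly the grid that the spectrogram images render as a heat map.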

Voice cloning results

Upon examining the .wav file produced by voice cloning—linked here—we immediately see that much of the content of the original speech was preserved. Unlike the models we will examine later, voice cloning produces speech that is easy to parse and understand.

Stylistically, however, voice cloning falls a bit flat, especially in comparison with the two AutoVC models discussed below. While the audio generated does not sound like a child, the fact remains that it could sound more like an adult. Such high-quality, adult-style speech is beautifully portrayed by the outputs of the other two models.

We suspect that the preservation of speech content arises because this model takes a transcript as input. Thus, it has an easier time extracting phonemes and recognizing English words than its AutoVC counterparts.

Zero-Shot AutoVC Results

Zero-shot AutoVC produced a .wav file whose style sounds much better than the voice cloning result. The audio produced—linked here—sounds very much like an adult woman. However, this improvement in style comes at the cost of content preservation. 

More specifically, the audio produced here sounds like an adult woman speaking gibberish. Admittedly, the people we presented this audio to stated that the gibberish sounds more like French or Mandarin than English. Nevertheless, the gibberish produced does indeed sound like an adult.

Examining the corresponding spectrogram in the image above, however, it is not immediately obvious that the content of the original speaker was lost in the generated audio. Thus, we can infer that the model was able to create and decode an embedding that looks right, but was ultimately superficial in its feature extraction.
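Since the spectrogram alone does not reveal lost content, one way to put a number on content preservation would be to transcribe the generated audio and compare it with the original transcript using word error rate (WER). We did not do this in the project, where we relied on listening, so treat the following as a sketch. The hypothesis strings are invented; in practice they would come from a speech recognizer.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "I want chocolate cake right now"
perfect = word_error_rate(reference, "I want chocolate cake right now")
degraded = word_error_rate(reference, "I want a cake now")  # invented transcript
print(perfect, round(degraded, 3))
```

A WER of 0 means every word survived the conversion. Gibberish output like the kind described above would push the score toward 1 or beyond, since insertions can make the error count exceed the reference length.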

Traditional, Many-to-Many AutoVC

This model, unlike its zero-shot counterpart, has had its Content Encoder and its Decoder trained on our parallel dataset. As a result, we expected this model to outperform the previous one. Our results were surprising. 

Again, the model produces an audio that sounds like an adult woman speaking gibberish. However, when listening to the .wav file (linked here), it’s clear that this generated voice is distinct from the one produced by the zero-shot model. That is, the two differently trained AutoVC models produced outputs that sound like two different women speaking different gibberish. Still, much like the previous model, the general consensus among the listeners we informally surveyed is that the gibberish sounds like Mandarin or French.

Moreover, regarding the spectrogram, we make the same observation as with the zero-shot model: it is not immediately obvious that the content of the original speaker was lost in the generated audio. The spectrogram looks like that of a normal, non-gibberish utterance. From this, we reach the same conclusion: the model, even when trained on the children’s voices it’s intended to convert, remains ultimately superficial in its feature extraction and decoding.

Conclusion

In this project, we reveal that even state-of-the-art voice style transfer architectures have a tough time processing child speech. At a high level, we can intuit that one reason for this difficulty is that children not only have higher-pitched voices than adults, but also a different “cadence” and “rhythm.” They need to sound out words, and they stumble in different ways than adults do.

But again, companies like Deepgram and ElevenLabs are not only mastering but also perfecting the art of voice-AI and auditory machine learning. My pet project simply offered me a chance to see just how deep the vocal forest really goes.

Special thanks to Catalin Voss of HelloEllo, Nabil Ahmed of Stanford University, and Lily Zhou of Stanford University for all their contributions to this project.

And if you’d like to see the original paper, check out this link!
