Developing an Artificially Intelligent Voice: A Brief History of Text-to-Speech
Across apps like TikTok and Instagram Reels, users can find videos of former U.S. president Barack Obama singing the lyrics of "Shape of You" by Ed Sheeran or Formula 1 driver Charles Leclerc belting a song by Taylor Swift. Neither cover actually happened; with the rapid advancement of AI technology, however, videos featuring voice cloning can now be easily created by the general public. While some of these videos are hilarious, others are more nefarious, like a fake voice recording of a Progressive Party leader in Slovakia discussing a plan to rig a vote, or alleged "leaked" recordings of Omar al-Bashir, a former leader of Sudan.
At the same time, the public interacts with text-to-speech (TTS) systems every day, whether asking Siri to look up a topic or Alexa to play some music. But what is the difference between a text-to-speech system and voice cloning?
Traditional TTS technology converts written text into spoken words. Earlier systems relied on three synthesis techniques: articulatory synthesis, formant synthesis, and concatenative synthesis. Articulatory synthesis mimics how humans produce sounds by modeling the movements of the lips, glottis, and other articulators, but it is difficult to implement because articulatory behavior is complex to model and the data needed to model it is hard to obtain. Formant synthesis produces speech from a set of linguistic rules combined with the source-filter model, a simplified theory that describes speech production as a two-stage process: a sound source (the vibrating vocal folds) shaped by a filter (the vocal tract). This method does not require extensive linguistic data or recordings, but as a result it can sound artificial. Finally, concatenative synthesis depends on a large database of speech fragments recorded by voice actors; it breaks these fragments into smaller units and stitches them together to produce natural-sounding speech. Because of the need for a large corpus, however, there is less variety in the speech styles that can be mimicked, and the output can sound flat and unexpressive. A more evolved version of these techniques, statistical parametric speech synthesis (SPSS), later emerged, using statistical models to generate speech.
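The source-filter idea behind formant synthesis can be sketched in a few lines of Python: a periodic glottal pulse train (the source) is passed through a cascade of second-order resonators tuned to formant frequencies (the filter). This is a toy illustration, not production code, and the formant and bandwidth values are rough textbook-style figures for an /a/-like vowel, chosen purely for demonstration:

```python
import numpy as np

def apply_resonator(x, freq, bandwidth, sr):
    """Second-order IIR resonator: models one formant of the vocal-tract filter."""
    r = np.exp(-np.pi * bandwidth / sr)          # pole radius (r < 1, stable)
    a1 = 2 * r * np.cos(2 * np.pi * freq / sr)   # feedback coefficients
    a2 = -r ** 2
    y = np.zeros_like(x)
    for i in range(len(x)):
        y[i] = ((1 - r) * x[i]
                + a1 * (y[i - 1] if i >= 1 else 0.0)
                + a2 * (y[i - 2] if i >= 2 else 0.0))
    return y

def synthesize_vowel(f0=120, formants=((700, 130), (1220, 70), (2600, 160)),
                     duration=0.5, sr=16000):
    """Toy formant synthesis: glottal pulse train (source) run through a
    cascade of formant resonators (filter)."""
    n = int(duration * sr)
    source = np.zeros(n)
    source[::int(sr / f0)] = 1.0                 # periodic glottal pulses
    speech = source
    for freq, bw in formants:
        speech = apply_resonator(speech, freq, bw, sr)
    return speech / np.max(np.abs(speech))       # normalize to [-1, 1]

audio = synthesize_vowel()                       # 0.5 s of an /a/-like vowel
```

Played through a sound device, the result is a buzzy, robotic vowel, which is exactly the artificial quality this synthesis family is known for.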
Modern TTS systems, by contrast, leverage deep learning techniques, specifically neural networks, to generate more natural and human-like voices. These systems, built on deep neural networks (DNNs) and often trained end to end, can capture context, intonation, and emotional cues to produce speech that closely mimics human intonation and rhythm. The release of WaveNet by DeepMind in 2016, for example, marked a significant leap forward, producing speech that can be almost indistinguishable from a human voice. WaveNet models raw audio waveforms directly, generating speech one sample at a time with a stack of dilated causal convolutions, resulting in more natural and lifelike output. TTS systems are now integral to assistive devices, helping visually impaired individuals access written content, and to educational tools, where they facilitate language learning and literacy.
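WaveNet's key building block, the dilated causal convolution, can be illustrated with a toy NumPy implementation (a sketch of the idea, not DeepMind's code): each output sample depends only on the current and past inputs, and stacking layers with doubling dilation rates grows the receptive field exponentially, which is how the model sees long stretches of audio cheaply.

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution with dilation: output i depends only on x[i]
    and samples `dilation` steps in the past, never on future samples."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])   # left-pad to stay causal
    y = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            y[i] += kernel[j] * xp[pad + i - j * dilation]
    return y

def wavenet_stack(x, dilations=(1, 2, 4, 8)):
    """Stacking two-tap layers with doubling dilations: the receptive field
    here is 1 + 1 + 2 + 4 + 8 = 16 samples from just four layers."""
    kernel = np.array([0.5, 0.5])
    y = x
    for d in dilations:
        y = causal_dilated_conv(y, kernel, d)
    return y
```

Causality is the property that lets WaveNet generate audio autoregressively: editing a future input sample can never change an earlier output sample.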
Voice Cloning: An even more robust version of TTS
Voice cloning is a more specialized application of speech generation technology that aims to replicate a specific individual's voice. It involves capturing the unique characteristics of a person's voice, such as pitch, tone, and accent, from a few speech samples, building a model of that voice, and then using the model to generate new speech in the person's voice. The model can produce speech that sounds like the original speaker, even saying words or sentences the person never actually recorded. Thus, voice cloning can be more extensive and complex in its implementation than many TTS systems.
Like modern TTS, voice cloning relies on advanced machine learning techniques to achieve high levels of realism. Two common approaches are speaker adaptation and speaker encoding: the first fine-tunes an existing multi-speaker model on audio-text pairs from the target speaker, while the second directly estimates a speaker embedding, a vector representation unique to the speaker, from the audio itself. Both techniques become more effective with stronger multi-speaker models and larger collections of audio samples.
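The speaker-encoding idea can be sketched in miniature: pool per-frame features of an utterance into one fixed-size vector, then compare speakers by cosine similarity. A real encoder learns its features with a neural network trained on many speakers; the hand-crafted features below (log-energy and zero-crossing rate) are crude stand-ins chosen purely for illustration:

```python
import numpy as np

def frame_features(audio, frame=400, hop=160):
    """Per-frame log-energy and zero-crossing rate: toy stand-ins for the
    learned spectral features a real speaker encoder would use."""
    feats = []
    for start in range(0, len(audio) - frame, hop):
        w = audio[start:start + frame]
        log_e = np.log(np.sum(w ** 2) + 1e-8)
        zcr = np.mean(np.abs(np.diff(np.sign(w)))) / 2
        feats.append([log_e, zcr])
    return np.array(feats)

def speaker_embedding(audio):
    """Speaker encoding in miniature: pool frame features into one
    fixed-size, unit-length vector representing the speaker."""
    f = frame_features(audio)
    emb = np.concatenate([f.mean(axis=0), f.std(axis=0)])
    return emb / (np.linalg.norm(emb) + 1e-8)

def similarity(a, b):
    """Cosine similarity between two utterances' speaker embeddings."""
    return float(np.dot(speaker_embedding(a), speaker_embedding(b)))
```

With a trained encoder in place of `frame_features`, the same embedding can condition a multi-speaker synthesizer, which is how speaker encoding clones a voice from only a few samples.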
The video below provides a humorous example of voice cloning in action. It depicts former presidents Barack Obama and Donald Trump playing a video game together.
Voice Transfer: The oft-overlooked branch of TTS
The act of voice transfer is a bit more specific. The idea is this: given a pre-recorded audio of a particular person’s speech, the AI should be able to turn that person’s voice into someone else’s while maintaining the original pace, rhythm, and intonation of the initial recording.
One simple example of Voice Transfer can be found in this article, in which the author walks through a research project he conducted. The project entails building an AI that can transform children’s voices into the adults they will grow up to become.
Note that the code and overall technique in that article are a mere prototype. Other voice-transfer systems exist that produce far superior results to the one described above.
Of course, the inverse model also exists: one that transforms an adult's recording into the voice he or she had as a child. But this is merely the tip of the voice-transfer iceberg. Current applications include gender-swapping models, both MtF and FtM, as well as general voice-transfer models that can take any starting voice and any target voice as input. AutoVC is one such AI.
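The core promise of voice transfer, changing the timbre while preserving the original timing, can be sketched with short-time Fourier frames: keep the source recording's phase and frame positions (its pace and rhythm) but reshape each frame's magnitude spectrum toward a target speaker's average spectrum. This is a deliberately crude illustration, nothing like AutoVC's learned encoder-decoder, but it shows why the output keeps the source's rhythm exactly:

```python
import numpy as np

def average_spectrum(audio, frame=512, hop=256):
    """A target speaker's 'timbre profile': mean magnitude spectrum over frames."""
    win = np.hanning(frame)
    mags = [np.abs(np.fft.rfft(audio[s:s + frame] * win))
            for s in range(0, len(audio) - frame, hop)]
    return np.mean(mags, axis=0)

def convert_voice(source, target_profile, frame=512, hop=256):
    """Keep the source's phase and frame timing (pace, rhythm) but replace
    each frame's magnitude with the target profile, scaled to the frame's
    energy, then overlap-add back into a waveform of the same length."""
    win = np.hanning(frame)
    out = np.zeros(len(source))
    norm = np.zeros(len(source))
    for s in range(0, len(source) - frame, hop):
        spec = np.fft.rfft(source[s:s + frame] * win)
        mag, phase = np.abs(spec), np.angle(spec)
        scale = np.sum(mag) / (np.sum(target_profile) + 1e-8)
        new_spec = target_profile * scale * np.exp(1j * phase)
        out[s:s + frame] += np.fft.irfft(new_spec, frame) * win
        norm[s:s + frame] += win ** 2
    return out / np.maximum(norm, 1e-8)
```

Because every output frame sits at the same position as its source frame, the converted audio has the same length and rhythm as the input; only the spectral coloring changes.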
Applications of Text-to-Speech
Both TTS systems and voice cloning share common steps in processing input (text or speech for cloning), analyzing it, and then synthesizing speech output. The advancements in neural network architectures, like convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been pivotal in improving the quality, naturalness, and efficiency of both technologies.
Voice cloning, given its capacity to replicate specific voices, is particularly useful in personalized advertising, dubbing for movies or video games, and as a tool that lets individuals who have lost the ability to speak communicate in a voice resembling their own. While text-to-speech systems and voice cloning share foundational technology in speech synthesis, they diverge significantly in their objectives, implementation complexities, and ethical implications. TTS offers broad applications in making digital content more accessible and interactive, whereas voice cloning offers highly personalized speech synthesis with unique applications and significant ethical considerations.