From Hawking to Siri: The Evolution of Speech Synthesis
In 1985, physicist Stephen Hawking lost his ability to speak after an emergency tracheotomy, one of many procedures to manage his motor neuron disease, a condition that destroys the neurons controlling voluntary muscles. To help Hawking communicate, Dennis Klatt, a speech synthesis scientist who had pioneered text-to-speech technology, built him a machine, the Speech Plus CallText 5010, that allowed him to speak with other people.
“This system allowed me to communicate much better than I could before …Using this system, I have written a book, and dozens of scientific papers. I have also given many scientific and popular talks […] I think that this is in large part due to the quality of the speech synthesizer,” Hawking said.
Today, speech synthesizers help millions of people with disabilities listen to written words and communicate with their communities. They are also used in voice-activated assistants, navigation systems, and for generating speech from text in real time. On top of that, they have had a major influence on pop culture, shaping multiple genres of music and inspiring movies. With all their applications today, it can be difficult to remember that speech synthesizers as we know them started as a keyboard connected to a loudspeaker through a bunch of circuitry.
Here’s a look at how speech synthesis technology has evolved over the years.
Early Mechanical Speech Machines
For centuries, people have been fascinated by objects that sound like humans. In the Middle Ages, there were legends about brazen heads, mystical mechanisms said to answer questions, usually with a “yes” or “no.” In 1779, the first speech synthesizer model was designed and submitted to a contest by Christian Gottlieb Kratzenstein, a physics professor at Copenhagen University. His machine, built from acoustic resonators and a vibrating reed, could produce the five long vowels. Although Kratzenstein won the contest, his model was not practical since the speech anatomy he hypothesized was not correct.
In 1791, another inventor, Wolfgang von Kempelen, unveiled his Acoustic-Mechanical Speech Machine after years of research on human speech production. Because of an earlier controversy (von Kempelen had announced a chess-playing machine that secretly concealed a human player inside), he was not taken seriously at the time, but his machine went on to become the template for other speaking machines. One of these was built by Charles Wheatstone; it could produce vowels and consonants and even “speak” some full words. After watching a demonstration of a talking machine in London in 1848, Alexander Melville Bell and his son Alexander Graham Bell successfully built their own speaking machine.
By the beginning of the 20th century, the introduction of electrical devices shifted development toward electrical synthesizers. The first electrical device, made in 1922, could produce only vowel sounds. Other electrical synthesizers quickly followed and improved on it, but the most impressive was the VODER (Voice Operating Demonstrator), a keyboard-operated synthesizer designed in 1937 at Bell Laboratories by Homer Dudley. It was the first successful attempt to recreate human speech electronically. Dudley had previously created the vocoder (voice encoder), a device that analyzed speech into its fundamental tones and resonances, which could then drive a carrier signal to reconstruct an approximation of the original speech signal.
Today, vocoders appear in many technologies, but most popularly in music, where they produce synthetic, robotic sounds. For example, Kraftwerk used a vocoder on “Numbers,” a song credited as a major influence on genres like techno, new wave, and early hip-hop and rap.
A vocoder splits the modulator signal (usually a human voice) into many frequency bands and measures the level in each one. Those levels then set the gain of a matching bank of bandpass filters applied to the carrier, so the harmonic content that passes through resembles the spectral envelope of the modulator signal. In 1961, John Larry Kelly Jr. used a vocoder and an IBM 704 computer to make the machine sing “Daisy Bell,” a demonstration that later inspired a scene in 2001: A Space Odyssey.
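The band-splitting idea can be sketched in a few lines of code. The following is a minimal, illustrative channel vocoder, not any historical implementation: it measures the energy of the modulator and carrier in each frequency band of the spectrum and rescales the carrier's bands to match, assuming both signals are equal-length NumPy arrays.

```python
import numpy as np

def channel_vocoder(modulator, carrier, n_bands=16):
    """Impose the modulator's spectral envelope onto the carrier.

    Splits the spectrum into n_bands equal-width bands and rescales
    each carrier band so its energy matches the modulator's.
    """
    M = np.fft.rfft(modulator)
    C = np.fft.rfft(carrier)
    out = np.zeros_like(C)
    edges = np.linspace(0, len(M), n_bands + 1, dtype=int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        m_level = np.sqrt(np.mean(np.abs(M[lo:hi]) ** 2))        # modulator band level
        c_level = np.sqrt(np.mean(np.abs(C[lo:hi]) ** 2)) + 1e-12
        out[lo:hi] = C[lo:hi] * (m_level / c_level)              # rescale carrier band
    return np.fft.irfft(out, n=len(carrier))

# A decaying tone stands in for the voice; a sawtooth buzz is the carrier.
sr = 8000
t = np.arange(sr) / sr
modulator = np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)
carrier = 2 * ((110 * t) % 1.0) - 1
robot = channel_vocoder(modulator, carrier)
```

A real-time vocoder works on short overlapping frames rather than the whole signal at once, but the per-band analyze-and-rescale step is the same.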
Other Electronic Devices
The invention of the VODER ushered in a golden age of speech synthesis systems. In 1951, Franklin Cooper created the Pattern Playback, a machine that could reconvert recorded spectrogram patterns into sound. This paved the way for the formant synthesizers PAT (Parametric Artificial Talker) and OVE (Orator Verbis Electris). In 1972, John Holmes introduced his formant synthesizer, and its output was so good that listeners found it very hard to tell the synthesized version from the original.
Three years later, the first text-to-speech system was developed by Noriko Umeda and colleagues at the Electrotechnical Laboratory in Japan. In 1981, Dennis Klatt introduced his KlattTalk text-to-speech (TTS) system, which forms the basis of many synthesis systems today. When Stephen Hawking lost his speech after the emergency tracheotomy in 1985, it was Klatt who created the Speech Plus CallText 5010 synthesizer that gave him a voice. The famous Stephen Hawking voice is in fact Klatt's own: he used recordings of his voice while building the synthesizer. By this time, a substantial number of commercial speech synthesis systems were available, and two main techniques were used to develop text-to-speech systems:
Formant synthesis, which replicated sounds using models of the human vocal tract.
Concatenative synthesis, which assembled speech from a large database of recorded sound units.
Both of these techniques produced robotic, artificial-sounding voices that could not pass for human.
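Formant synthesis can be illustrated with a toy source-filter model. The sketch below is an illustration, not any commercial system's code: an impulse train stands in for glottal pulses and excites a cascade of two-pole resonators tuned to rough textbook formant frequencies for the vowel /a/; the exact frequency and bandwidth values are assumptions for demonstration only.

```python
import numpy as np

def resonator(x, freq, bw, sr):
    """Two-pole IIR resonator: boosts energy near freq (Hz) with
    bandwidth bw (Hz) -- the basic building block of formant synthesis."""
    r = np.exp(-np.pi * bw / sr)                 # pole radius sets the bandwidth
    a1 = 2 * r * np.cos(2 * np.pi * freq / sr)   # feedback coefficients
    a2 = -r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = x[n] + a1 * y1 + a2 * y2
    return y

sr, f0, dur = 16000, 120, 0.5
samples = int(sr * dur)
source = np.zeros(samples)
source[::sr // f0] = 1.0                         # impulse train ~ glottal pulses

vowel = source
for freq, bw in [(730, 90), (1090, 110), (2440, 170)]:  # rough /a/ formants
    vowel = resonator(vowel, freq, bw, sr)
vowel /= np.max(np.abs(vowel))                   # normalize to [-1, 1]
```

Changing the formant frequencies moves the synthesized sound toward other vowels, which is why formant synthesizers could cover a whole language with a small set of hand-tuned parameters.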
Deep Learning Based Speech Synthesis
These days, more sophisticated methods and algorithms are used in speech synthesis to generate more natural speech. One of them, hidden Markov model (HMM) based text-to-speech synthesis, generates an optimal sequence of acoustic parameters from subword HMMs. Although HMM synthesis is flexible at changing voice characteristics, the acoustic features it produces are over-smoothed, making the generated voice sound muffled. Deep learning synthesis, using neural networks trained on large amounts of labeled data, is also used for both speech recognition and speech synthesis.
Deep learning is a relatively new direction in machine learning. It can effectively capture the hidden internal structure of data and offers powerful modeling capabilities for characterizing it. The major difference from the HMM approach is that deep neural networks, which have proven extraordinarily efficient at learning the inherent features of data, can map linguistic features directly to acoustic features. Models based on the deep learning approach include restricted Boltzmann machines for speech synthesis, multi-distribution deep belief networks, deep bidirectional LSTM-based speech synthesis, and end-to-end speech synthesis.
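That "linguistic features in, acoustic features out" mapping can be sketched as a small feed-forward network. Everything below is a toy stand-in: the feature dimensions, the two-layer architecture, and the random inputs are illustrative assumptions, not a real TTS front end (production systems use hundreds of features per frame and far deeper models).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: per-frame linguistic features (phoneme identity, position,
# prosody flags, ...) map to acoustic features (spectral coefficients, F0).
LING_DIM, HIDDEN_DIM, ACOUSTIC_DIM = 32, 64, 25

# A two-layer perceptron: the deep-learning replacement for the
# decision-tree-plus-HMM mapping used in statistical parametric TTS.
W1 = rng.normal(0.0, 0.1, (LING_DIM, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(0.0, 0.1, (HIDDEN_DIM, ACOUSTIC_DIM))
b2 = np.zeros(ACOUSTIC_DIM)

def linguistic_to_acoustic(frames):
    """Map a (T, LING_DIM) batch of linguistic feature frames to a
    (T, ACOUSTIC_DIM) batch of acoustic feature frames."""
    hidden = np.tanh(frames @ W1 + b1)   # nonlinear hidden layer
    return hidden @ W2 + b2              # linear output: one acoustic frame per row

# Ten stand-in frames of linguistic features -> ten acoustic frames.
frames = rng.normal(size=(10, LING_DIM))
acoustic = linguistic_to_acoustic(frames)
```

In a trained system, the predicted acoustic frames would then drive a vocoder to produce the waveform; here the weights are random, so the output only demonstrates the shape of the mapping.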
Voice Synthesis Tech Today
Most people’s interactions with speech synthesis systems today are through voice assistants and, more recently, voice generators. Virtual assistants like Microsoft’s Cortana, Apple’s Siri, Google Assistant, and Amazon’s Alexa are used by millions of people to automate repetitive tasks, improve productivity, and support accessibility for people with disabilities. The speech of these assistants is usually generated with high-level machine learning algorithms, with the goal of producing synthesized speech as natural and intelligible as a human voice.
Speech synthesis has come a long way since Wolfgang von Kempelen developed his Acoustic-Mechanical Speech Machine. For one, the quality of synthesized speech has improved dramatically, and it keeps improving as more innovative technology is introduced. Google’s WaveNet, for example, is a generative model that uses deep neural networks to model the raw speech waveform directly, sample by sample. Tacotron 2, another Google project, uses neural networks trained only on speech examples and corresponding text transcripts to generate natural-sounding speech.