Move over Beethoven. This tutorial will use Python and Deepgram's speech-to-text API to play a piano with your voice. The song we’ll play is the first few phrases of Lady Gaga’s Bad Romance. It’s a simple piece in C Major, meaning no flats or sharps! We’ll only use the pitches C, D, E, F, G, A, and B, and no black keys. It's a great chance for someone learning the piano to play without a keyboard, using only the power of voice to make music!
When we run the project as a PyGame application, a window will appear, and the piano will play the song. We'll hear the notes, which also light up on the keyboard.
Let’s get started!
What We’ll Need to Play Voice-Controlled Music Using AI
This project was built on macOS, but it should also work on a Windows or Linux machine. We’ll use Python 3.10 along with FluidSynth and the Deepgram Python SDK for speech-to-text transcription.
FluidSynth
We need to install FluidSynth, a free, open-source software synthesizer that turns MIDI data into sound. MIDI, or Musical Instrument Digital Interface, is a protocol that allows musical gear like computers, software, and instruments to communicate with one another. FluidSynth uses SoundFont files to generate audio. These files contain samples of musical instruments, like a piano, that FluidSynth uses to render MIDI notes as sound.
There are various options to install FluidSynth on a Mac. In this tutorial, we’ll use Homebrew for the installation. After installing Homebrew, run this command anywhere in the terminal:
brew install fluidsynth
Now that FluidSynth is installed, let’s get our Deepgram API Key.
Deepgram API Key
We need to grab a Deepgram API Key from the console. It’s effortless to sign up and create an API Key in the Deepgram console. Deepgram is an AI automated speech recognition company that allows us to build applications that transcribe speech to text. We’ll use Deepgram’s Python SDK and the Numerals feature, which converts a number from written format to numerical format. For example, if we say the number “three”, it appears in our transcript as “3”.
One of the many reasons to choose Deepgram over other providers is that we build better voice applications with faster, more accurate transcription through AI Speech Recognition. We offer real-time transcription and pre-recorded speech-to-text. The latter allows uploading a file that contains audio voice data for transcribing.
Now that we have our Deepgram API Key, let’s set up our Python AI piano project so we can start making music!
Create a Python Virtual Environment
Make a directory called play-piano to hold our project. Inside it, create a new file called piano-with-deepgram.py, which will hold the main code for the project.
We need to create a virtual environment and activate it so we can pip install our Python packages. We have a more in-depth article about virtual environments on our Deepgram Developer blog.
Activate the virtual environment after it’s created and install the following Python packages from the terminal:
pip install deepgram-sdk python-dotenv mingus pygame sounddevice scipy
Let’s go through each of the Python packages.
deepgram-sdk is the Deepgram Python SDK, which allows us to transcribe speech audio, or voice, to a text transcript.
python-dotenv helps us work with environment variables and our Deepgram API Key, which we’ll pull from the .env file.
mingus is a package for Python used by programmers and musicians to make and play music.
pygame is an open-source Python library that helps us make games or other multimedia applications.
sounddevice helps get audio from our device’s microphone and records it as a NumPy array.
scipy helps write the NumPy array into a WAV file.
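Before moving on, it can help to confirm that every package landed in the active virtual environment. The sketch below is not part of the original project; it assumes the usual import names for each pip package (for example, python-dotenv is imported as dotenv) and only checks that the modules can be found, without importing them.

```python
from importlib.util import find_spec

# pip package name -> module name used in an import statement (assumed mapping)
REQUIRED = {
    "deepgram-sdk": "deepgram",
    "python-dotenv": "dotenv",
    "mingus": "mingus",
    "pygame": "pygame",
    "sounddevice": "sounddevice",
    "scipy": "scipy",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All packages found.")
```

If anything is reported missing, the virtual environment is probably not activated in the current terminal.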
We also need to download a few files, including keys.png, which is the image of the piano GUI. The other file we need is the Yamaha-Grand-ios-v1.2 SoundFont from this site. A SoundFont contains samples of musical instruments; in our case, we need a piano sound.
The Code to Play Voice-Controlled Music with Python and AI
We’ll only cover the Deepgram code in this section but will provide the entire code for the project at the end of this post.
Deepgram Python Code Explanation
This line of code prompts the user to name the audio file, which will be saved in .wav format:
Once the file name is set, the function record_song_with_voice gets called inside the get_deepgram_transcript function.
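A minimal sketch of that prompt might look like the following. The helper names wav_filename and ask_for_audio_file and the prompt text are hypothetical; the project's own line may differ.

```python
def wav_filename(name):
    """Turn whatever the user typed into a file name ending in .wav."""
    name = name.strip()
    return name if name.lower().endswith(".wav") else f"{name}.wav"


def ask_for_audio_file(ask=input):
    """Prompt for a name; `ask` defaults to input() and can be swapped out in tests."""
    return wav_filename(ask("Name your audio file: "))
```

In the script, we could then write AUDIO_FILE = ask_for_audio_file() near the top so the rest of the code can refer to the same file name.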
Inside the record_song_with_voice function, this line records the audio.
Here, duration is the length of the recording in seconds, and fs is the sampling frequency. We set both of these as constants near the top of the code.
Then we write the voice recording to an audio file using the .write() method. That line of code looks like this:
Once the file is done writing, the message "Finished.....Please check your output file" prints to the terminal, which means the recording is complete.
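Put together, the recording step could be sketched as below. This is an approximation under stated assumptions, not the project's exact code: the values of FS and DURATION, the single input channel, and the filename parameter are all our choices. It relies on sounddevice's rec() and wait() and on scipy.io.wavfile.write().

```python
FS = 44100      # sampling frequency in Hz (assumed value)
DURATION = 30   # recording length in seconds (assumed value)


def frames_needed(duration, fs):
    """Number of audio frames in a recording of `duration` seconds."""
    return int(duration * fs)


def record_song_with_voice(filename, duration=DURATION, fs=FS):
    """Record from the default microphone and save the result as a WAV file."""
    # Imported here so the rest of the script still loads on a machine
    # without audio support installed.
    import sounddevice as sd
    from scipy.io.wavfile import write

    # rec() returns immediately and fills a NumPy array in the background.
    recording = sd.rec(frames_needed(duration, fs), samplerate=fs, channels=1)
    sd.wait()  # block until the recording is finished
    write(filename, fs, recording)
    print("Finished.....Please check your output file")
```

A shorter duration is handy while testing, since the function blocks for the full recording time.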
The function get_deepgram_transcript is where most of the magic happens. Let’s walk through the code.
Here we initialize the Deepgram Python SDK. That’s why it’s essential to grab a Deepgram API Key from the console.
We store our Deepgram API Key in a .env file like so:
The abc123 represents the API Key Deepgram assigns us.
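As a rough sketch, reading the key could look like this. The variable name DEEPGRAM_API_KEY and both helper names are assumptions for illustration; use whatever name appears in your own .env file. load_dotenv() comes from python-dotenv.

```python
import os


def read_api_key(env=None, name="DEEPGRAM_API_KEY"):
    """Return the key from the environment, or fail with a readable message."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to the .env file")
    return key


def load_api_key():
    """Load .env into the environment, then read the key (hypothetical helper)."""
    from dotenv import load_dotenv  # provided by python-dotenv

    load_dotenv()  # looks for a .env file and copies its entries into os.environ
    return read_api_key()
```

Failing early with a clear message is easier to debug than an authentication error coming back from the API later on.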
Next, we call the external function record_song_with_voice(), which allows us to record our voice and create a .wav file that will pass into Deepgram as pre-recorded audio.
Finally, we open the newly created audio file in binary format for reading. We provide key/value pairs for buffer and mimetype using a Python dictionary. The buffer’s value is audio, the file object we bound in the line with open(AUDIO_FILE, "rb") as audio:. The mimetype value is audio/wav, the file format we’re using and one of the 40+ file formats that Deepgram supports. We then call Deepgram and perform a pre-recorded transcription in this line: response = await deepgram.transcription.prerecorded(source, {"punctuate": True, "numerals": True}). We pass in the numerals parameter so that any number we say is transcribed in numeric form.
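The steps above can be sketched as a small async function. The function names are hypothetical, and the deepgram argument is assumed to be a client object exposing the transcription.prerecorded coroutine quoted above; as far as we know, that call shape matches the older 2.x releases of the SDK, so a newer SDK version may need a different call.

```python
import asyncio

# Options quoted in the walkthrough: punctuation plus numeric formatting.
OPTIONS = {"punctuate": True, "numerals": True}


async def transcribe_wav(deepgram, audio_path):
    """Send a WAV file to Deepgram as pre-recorded audio and return the response."""
    with open(audio_path, "rb") as audio:
        # The file must stay open while the request is awaited.
        source = {"buffer": audio, "mimetype": "audio/wav"}
        return await deepgram.transcription.prerecorded(source, OPTIONS)


def transcribe_wav_blocking(deepgram, audio_path):
    """Convenience wrapper for scripts that are not already async."""
    return asyncio.run(transcribe_wav(deepgram, audio_path))
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before spending API credits.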
The last bit of code to review is the get_note_data function, doing precisely that: getting the note data.
We have a Python dictionary with keys from ‘1’ to ‘7’, each corresponding to a note in the C Major scale. For example, saying the number 1 plays the ‘C’ note, saying the number 2 plays the ‘D’ note, and so on:
Here’s how that would look on a piano. Each note in C Major is labeled, and located above is a corresponding number. The numbers 1 - 7 are critical, each representing a single note in our melody.
Next, we get the numerals from the Deepgram pre-recorded transcript with the line get_numbers = await get_deepgram_transcript().
We then create an empty list called data and check if there are any results in the parsed response we get back from Deepgram. If results exist, we get that result and store it in data:
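A sketch of that check is shown below. It assumes the response is a dictionary whose word list sits under results, channels, alternatives, and words, which is how Deepgram's pre-recorded responses are laid out. The sample response is hand-written for illustration, with made-up timings and confidences, not real API output.

```python
def words_from_response(response):
    """Return the list of word entries, or an empty list if there are no results."""
    data = []
    if "results" in response:
        data = response["results"]["channels"][0]["alternatives"][0]["words"]
    return data


# Illustrative response only: the numbers below are invented for the example.
sample_response = {
    "results": {
        "channels": [{
            "alternatives": [{
                "transcript": "1 2 3",
                "words": [
                    {"word": "1", "start": 0.5, "end": 0.9, "confidence": 0.99},
                    {"word": "2", "start": 1.2, "end": 1.6, "confidence": 0.98},
                    {"word": "3", "start": 1.9, "end": 2.3, "confidence": 0.97},
                ],
            }]
        }]
    }
}

data = words_from_response(sample_response)
print([entry["word"] for entry in data])
```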
Example output may look like the below, depending on which song we create.
We notice that the word key in the above response correlates to a numeral we speak into the microphone when recording the song.
We can now create a new list that maps each numeral to a note on the piano, using a list comprehension: return [note_dictonary[x['word']] for x in data].
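The mapping can be sketched as follows. The dictionary values are assumed to be plain note letters, and the identifier keeps the spelling used in the post. Unlike the comprehension quoted above, this variant skips any word that is not a numeral from 1 to 7 instead of raising a KeyError, which is our addition rather than the project's behavior.

```python
# Numerals 1-7 mapped to the notes of the C Major scale (values assumed).
note_dictonary = {
    "1": "C",
    "2": "D",
    "3": "E",
    "4": "F",
    "5": "G",
    "6": "A",
    "7": "B",
}


def notes_from_words(data):
    """Map each transcribed numeral to its note, ignoring any other words."""
    return [note_dictonary[x["word"]] for x in data if x["word"] in note_dictonary]


# The first phrase of the melody, written as the words Deepgram might return.
phrase = [{"word": digit} for digit in "12314"]
print(notes_from_words(phrase))
```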
To run the project, we’ll need all the code. See the end of this post.
Then in our terminal, we can run the project by typing:
python3 piano-with-deepgram.py
Now, use our voice to say the following numerals, which correspond to piano notes, to play the first few phrases from Lady Gaga’s song Bad Romance:
12314 3333211 12314 3333211
Next Steps to Extend the Voice-Controlled Python AI Music Example
Congratulations on getting to the end of the tutorial! We encourage you to try and extend the project to do the following:
Play around with the code to play songs in different octaves
Play voice-controlled music that has flats and sharps
Tweak the code to play voice-controlled music using whole notes and half notes
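As a starting point for the first two ideas, here is a small sketch that converts a note name and octave into a MIDI note number. It is not part of the project; the function name is hypothetical, and it assumes the common convention where middle C is C4 and has MIDI number 60.

```python
# Semitone offset of each natural note above C.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def midi_number(note, octave=4):
    """Convert a note such as 'C', 'C#', or 'Bb' plus an octave to a MIDI number."""
    letter, accidentals = note[0].upper(), note[1:]
    # Each '#' raises the pitch by one semitone, each 'b' lowers it by one.
    offset = accidentals.count("#") - accidentals.count("b")
    return 12 * (octave + 1) + SEMITONES[letter] + offset


print(midi_number("C", 4))   # middle C under the assumed convention
print(midi_number("C#", 5))  # a black key one octave higher
```

With a helper like this, spoken numerals could be paired with an octave or an accidental before the note is handed to the synthesizer.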
When you have your new masterpiece, please send us a Tweet at @DeepgramAI and showcase your work!
The Entire Python Code for the Voice-Controlled Music Example
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.