Tutorial and Demo: How to Build a Voice AI Video Game in Python
In this article, we are going to transform an ordinary platformer game into one that can be controlled by your voice using Deepgram’s API. The focus here is more on how Deepgram can be integrated into the mechanics of a game and less on how to program a platformer in Python.
The game we will be working with is an infinite scrolling 2D platformer, where the player is initially controlled with a single key: the space bar. The length of the space bar press determines the height and distance of the player’s jump. The game features platforms for the player to land on, and the player loses if they fall between the gaps in the platforms.
The base game itself is a Python/Pygame rewrite of a side project I worked on in JavaScript. We are not going to dive into the details of how the game is written, but rather provide a high-level overview of the different components involved in the script.
Base Game Overview
There are three classes in the game: Player, Platform, and Game. To start, we will define a couple of constants and import the required libraries.
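As a rough sketch (the constants themselves are listed in the "Putting it All Together" section), the top of the file starts with:

```python
# Imports the base game needs; the constants are defined further down.
import random

import pygame
```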
The Player Class
The Player class is responsible for handling and updating all things related to the character on the screen — represented as a cube.
The update() method updates the player’s position based on its velocity, applies gravity, checks collisions with platforms, and finally handles the jumping (when the spacebar is released) and charging (when the spacebar is pressed down) states.
The draw() method handles the player’s visuals: it draws the cube on the game window and changes its appearance while charging, giving a squeeze-and-glow effect.
The class is shown below:
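(What follows is a simplified sketch rather than the original code; the cube size and the MAX_CHARGE, JUMP_FACTOR, and FORWARD_FACTOR constants are placeholders defined later.)

```python
class Player:
    """A simplified sketch of the cube the player controls."""

    def __init__(self, x, y):
        self.rect = pygame.Rect(x, y, 30, 30)
        self.vel_x = 0
        self.vel_y = 0
        self.charging = False   # True while the spacebar is held down
        self.charge = 0         # how long the spacebar has been held
        self.jumping = False
        self.on_ground = False

    def update(self, platforms):
        # Apply gravity and move by the current velocity.
        self.vel_y += GRAVITY
        self.rect.x += self.vel_x
        self.rect.y += self.vel_y

        # Land on top of a platform when falling onto it.
        self.on_ground = False
        for platform in platforms:
            if self.rect.colliderect(platform.rect) and self.vel_y > 0:
                self.rect.bottom = platform.rect.top
                self.vel_y = 0
                self.vel_x = 0
                self.on_ground = True
                self.jumping = False

        # Build up charge while the spacebar is held on the ground.
        if self.charging and self.on_ground:
            self.charge = min(self.charge + 1, MAX_CHARGE)

    def start_charge(self):
        if self.on_ground:
            self.charging = True

    def release_jump(self):
        # The stored charge sets both the height and distance of the jump.
        if self.charging and self.on_ground:
            self.vel_y = -self.charge * JUMP_FACTOR
            self.vel_x = self.charge * FORWARD_FACTOR
            self.jumping = True
        self.charging = False
        self.charge = 0

    def draw(self, screen):
        rect = self.rect.copy()
        if self.charging:
            # Squash the cube and tint it to show the charge building up.
            rect.height = max(10, 30 - self.charge // 3)
            rect.bottom = self.rect.bottom
            color = (255, 255 - min(self.charge * 5, 200), 0)
        else:
            color = (255, 200, 0)
        pygame.draw.rect(screen, color, rect)
```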
The Platform Class
The Platform class deals with everything about the platforms the player jumps on — they come in three flavors: static, moving, and sinking.
The update() method is where the magic happens for moving and sinking platforms. For moving platforms, it updates their horizontal position and flips their direction when they hit their movement limits. Sinking platforms start their descent after a short delay when the player lands on them.
The draw() method slaps the platform onto the game window, using different colors to distinguish between the three types — grey for static, blue for moving, and red for sinking platforms.
Here’s the Platform class.
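(Again, a simplified sketch stands in for the original code; the movement speed, sink delay, and exact colors below are illustrative.)

```python
class Platform:
    """A simplified sketch of a platform: static, moving, or sinking."""

    def __init__(self, x, y, width, kind="static"):
        self.rect = pygame.Rect(x, y, width, 20)
        self.kind = kind
        self.direction = 1        # horizontal direction for moving platforms
        self.offset = 0           # distance travelled from the spawn point
        self.move_limit = 100     # how far a moving platform travels each way
        self.sink_timer = None    # starts counting down once the player lands

    def update(self, player):
        if self.kind == "moving":
            # Slide back and forth, flipping direction at the movement limits.
            self.rect.x += self.direction * 2
            self.offset += self.direction * 2
            if abs(self.offset) >= self.move_limit:
                self.direction *= -1
        elif self.kind == "sinking":
            # Start sinking a short delay after the player lands on the platform.
            if self.sink_timer is None and player.on_ground and player.rect.bottom == self.rect.top:
                self.sink_timer = 30   # delay in frames before sinking begins
            if self.sink_timer is not None:
                self.sink_timer -= 1
                if self.sink_timer <= 0:
                    self.rect.y += 2

    def draw(self, screen):
        # Grey for static, blue for moving, red for sinking platforms.
        colors = {"static": (128, 128, 128), "moving": (0, 100, 255), "sinking": (220, 60, 60)}
        pygame.draw.rect(screen, colors[self.kind], self.rect)
```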
The Game Class
The Game class ties everything together, orchestrating the whole show and keeping all the game elements in check.
The update() method is the heart of the game loop. It updates the player and all platforms, shifts the world when the player moves past the middle of the screen, kicks out off-screen platforms, spawns new ones, and checks if the player has fallen to their doom.
The draw() method is responsible for putting everything on the screen. It draws all platforms, the player, and the UI stuff like score and difficulty level.
The generate_platform() method is the platform factory, cranking out new platforms with random properties based on constants defined in the script as the game progresses. It’s where the difficulty setting comes into play, determining how likely you are to get those tricky moving or sinking platforms.
The run() method manages the main game loop. It handles events (like quitting or jumping), updates the game state, draws everything, and keeps the whole thing running at a smooth 60 frames per second.
Here is the Game class.
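(As before, this is a condensed sketch of the idea rather than the original code; the platform-generation numbers and weights are placeholders, and the real script scales them with difficulty.)

```python
class Game:
    """A condensed sketch of the Game class."""

    def __init__(self):
        pygame.init()
        self.screen = pygame.display.set_mode((SCREEN_WIDTH, SCREEN_HEIGHT))
        self.clock = pygame.time.Clock()
        self.player = Player(100, 300)
        self.platforms = [Platform(50, 400, 250)]
        self.score = 0
        self.running = True

    def update(self):
        self.player.update(self.platforms)
        for platform in self.platforms:
            platform.update(self.player)

        # Scroll the world left once the player passes the middle of the screen.
        if self.player.rect.x > SCREEN_WIDTH // 2:
            shift = self.player.rect.x - SCREEN_WIDTH // 2
            self.player.rect.x -= shift
            for platform in self.platforms:
                platform.rect.x -= shift

        # Drop platforms that scrolled off-screen and count them toward the score.
        before = len(self.platforms)
        self.platforms = [p for p in self.platforms if p.rect.right > 0]
        self.score += before - len(self.platforms)

        # Keep a steady supply of upcoming platforms.
        while len(self.platforms) < 6:
            self.platforms.append(self.generate_platform())

        # Falling below the window ends the game.
        if self.player.rect.top > SCREEN_HEIGHT:
            self.running = False

    def generate_platform(self):
        # Random gap, width, and type (the real script scales these with difficulty).
        last = self.platforms[-1]
        x = last.rect.right + random.randint(60, 160)
        kind = random.choices(["static", "moving", "sinking"], weights=[6, 2, 2])[0]
        return Platform(x, random.randint(300, 450), random.randint(80, 200), kind)

    def draw_ui(self):
        font = pygame.font.SysFont(None, 24)
        score_text = font.render(f"Score: {self.score}", True, (255, 255, 255))
        self.screen.blit(score_text, (10, 10))

    def draw(self):
        self.screen.fill((20, 20, 30))
        for platform in self.platforms:
            platform.draw(self.screen)
        self.player.draw(self.screen)
        self.draw_ui()
        pygame.display.flip()

    def run(self):
        while self.running:
            for event in pygame.event.get():
                if event.type == pygame.QUIT:
                    self.running = False
                elif event.type == pygame.KEYDOWN and event.key == pygame.K_SPACE:
                    self.player.start_charge()
                elif event.type == pygame.KEYUP and event.key == pygame.K_SPACE:
                    self.player.release_jump()
            self.update()
            self.draw()
            self.clock.tick(FPS)
        pygame.quit()
```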
Putting it All Together
Running the game is as simple as initializing the Game class and calling the run method. But we also need to define some constants used by the classes at the top, which control everything from the window size to the gravity to various platform generation parameters. Here’s what it would look like:
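(A minimal version using the placeholder names from the sketches above; the values are starting points, not the originals.)

```python
# Placeholder constants used by the sketches above; tweak them to taste.
SCREEN_WIDTH = 800      # window width in pixels
SCREEN_HEIGHT = 600     # window height in pixels
FPS = 60                # target frame rate
GRAVITY = 0.5           # downward acceleration applied each frame

MAX_CHARGE = 60         # cap on how long a jump can be charged
JUMP_FACTOR = 0.25      # converts charge into upward velocity
FORWARD_FACTOR = 0.15   # converts charge into forward velocity

if __name__ == "__main__":
    Game().run()
```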
The only dependency here is Pygame. With that installed, you can run the game, play around with the parameters yourself, and see how long you can last jumping across the infinite stretch of platforms (I got to 205!).
Adding the Deepgram Audio Magic
Here’s where the meat of the article comes in: implementing the audio controls. For Python, I’ve found that instead of using the streaming feature, which would be convenient, sticking to good old-fashioned transcription of an audio file is the most reliable. I experienced countless package and audio detection issues that were not necessarily related to Deepgram but rather to my audio inputs and Python packages.
The audio “controls” are based on the number of times a trigger word is spoken; that count determines the jump force of the player.
For example, if the trigger word is “jump”, then the player would be sent flying if “jump” was repeated 10 times, while a tiny jump would be executed if it was only heard once.
Here’s how the audio mechanism is going to fit in the game:
Listening for audio: the game will listen for any audio from the input source (presumably a microphone) on a separate thread while the main game is running.
Audio detection: When the player starts speaking, audio will be detected, and if the volume is above a certain threshold, the script starts recording. The recording continues until a period of silence is detected, at which point it stops.
Post processing: The audio is amplified and saved into an audio file.
Speech to text: The file is then sent through Deepgram’s API for transcription, and a list of transcribed words is returned.
Command interpretation: The game looks at the text and counts the number of consecutive “jumps” (or whatever the trigger word is set to) present in the transcription.
Command execution: The game will then translate the frequency of the trigger word into how high/far the player will jump, then the jump is executed. While the player is in the air, audio recording is disabled until the player lands to avoid conflicting inputs.
We define a new class, AudioProcessor, that inherits from the threading.Thread class, which allows it to run in the background on a separate thread and avoid blocking the main game loop.
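A skeleton for such a class might look like this (the attribute names are illustrative):

```python
import threading
import time


class AudioProcessor(threading.Thread):
    """Background thread that listens for voice commands."""

    def __init__(self, game):
        super().__init__(daemon=True)   # a daemon thread exits with the main program
        self.game = game                # reference back to the Game instance
        self.running = True             # flag used to stop the thread cleanly

    def stop(self):
        self.running = False
```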
The meat of the class lies in the run method. It continuously checks if the player is ready for audio input. When ready, it records audio, transcribes it, counts jumps, and triggers the player’s jump action.
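A sketch of that run method, wired to the helper methods described in the rest of this section, could look like this:

```python
    def run(self):
        while self.running:
            player = self.game.player
            # Only listen while the player is idle on a platform.
            if not player.jumping and not player.charging:
                audio = self.record_audio()
                if audio:
                    self.save_audio(audio)
                    transcript = self.transcribe_audio()
                    if transcript:
                        jumps = self.count_consecutive_jumps(transcript)
                        self.game.update_transcript(transcript, jumps)
                        if jumps > 0:
                            player.audio_jump(jumps)
            time.sleep(0.1)   # keep the thread from hogging the CPU
```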
Here’s what’s happening:
Player State Check: It first checks if the player is not jumping or charging. This prevents audio commands from interfering with ongoing actions.
Audio Recording: If the player is ready, it calls record_audio() to capture audio from the microphone.
Audio Saving: The recorded audio is immediately saved to a file using save_audio().
Transcription: The audio file is sent to Deepgram for transcription using transcribe_audio().
Command Interpretation: If transcription is successful, it counts the number of consecutive “jump” commands using count_consecutive_jumps().
Game State Update: The transcript and jump count are sent to the game to update its state.
Player Action: The audio_jump() method of the player is called with the jump count, triggering the jump action.
Thread Management: A small sleep is added to prevent the thread from consuming too much CPU.
The record_audio method will handle the logic for detecting speech, starting the recording, and ending the recording when silence is detected. The method cleverly uses a double-ended queue to store small chunks of audio captured just before any speech is detected, which avoids cutting off the beginning of a phrase. The recorded audio is amplified before being returned.
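A sketch of that logic, built on PyAudio and the standard library’s audioop module (the threshold constants are placeholders defined at the end of the article), might look like this:

```python
    def record_audio(self):
        # Assumes: import pyaudio, audioop; from collections import deque
        pa = pyaudio.PyAudio()
        stream = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                         input=True, frames_per_buffer=CHUNK_SIZE)

        pre_buffer = deque(maxlen=PRE_BUFFER_CHUNKS)  # audio from just before speech
        frames = []
        recording = False
        silent_chunks = 0

        while self.running:
            chunk = stream.read(CHUNK_SIZE, exception_on_overflow=False)
            volume = audioop.rms(chunk, 2)   # rough loudness of this chunk

            if not recording:
                pre_buffer.append(chunk)
                if volume > VOLUME_THRESHOLD:
                    # Speech detected: keep the pre-buffer so the first word isn't clipped.
                    frames = list(pre_buffer)
                    recording = True
            else:
                frames.append(chunk)
                silent_chunks = silent_chunks + 1 if volume < VOLUME_THRESHOLD else 0
                if silent_chunks > SILENCE_CHUNKS:
                    break   # enough silence: stop recording

        stream.stop_stream()
        stream.close()
        pa.terminate()

        if not frames:
            return None
        # Amplify the recording before handing it off for transcription.
        return audioop.mul(b"".join(frames), 2, AMPLIFY_FACTOR)
```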
The save_audio method writes the recorded audio to a file:
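A minimal version, assuming the standard library’s wave module is imported and AUDIO_FILENAME is one of the new constants:

```python
    def save_audio(self, audio_data):
        # Assumes: import wave
        with wave.open(AUDIO_FILENAME, "wb") as wf:
            wf.setnchannels(1)        # mono
            wf.setsampwidth(2)        # 16-bit samples
            wf.setframerate(SAMPLE_RATE)
            wf.writeframes(audio_data)
```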
This method creates a WAV file with the recorded audio data, which can then be sent to Deepgram.
The transcribe_audio method sends the audio file to Deepgram for transcription:
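A sketch using the requests library against Deepgram’s pre-recorded /v1/listen endpoint (error handling kept deliberately simple):

```python
    def transcribe_audio(self):
        # Assumes: import requests
        headers = {
            "Authorization": f"Token {DEEPGRAM_KEY}",
            "Content-Type": "audio/wav",
        }
        try:
            with open(AUDIO_FILENAME, "rb") as audio_file:
                response = requests.post("https://api.deepgram.com/v1/listen",
                                         headers=headers, data=audio_file, timeout=10)
            response.raise_for_status()
            result = response.json()
            return result["results"]["channels"][0]["alternatives"][0]["transcript"]
        except (requests.RequestException, KeyError, IndexError):
            return None
```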
This method sends a POST request to Deepgram’s API with the audio file. If successful, it extracts and returns the transcribed text.
The count_consecutive_jumps method interprets the transcribed text to determine the jump count:
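One possible implementation; the set of “similar-sounding” words is just a guess at common mishearings and can be extended:

```python
    def count_consecutive_jumps(self, transcript):
        # Words Deepgram sometimes hears instead of "jump"; extend as needed.
        jump_like = {"jump", "jumps", "jumped", "chump", "dump"}
        count = 0
        for word in transcript.lower().split():
            if word.strip(".,!?") in jump_like:
                count += 1
            else:
                break   # stop at the first word that isn't a "jump"
        return min(count, MAX_JUMP_COUNT)   # cap the jump force
```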
This method counts consecutive occurrences of “jump” or similar-sounding words at the beginning of the transcript. It’s designed to be forgiving of transcription errors and limits the maximum jump count to 10.
Integration with the Game
To integrate this audio system into the game, several modifications are made to the existing classes:
We will initialize the AudioProcessor class in the Game class’s __init__ method:
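Something along these lines (only the new lines are shown):

```python
    def __init__(self):
        # ... existing setup from the base game ...
        self.transcript = ""
        self.jump_count = 0
        self.audio_processor = AudioProcessor(self)
        self.audio_processor.start()   # begin listening on the background thread
```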
In the Game class, we add a new method to update the transcript and jump count:
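A minimal version:

```python
    def update_transcript(self, transcript, jump_count):
        # Called from the audio thread; store the latest result for the UI.
        self.transcript = transcript
        self.jump_count = jump_count
```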
In the Player class, we add a new method to handle audio-triggered jumps:
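A sketch that reuses the charge-based jump from the earlier Player sketch, mapping the spoken count onto the same scale:

```python
    def audio_jump(self, jump_count):
        # Map the number of spoken "jump"s (1-10) onto the usual charge scale.
        if self.on_ground:
            charge = MAX_CHARGE * jump_count / MAX_JUMP_COUNT
            self.vel_y = -charge * JUMP_FACTOR
            self.vel_x = charge * FORWARD_FACTOR
            self.jumping = True
            self.on_ground = False
```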
In the Game class’s draw_ui method, we can optionally add new UI elements to display the transcript and jump count:
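For example, extending the draw_ui sketch from earlier:

```python
    def draw_ui(self):
        # ... existing score text ...
        font = pygame.font.SysFont(None, 24)
        heard = font.render(f"Heard: {self.transcript}", True, (255, 255, 255))
        jumps = font.render(f"Jumps: {self.jump_count}", True, (255, 255, 255))
        self.screen.blit(heard, (10, 40))
        self.screen.blit(jumps, (10, 65))
```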
In the Game class’s run method, we ensure the audio processor is properly stopped when the game exits:
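For example, at the end of run():

```python
    def run(self):
        while self.running:
            # ... event handling, update, and draw as before ...
            self.clock.tick(FPS)
        # Shut the listener down cleanly before quitting.
        self.audio_processor.stop()
        self.audio_processor.join(timeout=1)
        pygame.quit()
```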
Finally, at the top of the file, we need to add new constants that the new audio processing capabilities require:
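The exact values depend on your microphone; something like the following is a reasonable starting point:

```python
# Placeholder audio constants; tune the thresholds for your own setup.
DEEPGRAM_KEY = "YOUR_DEEPGRAM_API_KEY"   # replace with your actual key

SAMPLE_RATE = 16000       # recording sample rate in Hz
CHUNK_SIZE = 1024         # frames read from the microphone at a time
PRE_BUFFER_CHUNKS = 10    # chunks kept from just before speech starts
VOLUME_THRESHOLD = 500    # RMS volume that counts as "speaking"
SILENCE_CHUNKS = 30       # consecutive quiet chunks that end a recording
AMPLIFY_FACTOR = 2        # gain applied to the recording before saving
AUDIO_FILENAME = "command.wav"
MAX_JUMP_COUNT = 10       # cap on how many "jump"s count toward one jump
```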
And that’s it! Remember to replace the constant DEEPGRAM_KEY with your actual API key obtained here. Register a free account and you will receive $200 in credits. Run the game with python file_name.py and voilà! You should be able to just yell “jump” into your microphone and watch your character fly through the infinite line of platforms.
The AudioProcessor class can be readily adapted to any game or application with a similar need for audio integration. We can simply replace the count_consecutive_jumps method that takes in the transcript with any other processing function needed for the specific application. Deepgram’s efficient, low-latency audio transcription API will do the rest.
The full code for the completed, audio-integrated game is available on GitHub.