How to Build a Voice AI Agent Using Deepgram and OpenAI: A Step-by-Step Guide
Frustrating customer support experiences—long wait times, confusing responses, or unresolved issues—are all too common. What if, instead, you interacted with a smart voice AI agent that understood you, responded in real-time, and tailored its approach to your needs?
AI voice agents, powered by advanced speech technologies, are changing how businesses interact with customers. They provide efficient, human-like, and personalized experiences.
In this tutorial, you’ll learn how to build a fully functional AI voice agent powered by Deepgram's voice APIs. Whether you're creating an AI concierge, virtual assistant, or customer support bot, this guide provides everything you need:
Integrating Deepgram's Speech-to-Text (STT) and Text-to-Speech (TTS) APIs.
Implementing audio intelligence features like topic identification, summaries, and sentiment analysis for dynamic, empathetic responses.
Generating actionable conversation responses using OpenAI's GPT-3.5 Turbo.
Step-by-step Python code with copy-pastable examples and a working demo.
By the end, you'll have a working AI voice agent and a clear understanding of delivering human-like interactions using voice AI tools. 🚀
The Idea: A Customer Support AI Agent
Customer support interactions are often stressful. Users struggle to communicate their problems clearly, while agents work to provide swift and practical solutions. Miscommunication can quickly escalate frustration.
With Deepgram’s STT and TTS APIs, you can build a smart AI voice agent that:
Transcribes conversations in real-time, ensuring nothing gets missed.
Understands emotional cues using sentiment analysis to adapt tone—responding empathetically when customers sound frustrated or stressed.
Highlights key topics like billing issues or technical queries to streamline support.
Summarizes issues for human agents in the loop or for storage in a database.
Uses an LLM to generate an appropriate response to the user.
Speaks the response back to the customer, all in one call.
System Flow
What is Deepgram?
Deepgram is a developer-friendly Voice AI platform that delivers:
Speech-to-Text (STT): Real-time, highly accurate transcriptions.
Text-to-Speech (TTS): High-quality audio responses for applications.
Audio Intelligence: Sentiment analysis to detect emotional cues in speech and summarization to distill lengthy conversations into concise overviews.
Voice Agent API: A unified voice-to-voice API that enables natural-sounding conversations between humans and machines.
With Deepgram, you can build AI solutions that analyze, respond, and improve user interactions. Its speed, accuracy, and customization make it ideal for AI voice agent applications.
Next Steps
In the following sections, we’ll dive into the code and show you how to:
Set up Deepgram’s APIs.
Implement real-time transcription and TTS.
Integrate sentiment analysis.
Build a demo AI voice agent.
Let’s jump right into the code! 💻
Step 1: Set Up the Environment for Your Voice AI Agent Application
To get started, you need API keys from both Deepgram and OpenAI. These keys allow your application to interact with their servers for speech recognition, audio generation, and language understanding.
1. Get a Deepgram API Key
To get a free API key, follow the steps below:
Step 1: Sign up for a Deepgram account on this page.
Step 2: Click on “Create a New API Key.”
Step 3: Add a title to your API key, then copy it and store it somewhere safe for later use.
2. Get an OpenAI API Key
Step 1: Go to the OpenAI API page.
Step 2: Generate a new API key for use with ChatGPT.
3. Install the Required Libraries
Next, create your virtual environment. Feel free to use your favorite virtual environment manager; we use conda in this tutorial.
Ensure you're using Python 3.10+, the minimum requirement for the deepgram-sdk. A minimal setup might look like this (conda shown, but any environment manager works; deepgram-sdk and openai are the official PyPI package names):
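```bash
# Create and activate an environment with Python 3.10+
conda create -n voice-agent python=3.10 -y
conda activate voice-agent

# Install the SDKs used in this tutorial
pip install deepgram-sdk openai
```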
🚨 To follow along, find the complete code walkthrough in this GitHub repository.
Using Deepgram's Features in the App
Our application consists of three main components:
utils.py: Helper functions for Deepgram and OpenAI integrations. It allows you to easily reuse or modify these functions to suit your application’s needs.
create_customer_voice_inquiry.py: Generates audio .mp3 files using Deepgram TTS for testing.
demo.py: Main application logic that transcribes, analyzes, and responds to audio input.
Here's a preview of the project structure:
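```
voice-agent/                          # illustrative root folder name
├── utils.py                          # Deepgram and OpenAI helper functions
├── create_customer_voice_inquiry.py  # generates test audio with Deepgram TTS
└── demo.py                           # transcribes, analyzes, and responds
```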
Here’s a quick overview of the helper functions you’ll use in utils.py:
get_transcript(): Transcribes an audio file with Deepgram STT.
ask_openai(): Generates a response using OpenAI ChatGPT.
save_speech_summary(): Converts AI responses into audio files with Deepgram TTS.
You will explore the components of each helper function in the following sections.
Step 2: API Initialization and Declarations
For the helper functions in utils.py to work, ensure you initialize the Deepgram and OpenAI clients correctly.
Start by importing the libraries and checking that your API keys are set as environment variables. Then use the keys to initialize the clients.
Next, define a system_prompt that provides context and instructions for the LLM.
The Deepgram settings text_options and speak_options select the model, language, and voice for Deepgram's STT and TTS requests. Here's a minimal sketch of that setup (the environment variable names, prompt wording, and model choices such as nova-2 and aura-asteria-en are illustrative; swap in your own):
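```python
import os

from deepgram import DeepgramClient, PrerecordedOptions, SpeakOptions
from openai import OpenAI

# Fail fast if either API key is missing from the environment.
DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

deepgram = DeepgramClient(DEEPGRAM_API_KEY)
openai_client = OpenAI(api_key=OPENAI_API_KEY)

# Context and instructions for the LLM (illustrative wording).
system_prompt = (
    "You are a helpful customer support agent. Answer the customer's "
    "inquiry clearly, concisely, and empathetically."
)

# STT settings: transcription plus summarization, topics, and sentiment.
text_options = PrerecordedOptions(
    model="nova-2",
    language="en",
    smart_format=True,
    summarize="v2",
    topics=True,
    sentiment=True,
)

# TTS settings: linear16/wav so the demo can write output.wav directly.
speak_options = SpeakOptions(
    model="aura-asteria-en",
    encoding="linear16",
    container="wav",
)
```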
Check the documentation to learn more about the different options available in Deepgram’s TTS and STT.
Step 3: Query ChatGPT
Next, you need a convenient function that returns a response from ChatGPT. The ask_openai() function below takes a message (the transcript of the user's query) as a string and returns the response as a string.
In the function, pass the variable messages (which includes the system and user prompts). If you want to use more robust models from OpenAI, see the models docs page. To change additional ChatGPT behaviors, such as temperature, visit the chat completions documentation.
Here, we use model="gpt-3.5-turbo". Below is a sketch of the function, reusing the openai_client and system_prompt defined earlier:
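```python
def ask_openai(message: str) -> str:
    """Send the user's transcript to ChatGPT and return the reply text."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": message},
    ]
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    return response.choices[0].message.content
```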
If your application needs more capability from this function, OpenAI also supports function calling. This is a convenient way for ChatGPT to trigger specific application logic based on the input and output prompts.
For example, if your AI agent needs to reference a ticket or a database, ChatGPT can return the functions it would call based on the prompts it has received from the customer.
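For illustration, a ticket lookup could be exposed to the model like this (lookup_ticket is a hypothetical helper in your application; the tools schema follows OpenAI's chat completions format):

```python
# Hypothetical tool definition that lets ChatGPT request a ticket lookup.
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_ticket",  # hypothetical application helper
            "description": "Fetch the status of a support ticket by its ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticket_id": {
                        "type": "string",
                        "description": "The support ticket ID.",
                    }
                },
                "required": ["ticket_id"],
            },
        },
    }
]

response = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the status of ticket 4521?"}],
    tools=tools,
)

# If the model chose to call the tool, its name and arguments appear here
# for your application code to execute.
tool_calls = response.choices[0].message.tool_calls
```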
Step 4: Query Deepgram’s Voice APIs
Once you define the function to call ChatGPT, write functions to help interface with Deepgram.
In utils.py, define the following functions (a sketch follows the list):
get_transcript(): Returns a JSON of Deepgram's transcription given an audio file.
get_topics(): Each transcript includes topics related to the discussion. This function returns a list of all unique topics in the transcript.
get_summary(): Returns the summary of the transcript as a string.
save_speech_summary(): This function will use Deepgram to write and save text to an audio file.
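Here's a sketch of all four helpers, reusing the clients and options from Step 2. The response field access follows Deepgram's prerecorded response shape with summarize="v2" and topics enabled; newer SDK releases expose the same calls under listen.rest and speak.rest:

```python
def get_transcript(payload: dict) -> dict:
    """Send an audio payload to Deepgram STT and return the full response."""
    return deepgram.listen.prerecorded.v("1").transcribe_file(payload, text_options)


def get_topics(response) -> list:
    """Collect the unique topics Deepgram detected across all segments."""
    topics = set()
    for segment in response.results.topics.segments:
        for topic in segment.topics:
            topics.add(topic.topic)
    return list(topics)


def get_summary(response) -> str:
    """Return Deepgram's short summary of the transcript."""
    return response.results.summary.short


def save_speech_summary(text: str, filename: str = "output.wav") -> None:
    """Convert text to speech with Deepgram TTS and save it to an audio file."""
    deepgram.speak.v("1").save(filename, {"text": text}, speak_options)
```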
Step 5: Source a Customer Inquiry to Test the AI Agent
Before the demo, source a customer inquiry or recording. For demo purposes, create an audio recording with a tool of your choice, or simply use Deepgram’s TTS to make a recording.
To accomplish this, simply import the Deepgram client with its SpeakOptions.
Next, pass the transcript you want to convert to an audio file. Then, provide a filename for the audio recording.
Finally, initialize a Deepgram client, pass the transcript to the client, and have it write the audio file.
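Putting it together, create_customer_voice_inquiry.py might look like this (the inquiry text and filename are illustrative):

```python
# create_customer_voice_inquiry.py
import os

from deepgram import DeepgramClient, SpeakOptions

# Sample inquiry text -- replace with whatever scenario you want to test.
TRANSCRIPT = {
    "text": (
        "Hi, I was charged twice on my last bill and I can't reach anyone "
        "about it. Can you help me sort this out?"
    )
}
FILENAME = "customer_inquiry.mp3"

deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])
options = SpeakOptions(model="aura-asteria-en")

# Have Deepgram synthesize the inquiry and write it to disk.
deepgram.speak.v("1").save(FILENAME, TRANSCRIPT, options)
print(f"Saved test inquiry to {FILENAME}")
```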
Step 6: Run the Demo
With all the helper functions defined and an audio file sourced, you can now develop your application's logic.
Ingest a sample customer inquiry as AUDIO_FILE. The logic of this application can be broken into five steps (the full script follows the list):
Open the audio file and read its contents into a payload using read().
Call get_transcript() to send the payload to Deepgram for transcription.
Pass the transcript to ask_openai(), your chat AI agent.
[Optional] get_topics() and get_summary() aren't required for the primary response, but they can help with organizing or rerouting customer queries in an AI agent application.
Pass ChatGPT's response to save_speech_summary(), which writes it to an audio file for playback.
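Here's a sketch of demo.py wiring those five steps together (the filename assumes the inquiry generated in Step 5):

```python
# demo.py
from utils import (
    ask_openai,
    get_summary,
    get_topics,
    get_transcript,
    save_speech_summary,
)

AUDIO_FILE = "customer_inquiry.mp3"  # the test inquiry from Step 5

# 1. Open the audio file and read its contents into a payload.
with open(AUDIO_FILE, "rb") as audio:
    payload = {"buffer": audio.read()}

# 2. Send the payload to Deepgram for transcription.
response = get_transcript(payload)
transcript = response.results.channels[0].alternatives[0].transcript

# 3. Generate a reply with ChatGPT.
answer = ask_openai(transcript)

# 4. (Optional) Extract topics and a summary for routing or logging.
print("Topics:", get_topics(response))
print("Summary:", get_summary(response))
print("Agent answer:", answer)

# 5. Speak the reply back to the customer as output.wav.
save_speech_summary(answer)
```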
Now run the script from the command line:
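```bash
python demo.py
```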
Running the script, you should see output containing the extracted topics, a summary of the inquiry, and the agent's answer.
The script also produces output.wav, containing the spoken version of ChatGPT's reply.
That’s it! You now have a working script you can integrate with your interface or applications. The Deepgram repository has more examples and open-source community showcases.
Further Improvements You Can Make to The Voice AI Agent Application
The demo showcased how to create a basic AI voice agent using Deepgram's STT and TTS capabilities alongside OpenAI's ChatGPT. However, the possibilities don't stop here.
Below are some ways to improve the demo and make the AI agent more robust:
1. Fine-Tuning Large Language Models (LLMs) for Specific Needs
Integrating custom large language models (LLMs) fine-tuned on specific business data or industry knowledge allows the AI agent to provide more accurate and contextual responses.
Fine-tuning improves the agent’s ability to:
Address niche customer concerns (e.g., telecom billing issues).
Understand specialized terminology.
Deliver highly relevant and precise solutions.
For instance, you can fine-tune OpenAI’s GPT models using your customer support logs and FAQs. Tools like Hugging Face or OpenAI’s fine-tuning API provide straightforward workflows for this.
2. Integrating RAG Systems and External Data Sources
One way to get customer-specific data is to connect the AI agent to Retrieval-Augmented Generation (RAG) systems, CRM tools, or support ticketing platforms. This enables:
Real-time access to historical data for personalized responses.
Seamless tracking of individual customer issues.
Efficient resolution of recurring concerns.
Architecture Example (a minimal retrieval sketch follows the list):
Deepgram STT → Transcribes audio input.
RAG System (e.g., FAISS, Weaviate, or Pinecone) → Queries relevant documents or historical data.
OpenAI GPT → Generates the response based on retrieved knowledge.
Deepgram TTS → Converts the response into audio.
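As a starting point, here's a minimal retrieval sketch using FAISS (install with pip install faiss-cpu). The documents, embedding model, and answer_with_context() helper are all illustrative; ask_openai() and openai_client come from the earlier steps:

```python
import faiss
import numpy as np

# Hypothetical knowledge base of support documents.
documents = [
    "Refunds for duplicate charges are processed within 5 business days.",
    "Billing disputes can be escalated through the account portal.",
]

def embed(texts: list[str]) -> np.ndarray:
    """Embed text with OpenAI; text-embedding-3-small is one of several options."""
    result = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return np.array([item.embedding for item in result.data], dtype="float32")

# Index the knowledge base once at startup.
index = faiss.IndexFlatL2(1536)  # text-embedding-3-small returns 1536-dim vectors
index.add(embed(documents))

def answer_with_context(transcript: str) -> str:
    """Retrieve the closest document and ground ChatGPT's answer in it."""
    _, ids = index.search(embed([transcript]), 1)  # top-1 nearest neighbor
    context = documents[ids[0][0]]
    return ask_openai(f"Context: {context}\n\nCustomer inquiry: {transcript}")
```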
3. Providing Actionable Insights for Management
AI voice agents can aggregate and analyze conversation data to provide valuable business insights. Using features like summarization and sentiment analysis, the agent can:
Identify recurring customer pain points.
Detect trends in user sentiment (e.g., frustration peaks).
Generate reports that inform operational improvements and service refinements.
Incorporating these changes turns AI voice agents into powerful tools that both help customers and surface actionable insights for the business.
Conclusion: How to Build a Voice AI Agent with Deepgram and OpenAI
Deepgram’s advanced speech recognition and text-to-speech technologies offer developers the tools to revolutionize customer service with AI voice agents. In this guide, we demonstrated how to:
Transcribe audio input in real time using Deepgram’s STT API.
Generate intelligent, human-like responses with OpenAI GPT.
Deliver audio responses back to customers using Deepgram’s TTS API.
These agents simplify troubleshooting, reduce manual workloads, and deliver personalized customer experiences at scale.
Next Steps
Extend this demo by integrating real-time streaming APIs for continuous conversation flow.
Deploy the agent on a cloud platform (AWS, Google Cloud) for scalability.
Explore fine-tuning OpenAI models for domain-specific knowledge.
Additional Resources
To continue exploring Deepgram’s tools and features, check out the following resources:
Deepgram API Playground: Test and experiment with Deepgram’s features interactively.
Speech-to-Text Getting Started Docs: A beginner-friendly guide to Deepgram APIs.
Deepgram Tutorials: Explore step-by-step tutorials for integrating Deepgram into various applications.
Deepgram Discussions Forum: Join the community to ask questions and share projects.
Deepgram Discord: Engage with Deepgram developers and community members in real-time.