
How to Build a Voice AI Agent Using Deepgram and OpenAI: A Step-by-Step Guide

In this tutorial, you’ll learn how to build a fully functional AI voice agent powered by Deepgram's voice APIs.

By Stephen Oladele

Frustrating customer support experiences—long wait times, confusing responses, or unresolved issues—are all too common. What if, instead, you interacted with a smart voice AI agent that understood you, responded in real-time, and tailored its approach to your needs?

AI voice agents, powered by advanced speech technologies, are changing how businesses interact with customers. They provide efficient, human-like, and personalized experiences.

In this tutorial, you’ll learn how to build a fully functional AI voice agent powered by Deepgram's voice APIs. Whether you're creating an AI concierge, virtual assistant, or customer support bot, this guide provides everything you need:

  • Integrating Deepgram's Speech-to-Text (STT) and Text-to-Speech (TTS) APIs.
  • Implementing audio intelligence features like topic identification, summaries, and sentiment analysis for dynamic, empathetic responses.
  • Generating actionable conversation responses using OpenAI’s GPT-3.5 Turbo.
  • Step-by-step Python code with copy-pastable examples and a working demo.

By the end, you'll have a working AI voice agent and a clear understanding of delivering human-like interactions using voice AI tools. 🚀

The Idea: A Customer Support AI Agent

Customer support interactions are often stressful. Users struggle to communicate their problems clearly, while agents work to provide swift and practical solutions. Miscommunication can quickly escalate frustration.

With Deepgram’s STT and TTS APIs, you can build a smart AI voice agent that:

  1. Transcribes conversations in real-time, ensuring nothing gets missed.
  2. Understands emotional cues using sentiment analysis to adapt tone—responding empathetically when customers sound frustrated or stressed.
  3. Highlights key topics like billing issues or technical queries to streamline support.
  4. Summarizes issues for human agents in the loop or for storage in a database.
  5. Uses an LLM to generate an appropriate response to the user.
  6. Speaks the response back to the customer, all in one call.

System Flow

The high-level flow of the customer support agent. Input: User speaks → Deepgram STT transcribes speech → LLM generates a response based on the transcribed query, while Deepgram’s audio intelligence features produce a summary and sentiment analysis → TTS API converts the response back to speech.

What is Deepgram?

Deepgram is a developer-friendly Voice AI platform that delivers:

  • Speech-to-Text (STT): Real-time, highly accurate transcriptions.
  • Text-to-Speech (TTS): High-quality audio responses for applications.
  • Audio Intelligence: Sentiment analysis to detect emotional cues in speech and summarization to distill lengthy conversations into concise overviews.
  • Voice Agent API: A unified voice-to-voice API that enables natural-sounding conversations between humans and machines.

Enterprise AI voice agents require a modern tech stack. | Image Source: Introducing Deepgram’s Voice Agent API.

With Deepgram, you can build AI solutions that analyze, respond, and improve user interactions. Its speed, accuracy, and customization make it ideal for AI voice agent applications.

 🛝 Want to try these features directly? Explore the Deepgram Playground.

Next Steps

In the following sections, we’ll dive into the code and show you how to:

  1. Set up Deepgram’s APIs.
  2. Implement real-time transcription and TTS.
  3. Integrate sentiment analysis.
  4. Build a demo AI voice agent.

Let’s jump right into the code! 💻

Step 1: Set Up the Environment for Your Voice AI Agent Application

To get started, you need API keys from both Deepgram and OpenAI. These keys allow your application to interact with their servers for speech recognition, audio generation, and language understanding.

1. Get a Deepgram API Key

To get a free API key, follow the steps below:

Step 1: Sign up for a Deepgram account on this page.

Step 2: Click on “Create a New API Key.”

Step 3: Give your API key a title, then copy it and store it somewhere safe for later use.

2. Get an OpenAI API Key

Step 1: Go to the OpenAI API page.

Step 2: Generate a new API key for use with ChatGPT.

⚠️ Store your keys securely: Use a .env file to save your keys and load them using python-dotenv.
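For reference, a minimal .env file for this project would look like the snippet below (the variable names match what the code in Step 2 expects; replace the placeholder values with your actual keys):

# .env
DG_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key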

3. Install the Required Libraries

Next, create your virtual environment. Feel free to use your favorite virtual environment manager; we use conda in this tutorial. 

Ensure you’re using Python 3.10+ (it’s the minimum requirement for the deepgram-sdk):

# Check your python version
python --version

# Create and activate a virtual environment
conda create --name deepgram-ai-agent python=3.11
conda activate deepgram-ai-agent

# Install required libraries
pip3 install deepgram-sdk python-dotenv openai

🚨To follow along, find the complete code walkthrough in this GitHub repository.

Using Deepgram's Features in the App

Our application consists of three main components:

  • utils.py: Helper functions for the Deepgram and OpenAI integrations, kept separate so you can easily reuse or modify them to suit your application’s needs.
  • create_customer_voice_inquiry.py: Generates audio .mp3 files using Deepgram TTS for testing.
  • demo.py: Main application logic that transcribes, analyzes, and responds to audio input.

The preview of the project structure:

demo/
├── utils.py  # Helper functions for Deepgram and OpenAI APIs
├── create_customer_voice_inquiry.py  # Generates demo audio files
└── demo.py  # Main application logic

Here’s a quick overview of the helper functions you’ll use in utils.py:

  • get_transcript(): Transcribes an audio file with Deepgram STT.
  • ask_openai(): Generates a response using OpenAI ChatGPT.
  • save_speech_summary(): Converts AI responses into audio files with Deepgram TTS.

You will explore the components of each helper function in the following sections.

Step 2: API Initialization and Declarations

For the helper functions in utils.py to work, ensure you initialize the Deepgram and OpenAI clients correctly. 

Start by importing libraries and checking if the API keys are in the right environment. Then use the keys to initialize the clients.

import os
import json
import openai
from dotenv import load_dotenv
from deepgram import DeepgramClient, PrerecordedOptions, SpeakOptions

load_dotenv()

DG_API_KEY = os.getenv("DG_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

if not DG_API_KEY or not OPENAI_API_KEY:
    raise ValueError("Please set the DG_API_KEY and/or OPENAI_API_KEY environment variable.")

# Initialize the clients
deepgram = DeepgramClient(DG_API_KEY)
openai_client = openai.OpenAI(api_key=OPENAI_API_KEY)

Next, define a system_prompt that provides context and instructions for the LLM.

# Define the system prompt for OpenAI
system_prompt = """
You are a helpful and friendly customer service assistant for a cell phone provider.
Your goal is to help customers with issues like:
- Billing questions
- Troubleshooting their mobile devices
- Explaining data plans and features
- Activating or deactivating services
- Transferring them to appropriate departments for further assistance

Maintain a polite and professional tone in your responses. Always make the customer feel valued and heard.
"""

The Deepgram settings text_options and speak_options select the appropriate model, language, and voice for Deepgram’s STT and TTS models.

Check the documentation to learn more about the different options available in Deepgram’s TTS and STT. 

# Set Deepgram options for TTS and STT
text_options = PrerecordedOptions(
    model="nova-2", # Use Deepgram's 'nova-2' model for speech-to-text
    language="en", # Set the language to English
    summarize="v2", # Generate a short summary of the conversation
    topics=True, # Identify the main topics discussed
    intents=True, # Detect the user's intent
    smart_format=True, # Enable smart formatting for punctuation and capitalization
    sentiment=True, # Analyze the sentiment of the speaker
)

speak_options = SpeakOptions(
    model="aura-asteria-en", # Use Deepgram's 'aura-asteria-en' model for text-to-speech
    encoding="linear16", # Set the audio encoding
    container="wav" # Set the audio container format to WAV
)

Step 3: Query ChatGPT

Next, you need a convenient function that returns a response from ChatGPT. The ask_openai() function below takes a message (the transcript of the user’s query) as a string and returns the response as a string.

Inside the function, the messages list combines the system prompt with the user prompt. If you want to use more robust models from OpenAI, see the models docs page. To adjust other ChatGPT behaviors, such as temperature, visit the chat completions documentation.

Here, we use model="gpt-3.5-turbo":

def ask_openai(prompt):
    """
    Sends a prompt to the OpenAI API and returns the response.
    """
    try:
        response = openai_client.chat.completions.create(
            model="gpt-3.5-turbo", # Use OpenAI's 'gpt-3.5-turbo' model; you can change this: https://platform.openai.com/docs/models#models-overview
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7 # Controls the randomness of the response (0.7 provides a balance between deterministic and creative outputs)
        )
        return response.choices[0].message.content
    except openai.OpenAIError as e:  # In openai>=1.0, errors live at the top level (openai.error was removed)
        return f"An error occurred: {e}"

If your application needs more capability from this function, OpenAI also supports function calling. This is a convenient way for ChatGPT to trigger specific application logic based on the prompts it receives.

For example, if your AI agent needs to reference a ticket or a database, ChatGPT can return the functions it would call based on the customer’s input, as in the sketch below.
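Here is a minimal sketch of that pattern using OpenAI’s tools API. The lookup_ticket function and its ticket_id parameter are hypothetical placeholders for your own application logic, not part of this tutorial’s code:

import json

def ask_openai_with_tools(prompt):
    """
    Sketch: lets the model request a (hypothetical) ticket lookup
    instead of answering directly.
    """
    tools = [{
        "type": "function",
        "function": {
            "name": "lookup_ticket",  # hypothetical helper in your application
            "description": "Fetch a support ticket by its ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticket_id": {"type": "string", "description": "The ticket ID to look up."}
                },
                "required": ["ticket_id"],
            },
        },
    }]
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        tools=tools,
        tool_choice="auto",  # the model decides whether to call the tool
    )
    message = response.choices[0].message
    if message.tool_calls:
        # The model is asking your application to run the function
        args = json.loads(message.tool_calls[0].function.arguments)
        return f"Model requested lookup_ticket with arguments: {args}"
    return message.content

In a full implementation, you would execute the requested function, append its result to messages as a "tool" role message, and call the API again so the model can compose its final answer.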

Step 4: Query Deepgram’s Voice APIs

Once you define the function to call ChatGPT, write functions to help interface with Deepgram. 

In utils.py, define the following functions:

get_transcript(): Returns a JSON of Deepgram's transcription given an audio file.

def get_transcript(payload, options=text_options):
    """
    Returns a JSON of Deepgram's transcription given an audio file.
    """
    response = deepgram.listen.rest.v("1").transcribe_file(payload, options).to_json()
    return json.loads(response)

get_topics(): Each transcript includes topics related to the discussion. This function returns the set of unique topics found in the transcript.

def get_topics(transcript):
    """
    Returns a set of all unique topics in a transcript.
    """
    topics = set()  # Initialize an empty set to store unique topics

    # Traverse through the JSON structure to access topics
    for segment in transcript['results']['topics']['segments']:
        # Iterate over each topic in the current segment
        for topic in segment['topics']:
            # Add the topic to the set
            topics.add(topic['topic'])
    return topics

get_summary(): Returns the summary of the transcript as a string.

def get_summary(transcript):
    """
    Returns the summary of the transcript as a string.
    """
    return transcript['results']['summary']['short']

save_speech_summary(): This function uses Deepgram TTS to convert text to speech and save it to an audio file.

def save_speech_summary(transcript, options=speak_options):
    """
    Writes an audio summary of the transcript to disk.
    """
    s = {"text": transcript}  # Deepgram's speak API expects a dict with a "text" key
    filename = "output.wav"
    deepgram.speak.rest.v("1").save(filename, s, options)
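Since text_options enables sentiment=True, each transcript also carries sentiment data you can surface with a small helper. The sketch below assumes the response shape from Deepgram’s sentiment analysis docs, where the overall result lives under results → sentiments → average; it reads the keys defensively with .get() in case your response differs:

def get_sentiment(transcript):
    """
    Sketch: returns the overall sentiment of a transcript as a
    (label, score) tuple, e.g. ('negative', -0.43), or None if the
    sentiment keys are absent from the response.
    """
    average = transcript.get('results', {}).get('sentiments', {}).get('average', {})
    if not average:
        return None
    return average.get('sentiment'), average.get('sentiment_score')

A helper like this would let demo.py detect a frustrated caller and, for example, route the conversation to a human agent, echoing point 2 of the system flow.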

Step 5: Source a Customer Inquiry to Test the AI Agent

Before the demo, source a customer inquiry or recording. For demo purposes, create an audio recording with a tool of your choice, or simply use Deepgram’s TTS to make a recording.

To do this, import the Deepgram client along with its SpeakOptions.

import os

from deepgram import (
    DeepgramClient,
    SpeakOptions,
)

Next, define the text you want to convert to audio, along with a filename for the recording.

SPEAK_TEXT = {"text": "Hi, I’m reaching out because I noticed some unexpected charges on my phone bill this month, and I need some help figuring it out. My bill is usually pretty consistent, but this time, there’s an extra fee labeled as 'Additional Services' that I don’t recognize. I didn’t sign up for anything new, and I want to make sure I’m not being charged for something by mistake. Can you help me understand what’s going on and how we can fix it? Oh, and by the way, I also noticed that my last payment hasn’t been reflected yet—it looks like there might be a delay or an issue processing it. Could we check on that as well? Thanks for your help!"}

filename = "sample.mp3"

Listen to the sample audio generated below:

[Audio: How to Build a Voice AI Agent - Sample]

Finally, initialize a Deepgram client, pass it the transcript, and have it write the audio file to disk.

API_KEY = os.getenv("DG_API_KEY")
if not API_KEY:
    raise ValueError("Please set the DG_API_KEY environment variable.")

# STEP 1 Create a Deepgram client using the API key from environment variables
deepgram = DeepgramClient(API_KEY)

# STEP 2 Call the save method on the speak property
options = SpeakOptions(
    model="aura-asteria-en",
)

response = deepgram.speak.rest.v("1").save(filename, SPEAK_TEXT, options)
print(response.to_json(indent=4))

Step 6: Run the Demo

With all the helper functions defined and an audio file sourced, you can now develop your application's logic. 

Ingest a sample customer inquiry as AUDIO_FILE. The logic of this application can be broken into five steps:

  1. Open the audio file and read its contents into a buffer using read().
  2. Call get_transcript() and pass the payload to Deepgram for processing.
  3. Pass the transcript to ask_openai(), your chat AI agent.
  4. [Optional] Although not required for the primary response, helper functions such as get_topics() and get_summary() can help with organizing or rerouting customer queries in an AI agent application.
  5. Pass ChatGPT’s response to save_speech_summary(), which writes it to an audio file for playback.

📝 NOTE: In a more practical setting, Deepgram’s text-to-speech API allows for audio streaming. In our case, the audio response is saved to disk to reduce varying OS and environment requirements.

import utils
from deepgram import FileSource


# Path to the audio file and API Key
AUDIO_FILE = "sample.mp3"

def main():
    try:
        # STEP 1: Ingest the audio file
        with open(AUDIO_FILE, "rb") as file:
            buffer_data = file.read()
  
        # Use FileSource to provide audio data to Deepgram's servers
        payload: FileSource = {
            "buffer": buffer_data,
        }

        # STEP 2: Get the transcript of the voice file
        customer_inquiry = utils.get_transcript(payload)

        # STEP 3: Send this information to OpenAI to respond.
        # Extract the transcribed text from the Deepgram response
        transcribed_text = customer_inquiry["results"]["channels"][0]["alternatives"][0]["transcript"]
        agent_answer = utils.ask_openai(transcribed_text)

        # STEP 4: Print responses (optional) that can be used for integration with an app or stored in a customer database for analytics
        print('Topics:', utils.get_topics(customer_inquiry))
        print('Summary:', utils.get_summary(customer_inquiry))

        # STEP 5: Take the OpenAI response and write this out as an audio file to generate an audio response using Deepgram's TTS
        print("Agent Answer:", agent_answer)
        utils.save_speech_summary(agent_answer)

    except Exception as e:
        print(f"Exception: {e}")


if __name__ == "__main__":
    main()

Now run the script from the command line:

python3 demo.py

When you run the script, you should see output with the extracted topics, a summary of the inquiry, and the agent’s answer:

$ python3 demo.py

Topics: {'Phone bill issues'}
Summary: The customer is experiencing charges on their phone bill and is confused about why they were charged. They mention not signing up for anything new and want to ensure they are not being charged for anything by mistake. The customer also noticed a delay in their last payment and asks the representative to check on it.
Agent Answer: Hello! Thank you for reaching out to us about the unexpected charges on your phone bill. I understand how important it is to have a clear understanding of your charges, and I'm here to assist you with that.

Let's start by looking into the additional services fee that you don't recognize. I will need to review your account to see what this charge is for and if there has been any mistake. To do that, could you please provide me with your account number or phone number associated with your account?

Regarding the delay in reflecting your last payment, I apologize for any inconvenience this may have caused. Let me also check on the status of your payment to ensure it has been processed correctly. Please bear with me while I look into these issues for you. Thank you for your patience.

The script will produce output.wav, containing the spoken version of ChatGPT’s reply:

[Audio: How to Build a Voice AI Agent - Output]

That’s it! You now have a working script you can integrate with your interface or applications. The Deepgram repository has more examples and open-source community showcases.

👉 Remember to find the complete code walkthrough in this GitHub repository.

Further Improvements You Can Make to the Voice AI Agent Application

The demo showed how to create a basic AI voice agent using Deepgram’s STT and TTS capabilities alongside OpenAI’s ChatGPT. However, the possibilities don’t stop here.

Below are some ways to improve the demo and make the AI agent more robust:

1. Fine-Tuning Large Language Models (LLMs) for Specific Needs

Integrating custom large language models (LLMs) fine-tuned on specific business data or industry knowledge allows the AI agent to provide more accurate and contextual responses. 

Fine-tuning improves the agent’s ability to:

  • Address niche customer concerns (e.g., telecom billing issues).
  • Understand specialized terminology.
  • Deliver highly relevant and precise solutions.

For instance, you can fine-tune OpenAI’s GPT models using your customer support logs and FAQs. Tools like Hugging Face or OpenAI’s fine-tuning API provide straightforward workflows for this.
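As a rough sketch of the OpenAI route (the file name support_logs.jsonl is illustrative, and your data must first be formatted per OpenAI’s fine-tuning guide):

# Sketch: upload prepared chat-format training data, then start a fine-tuning job.
training_file = openai_client.files.create(
    file=open("support_logs.jsonl", "rb"),  # illustrative file of past support conversations
    purpose="fine-tune",
)

job = openai_client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)  # poll the job until it completes, then swap the resulting model name into ask_openai()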

2. Integrating RAG Systems and External Data Sources

One way to get customer-specific data is to connect the AI agent to Retrieval-Augmented Generation (RAG) systems, CRM tools, or support ticketing platforms. This enables:

  • Real-time access to historical data for personalized responses.
  • Seamless tracking of individual customer issues.
  • Efficient resolution of recurring concerns.

Architecture Example:

  1. Deepgram STT → Transcribes audio input.
  2. RAG System (e.g., FAISS, Weaviate, or Pinecone) → Queries relevant documents or historical data.
  3. OpenAI GPT → Generates the response based on retrieved knowledge.
  4. Deepgram TTS → Converts the response into audio.
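A sketch of how that pipeline could be wired together with this tutorial’s helpers is shown below. retrieve_documents() is a hypothetical placeholder for whichever vector store client you choose, not a real library call:

def answer_with_rag(payload):
    """
    Sketch of the RAG flow: transcribe, retrieve, generate, speak.
    Reuses the utils helpers from this tutorial; retrieve_documents()
    stands in for your vector store query (FAISS, Weaviate, Pinecone, etc.).
    """
    # 1. Deepgram STT: transcribe the caller's audio
    transcript = utils.get_transcript(payload)
    query = transcript["results"]["channels"][0]["alternatives"][0]["transcript"]

    # 2. RAG system: fetch documents relevant to the query
    documents = retrieve_documents(query, top_k=3)  # hypothetical retrieval call
    context = "\n".join(documents)

    # 3. OpenAI GPT: generate a response grounded in the retrieved context
    answer = utils.ask_openai(f"Context:\n{context}\n\nCustomer question: {query}")

    # 4. Deepgram TTS: convert the response into audio
    utils.save_speech_summary(answer)
    return answer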

3. Providing Actionable Insights for Management

AI voice agents can aggregate and analyze conversation data to provide valuable business insights. Using features like summarization and sentiment analysis, the agent can:

  • Identify recurring customer pain points.
  • Detect trends in user sentiment (e.g., frustration peaks).
  • Generate reports that inform operational improvements and service refinements.
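For instance, a periodic job could aggregate the per-call topics this tutorial already extracts. The sketch below assumes your application has been storing each call’s Deepgram transcript JSON in a list called stored_transcripts (a hypothetical store):

from collections import Counter

import utils

def report_top_pain_points(stored_transcripts, n=5):
    """
    Sketch: counts how many calls mention each topic across stored
    transcripts and returns the n most common pain points.
    """
    topic_counts = Counter()
    for transcript in stored_transcripts:
        # get_topics() returns a set, so each topic counts once per call
        topic_counts.update(utils.get_topics(transcript))
    return topic_counts.most_common(n)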

Incorporating these changes turns AI voice agents into powerful tools that help customers while giving businesses actionable insights.

Conclusion: How to Build a Voice AI Agent with Deepgram and OpenAI

Deepgram’s advanced speech recognition and text-to-speech technologies offer developers the tools to revolutionize customer service with AI voice agents. In this guide, we demonstrated how to:

  • Transcribe audio input in real time using Deepgram’s STT API.
  • Generate intelligent, human-like responses with OpenAI GPT.
  • Deliver audio responses back to customers using Deepgram’s TTS API.

These agents simplify troubleshooting, reduce manual workloads, and deliver personalized customer experiences at scale.

Additional Resources

To continue exploring Deepgram’s tools and features, check out the following resources:

  1. Deepgram API Playground: Test and experiment with Deepgram’s features interactively.
  2. Speech-to-Text Getting Started Docs: A beginner-friendly guide to Deepgram APIs.
  3. Deepgram Tutorials: Explore step-by-step tutorials for integrating Deepgram into various applications.
  4. Deepgram Discussions Forum: Join the community to ask questions and share projects.
  5. Deepgram Discord: Engage with Deepgram developers and community members in real-time.
