Article·Oct 27, 2025

Using STT for Meeting Transcripts, Action Items, and Auto-Summarization

12 min read

By Stephen Oladele

You have an idea for a startup or a new product. The idea is simple: you want to build an intelligent app that can summarize all your meetings, extract action items, generate notes, and allow you to chat with your meeting history.

But then the questions start piling up. What would it take, technically, to build something like this? Which model should you use? A multimodal model? An API service? All of these questions are running through your mind.

Luckily, we’ve got you covered! In this article, we’ll walk you through how to build a meeting insights application using a Speech-to-Text (STT) model combined with a Large Language Model (LLM). Specifically, we’ll use Deepgram Nova-3, one of the most accurate STT models available, and OpenAI’s GPT-5 for language understanding and summarization.

By the end of this article, you’ll walk away with:

  • A working application that can generate insights from meetings.
  • A clear understanding of how to use metadata-enriched STT models for real-world applications.
  • Practical knowledge of how to use LLMs to generate overviews, action items, and notes from transcripts.

You can find the full code and app styling in this repo.

System Architecture and Workflow

The architecture of our meeting insights application consists of two main components:

  • STT Model.
  • Large Language Model.

Both the STT model and the LLM are hosted on the backend. Users interact with the app through a frontend interface where they can upload their meeting recordings.

Here’s how it works step by step:

  1. A user uploads the meeting recording from the frontend, and it is sent to the backend.
  2. On the backend, the STT model processes the recording and transcribes the audio into text (a transcript).
  3. The LLM processes the transcript, generates a meeting overview, extracts action items, and produces a comprehensive meeting report.
  4. The frontend displays the meeting report to the user.
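
Conceptually, the backend half of this flow is just two steps chained together. Here's a minimal sketch (the function names are illustrative, not from the repo; each step maps to a section below):

def handle_upload(audio_path, transcribe_meeting, generate_insights):
    transcript = transcribe_meeting(audio_path)   # Step 2: Deepgram Nova-3 produces a diarized, timestamped transcript
    insights = generate_insights(transcript)      # Step 3: GPT-5 turns the transcript into an overview, action items, and notes
    return insights                               # Step 4: the frontend renders this report to the user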

Implementation Details

When choosing models for this architecture, a few criteria are important.

For the STT model, look for:

  • High transcription accuracy on real-world, multi-speaker audio.
  • Speaker diarization, so each utterance is attributed to the right speaker.
  • Word- and utterance-level timestamps for extra context.

For the LLM, look for:

  • Strong summarization and instruction-following, so overviews, action items, and notes come out well structured.
  • A context window large enough to handle a full meeting transcript.

With these considerations in place, we are ready to implement the architecture.

👉 Recommended Resource: Which Speech Recognition Model is Best for My Business?

Implementing STT with Deepgram Nova-3

We’ll begin implementing our meeting insights application by setting up STT. As mentioned in the architecture, we need an STT model that:

  • Is highly accurate.
  • Supports speaker diarization.
  • Provides timestamps for extra context.

For this, we’ll use Deepgram’s Nova-3, one of the most accurate STT models available.

Step 1: Install the Deepgram SDK

pip install deepgram-sdk

Next, grab your Deepgram API key and set it as an environment variable:

export DEEPGRAM_API_KEY="your_api_key_here"

Step 2: Create an STT Script

Create a file called stt.py and start with the imports:

from deepgram import DeepgramClient, PrerecordedOptions, FileSource
from datetime import timedelta

Then, initialize the Deepgram client:

deepgram = DeepgramClient()
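
By default, DeepgramClient() picks up the key from the DEEPGRAM_API_KEY environment variable. If you'd rather pass it explicitly (for example, when pulling it from a secrets manager), the constructor also accepts the key directly; a minimal sketch:

import os
from deepgram import DeepgramClient

# Equivalent to DeepgramClient() when DEEPGRAM_API_KEY is set in the environment
deepgram = DeepgramClient(api_key=os.environ["DEEPGRAM_API_KEY"])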

Step 3: Transcribe a Recording

Now let’s configure the client to transcribe a prerecorded audio file:

filepath = "meeting_audio.wav"

# --- Transcribe with Deepgram ---
with open(filepath, "rb") as f:
    buffer_data = f.read()

payload: FileSource = {"buffer": buffer_data}

options = PrerecordedOptions(
    model="nova-3",
    smart_format=True,
    utterances=True,  # enable segmentation + timestamps
    diarize=True,     # enable speaker labeling
)

response = deepgram.listen.rest.v("1").transcribe_file(payload, options)

Here:

  • utterances=True enables segmentation and timestamps.
  • diarize=True ensures speakers are labeled.

You can access the utterances directly via response.results.utterances.
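
To see what the response contains, you can loop over the utterances and print their raw fields (a quick sketch using the same attributes the helpers below rely on):

for utt in response.results.utterances:
    # Each utterance exposes a speaker index, start/end times in seconds, and the text
    print(f"Speaker {utt.speaker} @ {utt.start:.2f}s: {utt.transcript}")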

However, the raw output still isn’t very readable or easy to share. Let’s add two helper functions.


Step 4: Formatting the Transcript

def format_timestamp(seconds: float) -> str:
    td = timedelta(seconds=seconds)
    hours, remainder = divmod(td.seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    milliseconds = int(td.microseconds / 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02}.{milliseconds:03}"


def save_transcript(utterances, path="transcript.txt"):
    full_text = []
    for utt in utterances:
        speaker = getattr(utt, "speaker", 0)
        start = format_timestamp(utt.start)
        end = format_timestamp(utt.end)
        utterance = f"Speaker {speaker} [{start} - {end}]: {utt.transcript}\n"
        full_text.append(utterance)

    with open(path, "w", encoding="utf-8") as f:
        f.write("".join(full_text))

    return "".join(full_text)

  • format_timestamp converts raw seconds into a human-readable format (HH:MM:SS.mmm).
  • save_transcript extracts utterances, attaches speaker labels and timestamps, saves them to transcript.txt, and returns the formatted text.
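
Finally, wire the helpers up to the response in stt.py; this is the same call the server makes later:

transcript = save_transcript(response.results.utterances)
print(transcript[:300])  # preview the first few formatted lines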

With this, you now have a working STT pipeline that takes a meeting recording, transcribes it, and produces a readable transcript.

Next, we’ll integrate a Language Model (GPT-5) to generate meeting overviews, extract action items, and produce bullet-point notes.

Leveraging OpenAI for Language Understanding

The next step is to analyze transcript.txt using GPT-5. The LLM will generate:

  • A clear meeting overview
  • Action items
  • Concise bullet-point notes

Install the SDK. Then set your OpenAI API key as an environment variable:

pip install openai
export OPENAI_API_KEY="your_api_key_here"

Now move on to the script. Here, you define a prompt, load the transcript produced by the STT step, and pass both to GPT-5:

from openai import OpenAI

openai_client = OpenAI()

# --- Analyze with GPT-5 ---
instructions = """
From the transcript, generate the following:
1. A clear meeting overview
2. Action items
3. Bullet-point notes
"""

with open("transcript.txt", "r", encoding="utf-8") as f:
    transcript = f.read()

gpt_response = openai_client.responses.create(
    model="gpt-5",
    input=[
        {"role": "system", "content": instructions},
        {"role": "user", "content": transcript},
    ],
)

insights = gpt_response.output_text

# Save results for results.html to fetch
with open("results.txt", "w", encoding="utf-8") as f:
    f.write(insights)

The model will generate structured meeting insights saved to a file for later use.

STT + LLM = Meeting Insights App

Now, put the STT and LLM code together to create a Meeting Insights App. This will be a simple web application built with Flask, hosted on the server, and accessible through a clean web interface.

Find the full code and styling in this repo. The project is organized as follows:

  • server.py → contains the logic for both the STT and LLM pipelines.
  • static/ → holds the static frontend files (index.html and style.css).
  • uploads/ → stores uploaded meeting recordings.

Our application will expose three routes:

  • / (index): serves the home page where users can upload their recordings.
  • /upload: handles file uploads, transcribes the audio with Nova-3, generates insights with GPT-5, and then redirects to results.
  • /results: displays the meeting insights back to the user.

Server Setup

We start by installing Flask:

pip install flask

Then add the imports and initialize the clients. As before, the Deepgram and OpenAI clients pick up their API keys from the DEEPGRAM_API_KEY and OPENAI_API_KEY environment variables:

from flask import Flask, request, jsonify, redirect, url_for
from deepgram import DeepgramClient, PrerecordedOptions, FileSource
from openai import OpenAI
from datetime import timedelta
import os

app = Flask(__name__)

deepgram = DeepgramClient()
openai_client = OpenAI()

UPLOAD_FOLDER = "uploads"
os.makedirs(UPLOAD_FOLDER, exist_ok=True)

Define the helper functions:

def format_timestamp(seconds: float) -> str:
    td = timedelta(seconds=seconds)
    hours, remainder = divmod(td.seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    milliseconds = int(td.microseconds / 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02}.{milliseconds:03}"

def save_transcript(utterances, path="transcript.txt"):
    full_text = []
    for utt in utterances:
        speaker = getattr(utt, "speaker", 0)
        start = format_timestamp(utt.start)
        end = format_timestamp(utt.end)
        utterance = f"Speaker {speaker} [{start} - {end}]: {utt.transcript}\n"
        full_text.append(utterance)

    with open(path, "w", encoding="utf-8") as f:
        f.write("".join(full_text))

    return "".join(full_text)

Upload Route

This is where the app triggers both STT and LLM pipelines:

@app.route("/upload", methods=["POST"])
def upload_file():
    if "file" not in request.files:
        return jsonify({"error": "No file uploaded"}), 400

    file = request.files["file"]
    filepath = os.path.join(UPLOAD_FOLDER, file.filename)
    file.save(filepath)

    # --- Transcribe with Deepgram ---
    with open(filepath, "rb") as f:
        buffer_data = f.read()

    payload: FileSource = {"buffer": buffer_data}
    options = PrerecordedOptions(
        model="nova-3",
        smart_format=True,
        utterances=True,
        diarize=True,
    )
    response = deepgram.listen.rest.v("1").transcribe_file(payload, options)
    transcript = save_transcript(response.results.utterances)

    # --- Analyze with GPT-5 ---
    instructions = """
    From the transcript, generate the following:
    1. A clear meeting overview
    2. Action items
    3. Bullet-point notes
    """

    gpt_response = openai_client.responses.create(
        model="gpt-5",
        input=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": transcript},
        ],
    )

    insights = gpt_response.output_text

    with open("results.txt", "w", encoding="utf-8") as f:
        f.write(insights)

    return redirect(url_for("results"))
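
One hardening step worth adding before production (a sketch, not part of the repo): sanitize the uploaded filename and restrict it to the audio formats you expect, for example with Werkzeug's secure_filename:

from werkzeug.utils import secure_filename

ALLOWED_EXTENSIONS = {"wav", "mp3", "m4a", "mp4"}  # adjust to the formats you accept

def is_allowed(filename: str) -> bool:
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS

# Inside upload_file(), before saving the file:
#     if not is_allowed(file.filename):
#         return jsonify({"error": "Unsupported file type"}), 400
#     filepath = os.path.join(UPLOAD_FOLDER, secure_filename(file.filename))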

Results Route

@app.route("/results", methods=["GET"])
def results():
    if not os.path.exists("results.txt"):
        return "No results available yet."

    with open("results.txt", "r", encoding="utf-8") as f:
        data = f.read()

    return f"""
    <html>
    <head>
      <link rel="stylesheet" href="/static/style.css">
    </head>
    <body class="app">
      <h1 class="header">Meeting Insights</h1>
      <div class="transcript-box">{data}</div>
      <a href="/" class="file-input" style="margin-top:2rem;">Upload Another File</a>
    </body>
    </html>
    """

Index Route

@app.route("/")
def index():
    return app.send_static_file("index.html")

Frontend (index.html)

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Meeting Insights App</title>
  <link rel="stylesheet" href="/static/style.css" />
</head>
<body class="app">
  <header class="home-header">
    <h1 class="main-title">Meeting Insights App</h1>
    <p class="subtitle">Upload your meeting recording and let AI do the heavy lifting.</p>
  </header>

  <main class="upload-section">
    <form class="file-upload" action="/upload" method="post" enctype="multipart/form-data">
      <label class="file-label">
        <input type="file" name="file" class="file-input" required />
      </label>
      <button type="submit" class="file-input">Upload</button>
    </form>
  </main>

  <footer class="footer">
    <p>Powered by Deepgram Nova-3 & GPT-5</p>
  </footer>
</body>
</html>

Finally, run the app:

if __name__ == "__main__":
    app.run(debug=True)
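
Save the server code as server.py (matching the repo layout), start it, and open http://127.0.0.1:5000 in your browser; Flask's development server listens on port 5000 by default:

python server.py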

With this, you have a working end-to-end Meeting Insights App. You can upload a recording, let Deepgram Nova-3 transcribe it, and then have GPT-5 generate overviews, action items, and notes.

As a quick test, upload a short meeting recording: the app returns a meeting overview, a list of action items, and bullet-point notes generated from the transcript.

You can find the full code and styling in this repo.

Production-Ready Implementation Ideas

While your Meeting Insights app is fully functional, there are several ways to take it closer to a production-ready solution:

1. Direct Integrations with Meeting Platforms

Major platforms like Google Meet, Zoom, and Microsoft Teams provide APIs to access recorded meetings. For example, Zoom’s Recording Archive Files Completed webhook alerts developers when a recording is ready for download.

By integrating with these APIs, you can automatically pull recordings as soon as a meeting ends, removing the need for manual uploads.
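
As a rough sketch of what that integration could look like (the endpoint, payload fields, and download handling here are assumptions; check Zoom's webhook documentation for the exact event names, URL validation handshake, and authentication requirements), you could add a webhook route to the same Flask app:

import requests  # extra dependency for downloading the recording

@app.route("/zoom-webhook", methods=["POST"])
def zoom_webhook():
    event = request.get_json(silent=True) or {}
    # Assumed payload shape: a list of recording files, each with a download URL
    recording_files = event.get("payload", {}).get("object", {}).get("recording_files", [])
    for rec in recording_files:
        download_url = rec.get("download_url")
        if download_url:
            audio = requests.get(download_url).content  # Zoom may require a download token here
            path = os.path.join(UPLOAD_FOLDER, "zoom_recording.mp4")
            with open(path, "wb") as f:
                f.write(audio)
            # ...then run `path` through the same Nova-3 + GPT-5 pipeline as /upload
    return "", 200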

2. Automated Bots

Another approach is to build bots that can join meetings in real time, record the session, and forward the audio to the insights pipeline. Tools like Selenium can help automate this process. For example, check out this Google Meet Bot that demonstrates how such automation can be implemented.

3. Browser Extensions

A lightweight alternative is to provide users with a browser extension that records their meeting sessions on demand and sends them directly to the backend for analysis.

No matter which method you choose, the foundation is the same: you need a reliable speech-to-text model to capture conversations accurately and a powerful language model to analyze them and surface meaningful insights.

Conclusion: Using STT for Meeting Transcripts, Action Items, and Auto-Summarization

In this article, we built a working Meeting Insights App that combines Deepgram’s Nova-3 for transcription and GPT-5 for analysis. Nova-3’s diarization and timestamping capabilities provide crucial context that empowers GPT-5 to generate clear overviews, actionable next steps, and concise notes.

What we’ve shown here is just the beginning. By combining speech-to-text with large language models, you can reimagine how teams capture, organize, and act on the knowledge locked inside meetings. Whether it’s seamless API integrations, bots that attend calls on your behalf, or one-click browser extensions, the possibilities are endless.

The message is clear: with the right AI tools, every meeting can be turned into actionable insights without manual effort.

Sign up on the Deepgram Console and start testing with a free API key and $200 in free credits.