How to Build a Speech-to-Text (STT) Note Taking App in Python


With the increasing pace of work, life, and society as a whole, the demand for technologies that help people save time has grown substantially. There are many apps and services that use machine learning to record, transcribe, and organize meeting notes, but they often come with high costs or subscription models. However, the core components of an intelligent note-taking companion with enterprise-grade speech-to-text capabilities are surprisingly simple to build at little to no cost.
This tutorial implements the core functionality of such a speech-to-text note-taking app and outlines options for extending it further.
Basic Requirements
There are three main components to a speech-to-text note-taking app:
Recording function. The app should be able to record lectures, meetings, interviews, or podcasts and store the audio for processing. This can be done easily with existing libraries and a working microphone on the device running the app.
Speech-to-text (STT) function. The app then transcribes the recorded audio into an organized transcript. The transcript should contain accurate timestamps indicating when each sentence is spoken. Additionally, for an interview, podcast, or any scenario involving more than one speaker, the STT step should also perform speaker diarization, labeling each section of speech with its respective speaker.
Intelligent processing. A meeting-minutes-style document is generated containing bite-sized but crucial information: a summary, the recording segmented into chapters, key points, and any potential action items.
This tutorial will provide a scalable implementation that can be extended and integrated into different platforms and apps that suit each individual’s wants and needs.
Workflow Overview
This is an overview of the app’s workflow using Python:
The pyaudio and wave libraries will be used to record audio on demand.
Deepgram’s enterprise-ready speech-to-text API will be used to transcribe the audio. The transcript will be automatically sectioned into paragraphs with appropriate timestamps and, if applicable, speaker labels.
We will use an LLM to generate a summary, a table of contents with timestamps, and the key points discussed in each section.
Extra functionality, such as exporting the transcript and summary document to a Markdown file or generating hyperlinks between chapter titles and the transcript, can then be layered on top.


We are going to use Python for this app, but most, if not all, of the libraries and APIs used in this tutorial can be applied similarly in other popular languages such as JavaScript.
The Recording Function
As mentioned previously, pyaudio and wave are the libraries responsible for recording the audio. Of course, any audio recording can be used, such as a pre-recorded lecture from your phone or an audio file from an online meeting. The function below starts recording from the device’s microphone when the Enter key is pressed and stops when it is pressed again. The trigger condition can easily be changed to whatever suits your workflow.
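Here is a minimal sketch of that function. The record_audio name and the defaults (16-bit mono audio at 44.1 kHz, 1,024-sample chunks) are this sketch’s own choices:

```python
import threading
import wave

import pyaudio


def record_audio(filename="recording.wav", chunk=1024, rate=44100, channels=1):
    recording = True

    def check_for_stop():
        # input() blocks until Enter is pressed, then we flip the flag.
        nonlocal recording
        input()
        recording = False

    input("Press Enter to start recording...")
    audio = pyaudio.PyAudio()
    stream = audio.open(format=pyaudio.paInt16, channels=channels, rate=rate,
                        input=True, frames_per_buffer=chunk)
    print("Recording. Press Enter again to stop.")

    # Run the key-press check in its own thread so it does not block the read loop.
    threading.Thread(target=check_for_stop, daemon=True).start()

    frames = []
    while recording:
        frames.append(stream.read(chunk))

    stream.stop_stream()
    stream.close()
    sample_width = audio.get_sample_size(pyaudio.paInt16)
    audio.terminate()

    # Save the captured frames as a WAV file.
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(b"".join(frames))

    return filename
```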
Because input() is blocking, meaning it halts execution until the user presses the Enter key, we cannot poll for the key press inside the recording loop itself; the check has to run in a separate thread. This is done with threading.Thread, using the check_for_stop function, which sets the recording flag to False as soon as the key press is detected in its own thread.
Back in the main thread, a while loop reads the audio in chunks of a user-specified size until the recording flag is set to False.
The size of the chunk parameter acts as a trade-off between overhead and latency. The integer passed into the chunk parameter determines the number of samples that are processed at one time. With a lower chunk size, the recording function will be more responsive to stop commands but will face more CPU overhead due to the frequent processing. On the other hand, a larger chunk size will reduce overhead but increase the latency in detecting stop commands.
Finally, the wave library is used to save the audio data as a WAV file.
Speech To Text (STT) Function Using Deepgram’s API
Now with the audio file in hand, we can move on to implementing the speech-to-text (STT) functionality.
Deepgram’s speech recognition API gives every new user $200 of free credit, with no credit card required and no strings attached. You can sign up here.
Each dollar of credit covers roughly four hours of audio processing.
Deepgram’s Python SDK can be installed with pip install deepgram-sdk. The audio processing pipeline is concise and straightforward.
Remember to store your Deepgram API key in the “DEEPGRAM_API_KEY” environment variable by executing the command export DEEPGRAM_API_KEY="YOUR API KEY".
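A sketch of the transcription step using the Python SDK. The transcribe_audio name and the nova-2 model choice are this sketch’s assumptions, and depending on your SDK version the request path may be listen.prerecorded.v("1") instead of listen.rest.v("1"):

```python
import os

from deepgram import DeepgramClient, PrerecordedOptions


def transcribe_audio(audio_path):
    deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

    with open(audio_path, "rb") as f:
        payload = {"buffer": f.read()}

    options = PrerecordedOptions(
        model="nova-2",          # assumed model choice; other Deepgram STT models work too
        smart_format=True,
        diarize=True,
        paragraphs=True,
    )

    response = deepgram.listen.rest.v("1").transcribe_file(payload, options)
    return response.to_dict()    # work with the response as a plain dictionary
```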
The PrerecordedOptions class specifies certain helpful processing steps that will be performed along with the audio transcription. In our use case, we will set the smart_format, diarize, and paragraphs options to True.
More options can be found in the speech-to-text documentation page under the section “Formatting.”
The response is returned in JSON format. A lot of information is nested within dictionaries and lists, but we are only interested in a few entries: the actual transcript text, the speaker label, and the timestamp at which each paragraph is spoken. Their location within the JSON response looks something like this:
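A simplified view of the relevant portion of the response (values elided and unrelated fields omitted):

```json
{
  "results": {
    "channels": [{
      "alternatives": [{
        "transcript": "...",
        "paragraphs": {
          "paragraphs": [{
            "speaker": 0,
            "start": 12.34,
            "end": 25.67,
            "sentences": [
              {"text": "...", "start": 12.34, "end": 17.8},
              {"text": "...", "start": 17.8, "end": 25.67}
            ]
          }]
        }
      }]
    }]
  }
}
```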
Knowing the JSON structure, it’s not too difficult to extract the relevant information. We format the timestamp in brackets, like [mm:ss], followed by the speaker information on the same line; the paragraph itself is then written on an indented new line. Finally, the entire transcript is written to a .txt file.
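One way to do this, assuming the dictionary structure above (the write_transcript name and the exact indentation style are this sketch’s choices):

```python
def write_transcript(response, output_path="transcript.txt"):
    paragraphs = (
        response.get("results", {})
        .get("channels", [{}])[0]
        .get("alternatives", [{}])[0]
        .get("paragraphs", {})
        .get("paragraphs", [])
    )

    lines = []
    for para in paragraphs:
        minutes, seconds = divmod(int(para.get("start", 0)), 60)
        speaker = para.get("speaker", 0)
        text = " ".join(s.get("text", "") for s in para.get("sentences", []))
        # "[mm:ss] Speaker N" on one line, the paragraph indented on the next.
        lines.append(f"[{minutes:02d}:{seconds:02d}] Speaker {speaker}")
        lines.append(f"    {text}\n")

    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    return output_path
```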
We use the .get method here in case the response has missing keys or returns an empty transcript; the second argument specifies the default value to return when a key is not found.
Intelligent Post Processing
There are many APIs and thousands of LLMs to choose from. In this tutorial, we will use Google’s Gemini API, which is completely free with generous limits for individual use. For larger-scale use cases, Ollama is a great local option that requires only minimal code changes relative to the Gemini API.
The LLM needs to complete three tasks: write a one-paragraph summary, segment the transcript into chapters, and identify any action items as well as any potential assignees. There is some work to be done with the prompt, especially the segmentation part of the speech.
We must strike a careful balance so that the model neither “overdoes” it by splitting off a new chapter at the slightest change of topic nor “underdoes” it by producing too few chapters for a content-rich transcript. We will also instruct the model to write a few concise, straightforward key points for each chapter, acting as a “TL;DR.”
After some experimenting, I landed on a prompt along the following lines.
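A representative version is below; the exact wording is illustrative and should be tuned to your content, and the transcript itself is appended to the end of the prompt at call time:

```python
PROMPT = """You are an assistant that analyzes meeting and lecture transcripts.
Given the timestamped transcript below, produce:

1. A one-paragraph summary of the entire recording.
2. A list of chapters. Start a new chapter only when the topic clearly changes:
   do not create a chapter for every minor shift, but do not compress a long,
   content-rich transcript into just one or two chapters either. For each
   chapter, give the starting timestamp as it appears in the transcript, a
   short title, and a few concise key points that act as a TL;DR.
3. Any action items that were mentioned, each with the task and, if it can be
   identified, the assignee.

Transcript:
"""
```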
Depending on the model, the prompt may need to be adjusted for additional specificity and details.
How Do LLMs Generate Structured Outputs?
Merely prompting the model is not enough, however; we need the LLM to output its responses in a strict, specific format that our code can process consistently and accurately every time. Luckily, Google’s Gemini API and most other LLM providers support a “structured output” mode that forces the model’s output to follow a given schema, such as JSON or XML.
Early on, direct prompting was used to instruct the LLM to output in a specific format, but given the stochastic nature of Transformer models, reliability was inconsistent at best. Today, most LLM providers use finite state machines (FSMs) to ensure consistent structured output.
When generating structured output like JSON, the LLM’s usual token-by-token generation process is constrained by an FSM that tracks the “state” of the output being built. For example, inside a JSON string only certain characters are valid next tokens, and immediately after a property name a colon must follow.
The FSM works by:
Tracking what “state” the generation is in (e.g., “inside object”, “after property name”, “inside array”)
For each state, defining which tokens are valid next steps
During token selection, applying logit biasing, which adjusts the probability of every candidate next token so that invalid tokens become extremely unlikely or impossible
This approach is different from regular prompting because it directly constrains the model’s token selection process at the probability distribution level, rather than just hoping the model follows instructions.
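As a toy illustration of that last step, the sketch below masks the logits of disallowed tokens before sampling. The tiny vocabulary and hand-written state table are invented for clarity; real implementations derive the allowed-token sets from a grammar over the model’s full tokenizer vocabulary:

```python
import numpy as np

VOCAB = ["{", "}", '"title"', ":", '"some string"', ","]

# Which tokens the FSM allows in each state of the JSON being built.
ALLOWED = {
    "start":           {"{"},
    "inside_object":   {'"title"', "}"},
    "after_property":  {":"},
    "expecting_value": {'"some string"'},
    "after_value":     {",", "}"},
}

def constrain(logits, state):
    # Keep the original logit for valid tokens; push invalid ones to -inf
    # so they receive zero probability after the softmax.
    mask = np.array([0.0 if tok in ALLOWED[state] else -np.inf for tok in VOCAB])
    return logits + mask

logits = np.random.randn(len(VOCAB))
print(constrain(logits, "inside_object"))  # only '"title"' and '}' stay finite
```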
Using Structured Outputs with Gemini’s API
We’ll need a Gemini API key, which can be obtained here. We’ll first instantiate a client with our API key.
```python
from google import genai

client = genai.Client(api_key=YOUR_API_KEY)
```
Then, we will define our desired JSON schema as classes inheriting from pydantic’s BaseModel. Pydantic is a Python library that validates and converts the LLM’s structured output into the Python data types specified in the class.
We will pass the TranscriptSummary class to the LLM. Within the class, we are nesting the Chapter and ActionItem classes which contain their own properties such as timestamp, title, key points, task, and assignee.
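The schema as a set of Pydantic models might look like this (the exact field names, beyond those mentioned above, are this sketch’s assumptions):

```python
from pydantic import BaseModel


class Chapter(BaseModel):
    timestamp: str          # e.g. "03:15", matching the transcript's [mm:ss] format
    title: str
    key_points: list[str]


class ActionItem(BaseModel):
    task: str
    assignee: str


class TranscriptSummary(BaseModel):
    summary: str
    chapters: list[Chapter]
    action_items: list[ActionItem]
```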
Prompting the model is a simple one-liner:
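Something along these lines, passing our schema through the request config (here prompt is assumed to hold the prompt string from earlier with the transcript appended):

```python
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents=prompt,
    config={"response_mime_type": "application/json", "response_schema": TranscriptSummary},
)
```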
Altogether, we have the analyze_transcript function. We’re using the gemini-2.5-flash-preview-04-17 model, which is one of the most capable models in its class and highly efficient. As of May 14th, 2025, the model has a rate limit of 500 requests per day, 250,000 tokens per minute, and 10 requests per minute.
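A sketch of that function, which reads the transcript file, sends it to Gemini with the structured-output config, and returns the parsed TranscriptSummary:

```python
def analyze_transcript(transcript_path, model="gemini-2.5-flash-preview-04-17"):
    with open(transcript_path, "r", encoding="utf-8") as f:
        transcript = f.read()

    response = client.models.generate_content(
        model=model,
        contents=PROMPT + transcript,
        config={
            "response_mime_type": "application/json",
            "response_schema": TranscriptSummary,
        },
    )
    # With a Pydantic schema, response.parsed is already a TranscriptSummary instance.
    return response.parsed
```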
Using Ollama for Local Models
For many developers with enterprise use cases, local models are a common choice because of the flexibility and security they offer compared to hosted providers. Fortunately, the popular local LLM runtime Ollama supports structured output with a syntax very similar to Google’s Gemini API.
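A sketch of the same step against a local model via the ollama Python package (the analyze_transcript_local name and the llama3.1 model choice are assumptions; any model pulled locally works):

```python
from ollama import chat


def analyze_transcript_local(transcript, model="llama3.1"):
    response = chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT + transcript}],
        # Constrain generation to the same Pydantic schema via its JSON schema.
        format=TranscriptSummary.model_json_schema(),
    )
    return TranscriptSummary.model_validate_json(response.message.content)
```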
Putting it All Together
With each individual piece of the app done, we’ll put everything together into a basic command-line app.
We’ll first define a few command-line arguments (see the sketch after this list):
“file”: The file path to a pre-recorded audio file if available
“dir”: The output directory for the generated transcript and summary files
“model”: The LLM model to use for generating the chapters, summary, action items, and key points
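Sketched with argparse (the flag names match the list above; the defaults are this sketch’s assumptions):

```python
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Speech-to-text note-taking app")
    parser.add_argument("--file", default=None,
                        help="Path to a pre-recorded audio file; records from the mic if omitted")
    parser.add_argument("--dir", default="output",
                        help="Output directory for the transcript and summary files")
    parser.add_argument("--model", default="gemini-2.5-flash-preview-04-17",
                        help="LLM used to generate the chapters, summary, action items, and key points")
    return parser.parse_args()
```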
The recording logic is up next. We’ll check whether the user has provided a pre-recorded audio file through the command-line argument: if so, we use it (optionally copying it into the output directory for consistency); if not, we run the record_audio function.
Then we simply call the functions defined earlier in order: transcribe the audio, write it to a properly formatted transcript file, and feed that transcript to the LLM for intelligent post-processing.
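Putting those steps into a main function might look like the following sketch (write_markdown, which formats the LLM output, is sketched just below; the helper names reuse the earlier sketches):

```python
import os
import shutil


def main():
    args = parse_args()
    os.makedirs(args.dir, exist_ok=True)

    # Use the provided recording if there is one; otherwise record from the mic.
    if args.file:
        audio_path = shutil.copy(args.file, args.dir)
    else:
        audio_path = record_audio(os.path.join(args.dir, "recording.wav"))

    response = transcribe_audio(audio_path)
    transcript_path = write_transcript(response, os.path.join(args.dir, "transcript.txt"))
    summary = analyze_transcript(transcript_path, model=args.model)
    write_markdown(summary, os.path.join(args.dir, "summary.md"))


if __name__ == "__main__":
    main()
```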
Finally, we can organize the LLM’s output into a nicely formatted markdown file and write it to the same output directory as the transcript and the recorded audio.
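One possible write_markdown helper (the exact layout is a matter of taste):

```python
def write_markdown(summary, output_path="summary.md"):
    lines = ["# Summary", "", summary.summary, "", "## Chapters", ""]
    for chapter in summary.chapters:
        lines.append(f"### [{chapter.timestamp}] {chapter.title}")
        lines.extend(f"- {point}" for point in chapter.key_points)
        lines.append("")
    if summary.action_items:
        lines.append("## Action Items")
        lines.extend(f"- {item.task} ({item.assignee})" for item in summary.action_items)
    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```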
And that’s it: an intelligent note-taking app with its core functionality fully implemented.
Below is a sample result from the famous TED talk about procrastination. Since this is a TED talk, there are no action items.


For individuals, the app costs little to nothing, is fully customizable, and arguably offers far more value than subscription-based services that accomplish essentially the same thing with a nicer UI. For enterprise-scale applications, this implementation demonstrates the core functionality in minimal lines of code and can easily be scaled or packaged for further expansion.
Conclusion and Extensions
There are many directions this app can be taken in. A UI wrapper such as Streamlit can be integrated smoothly, and Deepgram offers SDKs in other languages with usage very similar to the Python SDK.
We can also integrate the implementation with popular note-taking apps such as Obsidian and Notion.
Furthermore, internal hyperlinks in Obsidian can point to a specific paragraph identified by a caret followed by a unique string of letters and numbers (e.g., ^abc123). We can use this to our advantage by appending a unique identifier, derived from that paragraph’s timestamp, after every paragraph in our transcript.
Then, within our structured output schema, we can add an extra field to Chapter that stores these identifiers. In the LLM’s prompt, we tell the model to populate the field with the identifier corresponding to the timestamp it selected for the start of each new chapter. This way, clicking a chapter timestamp generated by the LLM jumps to the exact spot in the transcript.
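As a rough sketch of the idea, the identifier can be derived directly from each paragraph’s [mm:ss] timestamp; the ^t prefix and link format below are just one possible convention:

```python
def block_id(timestamp: str) -> str:
    # "03:15" -> "^t03-15", a valid Obsidian block identifier.
    return "^t" + timestamp.replace(":", "-")

# In write_transcript, append the identifier to each paragraph:
#     lines.append(f"    {text} {block_id(f'{minutes:02d}:{seconds:02d}')}\n")
#
# In write_markdown, link each chapter heading back to its paragraph, using either
# an LLM-populated identifier field or the identifier derived from the chapter's
# own timestamp:
#     lines.append(f"### [[transcript#{block_id(chapter.timestamp)}|{chapter.timestamp}]] {chapter.title}")
```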
There are endless possibilities for potential extensions; this tutorial is just the start.