Build a Real-Time Transcription App with React and Deepgram

⏩ TL;DR
To build a real-time transcription application, you need two main components:
A frontend (UI) that captures microphone input and lets users interact with the app.
A transcription backend, hosting the speech-to-text (STT) model, that processes the audio and returns results in real time while the user speaks.
Deepgram paired with React is an excellent combo for building responsive, real-time transcription applications.
See the GitHub repo here


Imagine opening a web page, and as soon as it loads, a sleek application appears. You start speaking into your microphone, and your words instantly appear on the screen, almost as if the app is reading your mind. Fast. Accurate. Seamless. This is a real-time transcription app, and in this tutorial, you’ll learn exactly how to build one yourself.
We'll use Deepgram’s Nova-3, a high-accuracy, domain-customizable, low-latency speech-to-text model, to bring this idea to life.
We'll stream live audio data to Deepgram's transcription server using their WebSocket API for real-time transcription.
But a powerful backend is only half the equation. To match the speed and responsiveness of the transcription server, we need a modern frontend framework. That’s where React comes in. React helps us build a dynamic user interface (UI) that updates in sync with the user's speech.
Here’s what we’ll be building:
Along the way, you'll learn how to:
Use navigator.mediaDevices.getUserMedia() to access the user's microphone
Record audio with the MediaRecorder API
Stream audio in real time using the browser’s native WebSocket API
Display live transcription results inside a React application
Let’s dive in and build your very own real-time transcription app.
Note: You can find the completed GitHub repository here.
Architecture of a Real-Time Transcription App
A real-time transcription application consists of two main components: the frontend and the backend.
The frontend is responsible for the user interface, which the user interacts with directly. More importantly, it also manages access to the user's microphone.
As the user speaks, their voice is captured and broken into small audio chunks, typically around 250 milliseconds each. These chunks are then streamed to the backend in real time.
The backend hosts a speech-to-text model, which receives each audio chunk and responds with the corresponding transcript. This continuous exchange of audio and transcribed text happens over a real-time communication protocol—in this case, WebSocket—for rapid transcription.
Live Audio with Deepgram’s WebSocket API
Instead of building and deploying our own transcription backend, we'll take advantage of Deepgram's transcription endpoints, which provide access to their state-of-the-art speech-to-text models.
Deepgram offers two primary types of transcription endpoints:
Pre-recorded audio: This endpoint accepts complete audio files (e.g., MP3, WAV) and returns their full transcription after processing.
Live Audio: This endpoint accepts streaming audio in real time and responds continuously with transcriptions as the audio is received.
For our real-time application, we’ll use the Live Audio endpoint, which is designed for low-latency transcription and supports chunked audio input.
The WebSocket URL is:
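```
wss://api.deepgram.com/v1/listen
```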
This is a secure WebSocket connection that stays open to receive audio data in chunks. For each chunk received, the API responds with a JSON payload containing details such as:
The transcribed text
The confidence score of the transcription
A list of individual words with their timestamps
For example, if you send an audio chunk that says “Hello, world! Welcome to Deepgram!”, the API will respond with the following:
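The exact payload depends on your settings, but a trimmed response looks roughly like this (the values here are illustrative):

```json
{
  "type": "Results",
  "is_final": true,
  "channel": {
    "alternatives": [
      {
        "transcript": "hello world welcome to deepgram",
        "confidence": 0.99,
        "words": [
          { "word": "hello", "start": 0.08, "end": 0.32, "confidence": 0.99 },
          { "word": "world", "start": 0.32, "end": 0.64, "confidence": 0.98 },
          { "word": "welcome", "start": 0.8, "end": 1.12, "confidence": 0.99 },
          { "word": "to", "start": 1.12, "end": 1.28, "confidence": 0.99 },
          { "word": "deepgram", "start": 1.28, "end": 1.76, "confidence": 0.97 }
        ]
      }
    ]
  }
}
```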
To interact with the Live Audio endpoint, you can use any WebSocket client from any programming language.
However, since we’re building a web application, we’ll use the browser’s built-in WebSocket API to establish a real-time connection with Deepgram.
Core Browser APIs for Real-Time Transcription
Before we start building our React application, let’s take a closer look at the core browser APIs that make real-time transcription possible in the browser.
There are three main APIs we’ll be working with:
MediaDevices
MediaRecorder
WebSocket
First up is the MediaDevices API, which lets us access the user's microphone, after asking for permission, of course. If the user allows it, we can capture the audio stream using getUserMedia().
To get just audio, we pass { audio: true } as a constraint:
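Assuming we’re inside an async function, the call looks like this:

```js
// Ask for microphone access; the browser prompts the user for permission
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
```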
Once we have the audio stream, we pass it to the MediaRecorder API, which will handle breaking it up into chunks that we can send to Deepgram’s server:
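```js
// Wrap the microphone stream in a MediaRecorder so we can capture it in chunks
const mediaRecorder = new MediaRecorder(stream);
```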
To chunk the audio, we call the .start() method with a timeslice value, 250 milliseconds in our case:
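```js
// Emit a new audio chunk every 250 ms
mediaRecorder.start(250);
```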
This tells the recorder to generate a new audio chunk every 250 ms.
Next, we need to establish a WebSocket connection to the Live Audio Endpoint using the WebSocket API. This is what allows us to stream audio in real time and get live transcriptions back:
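Here, DEEPGRAM_API_KEY stands in for your own key:

```js
// Open a secure WebSocket to Deepgram's live endpoint, selecting the nova-3 model
// and authenticating with the API key via the "token" subprotocol
const socket = new WebSocket(
  "wss://api.deepgram.com/v1/listen?model=nova-3",
  ["token", DEEPGRAM_API_KEY]
);
```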
In the WebSocket URL, we specify the model we want to use, in this case, nova-3. For authentication, we use our Deepgram API key passed in as a token.
Once the connection is open, we register a callback. Inside that callback, we add a dataavailable event listener to the MediaRecorder so that each time a new chunk is ready, we send it over the WebSocket:
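A minimal version of that wiring looks like this:

```js
socket.onopen = () => {
  // Forward each audio chunk to Deepgram as soon as it is available
  mediaRecorder.addEventListener("dataavailable", (event) => {
    if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
      socket.send(event.data);
    }
  });
};
```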
We also need to handle incoming messages from Deepgram, which are the transcription results. So we define another callback to log those to the console:
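```js
socket.onmessage = (message) => {
  // Each message is a JSON payload; the transcript lives under channel.alternatives
  const data = JSON.parse(message.data);
  const transcript = data.channel?.alternatives?.[0]?.transcript;
  if (transcript) {
    console.log(transcript);
  }
};
```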
And that’s it! These are the core browser APIs we’ll use to build our real-time transcription app.
Now let’s move on to wiring it all up inside a React component.
Build the React Application
Let’s get started with building the React transcription application. The transcription app you will build is very simple. It has a single button, and when you click it, it starts transcribing your audio. Clicking the button again stops the transcription.
Step 1: Bootstrap the Application with Vite
To get started, we’ll use Vite to scaffold our React application. Run npm create vite@latest and follow the prompts to create a React app.
Once the app is created, remove all the content from App.css and index.css to get rid of the default styles. Then update the App.jsx file with the code below:
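The version in the repo may differ slightly, but a minimal App.jsx looks something like this (the heading text and class name are placeholders):

```jsx
function App() {
  return (
    <div className="app">
      <h1>Real-Time Transcription</h1>
      <button>Start Transcribing</button>
      <p></p>
    </div>
  );
}

export default App;
```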
When you run the app, you should see this basic UI:


Now that we have our app bootstrapped, let’s start implementing the functionality.
Step 2: Define State and References
Our app needs to manage state and hold references to the browser APIs. We need two states:
transcript to store the text we get back from Deepgram.
isTranscribing to track whether the app is currently recording and transcribing.
We’ll also define three references for non-UI-related data:
socketRef for the WebSocket connection
mediaRecorderRef for the MediaRecorder instance
streamRef for the user’s microphone stream
We can now define all of these in our component:
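Here’s roughly how the top of the component looks with the state and refs in place; the JSX from Step 1 stays the same for now:

```jsx
import { useState, useRef } from "react";

function App() {
  // UI state
  const [transcript, setTranscript] = useState("");
  const [isTranscribing, setIsTranscribing] = useState(false);

  // References to non-UI objects that need to survive re-renders
  const socketRef = useRef(null);
  const mediaRecorderRef = useRef(null);
  const streamRef = useRef(null);

  return (
    <div className="app">
      <h1>Real-Time Transcription</h1>
      <button>Start Transcribing</button>
      <p></p>
    </div>
  );
}

export default App;
```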
Step 3: Define the Transcription Event
We want clicking the button to start transcription, and the UI to reflect that transcription has begun. To achieve this, let's add an event handler that triggers when the button is clicked.
This event handler will be called handleTranscriptionToggle. Let's implement a simple version of this handler:
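```jsx
const handleTranscriptionToggle = () => {
  if (isTranscribing) {
    // Second click: stop transcribing
    setIsTranscribing(false);
  } else {
    // First click: reset the transcript and start transcribing
    setTranscript("");
    setIsTranscribing(true);
  }
};
```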
This version of the event handler doesn't include any transcription logic; it simply demonstrates the toggle structure that the real transcription logic will slot into.
When the user clicks the button for the first time, it goes to the else condition since isTranscribing is initially false. This sets the transcript and isTranscribing values.
The next click takes the user to the if condition, doing the opposite. Let's update our JSX to reflect these changes:
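The button labels are just placeholders; the key changes are the onClick handler and rendering the transcript state:

```jsx
return (
  <div className="app">
    <h1>Real-Time Transcription</h1>
    <button onClick={handleTranscriptionToggle}>
      {isTranscribing ? "Stop Transcribing" : "Start Transcribing"}
    </button>
    <p>{transcript}</p>
  </div>
);
```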
Now, let's implement the actual transcription event:
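Your version may differ from the one in the repo, but one way to put it together looks like this (the recorder is started inside onopen so no chunks are emitted before the connection is ready):

```jsx
const handleTranscriptionToggle = async () => {
  if (isTranscribing) {
    // Stop everything: recorder, socket, and microphone stream
    mediaRecorderRef.current?.stop();
    socketRef.current?.close();
    streamRef.current?.getTracks().forEach((track) => track.stop());
    setIsTranscribing(false);
  } else {
    setTranscript("");

    // Ask for the microphone and keep references for later cleanup
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    streamRef.current = stream;

    const mediaRecorder = new MediaRecorder(stream);
    mediaRecorderRef.current = mediaRecorder;

    // Connect to Deepgram's live endpoint, authenticating with the API key
    const socket = new WebSocket(
      "wss://api.deepgram.com/v1/listen?model=nova-3",
      ["token", import.meta.env.VITE_DEEPGRAM_API_KEY]
    );
    socketRef.current = socket;

    socket.onopen = () => {
      // Stream each audio chunk to Deepgram as it becomes available
      mediaRecorder.addEventListener("dataavailable", (event) => {
        if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
          socket.send(event.data);
        }
      });
      mediaRecorder.start(250); // emit a chunk every 250 ms
    };

    socket.onmessage = (message) => {
      // Append each new piece of transcript to the existing text
      const data = JSON.parse(message.data);
      const newText = data.channel?.alternatives?.[0]?.transcript;
      if (newText) {
        setTranscript((prev) => prev + " " + newText);
        setIsTranscribing(true);
      }
    };
  }
};
```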
This combines the logic from our simplified handleTranscriptionToggle function with the Web APIs.
When the user clicks the button for the first time, the handler establishes a connection to the Deepgram Live Audio endpoint. We store the Deepgram API key in a Vite environment variable named VITE_DEEPGRAM_API_KEY.
The browser APIs are accessed via their references. The callback that runs when the socket first opens is the same one we defined earlier.
In contrast, the callback that handles incoming messages uses the setTranscript function to append new transcripts to the transcript state. It also uses setIsTranscribing to change isTranscribing to true.
Put everything together to get the complete component, as seen in the GitHub repo: https://github.com/Neurl-LLC/React_DeepGram_Transcription_App/tree/main/src. Note that the code in the repository adds styling through the App.css and index.css files. The styling makes the app look like this:


Conclusion: Build a Real-Time Transcription App with React and Deepgram
The application we’ve built is simple but powerful. With real-time transcription powered by Deepgram’s Nova-3 model, you can use it out of the box for things like interviews, meetings, or podcasts. And because Nova-3 supports real-time multilingual transcription, this app isn’t limited to just English; it can handle many languages on the fly.
This isn’t just an app. It’s a foundation. Whether you're building a voice AI agent, an accessibility tool, a live captioning service, or your twist on real-time audio experiences, this gives you a solid place to start.
Resources for Further Building
Here are several helpful references and resources for learning more about building with Deepgram: