Article·Tutorials·Jun 2, 2025

Build a Real-Time Transcription App with React and Deepgram


10 min read

By Stephen Oladele and Eteimorde Youdiowei


⏩ TL;DR

  • To build a real-time transcription application, you need two main components: a frontend (the UI) and a backend (the server hosting the speech-to-text (STT) models).
  • The frontend is a user interface that captures microphone input and lets users interact with the app.
  • The backend processes the audio and returns transcription results in real time while the user speaks.
  • Deepgram paired with React is an excellent combo for building responsive, real-time transcription applications.
  • See the GitHub repo here.

Imagine opening a web page, and as soon as it loads, a sleek application appears. You start speaking into your microphone, and your words instantly appear on the screen, almost as if the app is reading your mind. Fast. Accurate. Seamless. This is a real-time transcription app, and in this tutorial, you’ll learn exactly how to build one yourself.

We'll use Deepgram’s Nova-3, a high-accuracy, domain-customizable, low-latency speech-to-text model, to bring this idea to life. 

We'll stream live audio data to Deepgram's transcription server using their WebSocket API for real-time transcription.

But a powerful backend is only half the equation. To match the speed and responsiveness of the transcription server, we need a modern frontend framework. That’s where React comes in. React helps us build a dynamic user interface (UI) that updates in sync with the user's speech.

Here’s what we’ll be building:

Along the way, you'll learn how to:

  • Use navigator.mediaDevices.getUserMedia() to access the user's microphone
  • Record audio with the MediaRecorder API
  • Stream audio in real time using the browser’s native WebSocket API
  • Display live transcription results inside a React application

Let’s dive in and build your very own real-time transcription app.

Note: You can find the completed GitHub repository here.

Architecture of a Real-Time Transcription App

A real-time transcription application consists of two main components: the frontend and the backend.

The frontend is responsible for the user interface, which the user interacts with directly. More importantly, it also manages access to the user's microphone.

As the user speaks, their voice is captured and broken into small audio chunks, typically around 250 milliseconds each. These chunks are then streamed to the backend in real time.

The backend hosts a speech-to-text model, which receives each audio chunk and responds with the corresponding transcript. This continuous exchange of audio and transcribed text happens over a real-time communication protocol—in this case, WebSocket—for rapid transcription.

Live Audio with Deepgram’s WebSocket API

Instead of building and deploying our own transcription backend, we'll take advantage of Deepgram's transcription endpoints, which provide access to its state-of-the-art speech-to-text models.

Deepgram offers two primary types of transcription endpoints:

  • Pre-recorded audio: This endpoint accepts complete audio files (e.g., MP3, WAV) and returns their full transcription after processing.
  • Live Audio: This endpoint accepts streaming audio in real time and responds continuously with transcriptions as the audio is received.

For our real-time application, we'll use Live Audio, which is designed for low-latency transcription and supports chunked audio input.

The WebSocket URL is:

wss://api.deepgram.com/v1/listen

This is a secure WebSocket connection that stays open to receive audio data in chunks. As audio comes in, the API responds with JSON payloads containing details such as the transcript text, word-level timings, confidence scores, and flags like is_final.

For example, if you send an audio chunk that says “Hello, world! Welcome to Deepgram!”, the API will respond with the following:

{
  "channel": {
    "alternatives": [
      {
        "transcript": "Hello, world! Welcome to Deepgram!",
        "confidence": 0.98,
        "words": [
          {
            "word": "hello",
            "start": 0.1,
            "end": 0.5,
            "confidence": 0.99,
            "punctuated_word": "Hello,"
          },
          {
            "word": "world",
            "start": 0.6,
            "end": 0.8,
            "confidence": 0.98,
            "punctuated_word": "world!"
          },
          {
            "word": "welcome",
            "start": 0.9,
            "end": 1.2,
            "confidence": 0.97,
            "punctuated_word": "Welcome"
          },
          {
            "word": "to",
            "start": 1.3,
            "end": 1.4,
            "confidence": 0.99,
            "punctuated_word": "to"
          },
          {
            "word": "deepgram",
            "start": 1.5,
            "end": 1.9,
            "confidence": 0.98,
            "punctuated_word": "Deepgram!"
          }
        ]
      }
    ]
  },
  "metadata": {
    "model_info": {
      "name": "nova-2",
      "version": "1.0.0",
      "arch": "transformer"
    },
    "request_id": "987fcdeb-51a2-43b7-91e4-c95bafcda21a",
    "model_uuid": "123e4567-e89b-12d3-a456-426614174000"
  },
  "type": "Results",
  "duration": 2,
  "start": 0,
  "is_final": true,
  "speech_final": true
}
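
Most of what we need lives under channel.alternatives in each message. Here's a minimal sketch of a small helper (hypothetical, not part of Deepgram's SDK) that pulls out the fields this tutorial uses:

// A sketch of extracting the useful fields from one parsed Live Audio message.
function extractTranscript(received) {
  const alternative = received.channel?.alternatives?.[0];
  return {
    transcript: alternative?.transcript,   // the transcribed text
    confidence: alternative?.confidence,   // overall confidence for this segment
    isFinal: received.is_final,            // true once this segment won't change
  };
}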

To interact with the Live Audio endpoint, you can use any WebSocket client from any programming language. 

However, since we’re building a web application, we’ll use the browser’s built-in WebSocket API to establish a real-time connection with Deepgram.

Core Browser APIs for Real-Time Transcription

Before we start building our React application, let’s take a closer look at the core browser APIs that make real-time transcription possible in the browser. 

There are three main APIs we’ll be working with:

  • MediaDevices
  • MediaRecorder
  • WebSocket

First up is the MediaDevices API, which lets us access the user's microphone, after asking for permission, of course. If the user allows it, we can capture the audio stream using getUserMedia().

To get just audio, we pass { audio: true } as a constraint:

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
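
Keep in mind that getUserMedia() returns a promise that rejects if the user blocks access (or no microphone is available), so in practice you'd guard the call. A minimal sketch, assuming you just want to surface the error:

// getUserMedia() rejects if permission is denied or no input device exists.
let stream;
try {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (err) {
  if (err.name === 'NotAllowedError') {
    console.error('Microphone permission was denied.');
  } else {
    console.error('Could not access the microphone:', err);
  }
}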

Once we have the audio stream, we pass it to the MediaRecorder API, which will handle breaking it up into chunks that we can send to Deepgram’s server:

const mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
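
Browser support for container formats varies, so you may want to check the MIME type first and fall back to the browser's default if audio/webm isn't available. A minimal sketch:

// Fall back to the browser's default container if audio/webm isn't supported.
const options = MediaRecorder.isTypeSupported('audio/webm')
  ? { mimeType: 'audio/webm' }
  : undefined;
const mediaRecorder = new MediaRecorder(stream, options);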

To chunk the audio, we call the .start() method with a timeslice value, 250 milliseconds in our case:

mediaRecorder.start(250);

This tells the recorder to generate a new audio chunk every 250 ms.


Next, we need to establish a WebSocket connection to the Live Audio Endpoint using the WebSocket API. This is what allows us to stream audio in real time and get live transcriptions back:

const socket = new WebSocket('wss://api.deepgram.com/v1/listen?model=nova-3', [ 'token', DEEPGRAM_API_KEY ]);

In the WebSocket URL, we specify the model we want to use, in this case nova-3. For authentication, we pass our Deepgram API key through the WebSocket subprotocol (the ['token', DEEPGRAM_API_KEY] array), since browser WebSockets can't set custom headers like Authorization.

Once the connection is open, we register a callback. Inside that callback, we add a dataavailable event listener to the MediaRecorder so that each time a new chunk is ready, we send it over the WebSocket:

socket.onopen = () => {
  mediaRecorder.addEventListener('dataavailable', (event) => {
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(event.data);
    }
  });  
  mediaRecorder.start(250);
};

We also need to handle incoming messages from Deepgram, which are the transcription results. So we define another callback to log those to the console:

socket.onmessage = (message) => {
  const received = JSON.parse(message.data);
  const result = received.channel.alternatives[0]?.transcript;
  if (result) {
    console.log(result);
  }
};
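
In a real application, you'll also want to know when the connection fails or closes, so it's worth registering onerror and onclose handlers as well. A minimal sketch:

// Surface connection problems and note when the socket closes.
socket.onerror = (event) => {
  console.error('WebSocket error:', event);
};

socket.onclose = () => {
  console.log('WebSocket connection closed');
};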

And that’s it! These are the core browser APIs we’ll use to build our real-time transcription app. 

Now let’s move on to wiring it all up inside a React component.

Build the React Application

Let's get started with building the React transcription application. The app is simple: it has a single button that starts transcribing your audio when clicked, and clicking it again stops the transcription.

Step 1: Bootstrap the Application with Vite

To get started, we’ll use Vite to scaffold our React application. Run npm create vite@latest and follow the prompts to create a React app.

Once the app is created, remove all the content from App.css and index.css to get rid of the default styles. Then update the App.jsx file with the code below:

function App() {
  return (
    <div className="app">
      <h1 className="header">Real-Time Transcription</h1>
  
      <button className="toggle-button">
        Start Transcription
      </button>
  
      <div className="transcript-box">
        Listening
      </div>
    </div>
  );  
}

export default App;

When you run the app, you should see this basic UI:

Now that we have our app bootstrapped, let’s start implementing the functionality.

Step 2: Define State and References

Our app needs to manage state and hold references to the browser APIs. We need two pieces of state:

  • transcript to store the text we get back from Deepgram.
  • isTranscribing to track whether the app is currently recording and transcribing.

We’ll also define three references for non-UI-related data:

  • socketRef for the WebSocket connection
  • mediaRecorderRef for the MediaRecorder instance
  • streamRef for the user’s microphone stream

We can now define all of these in our component:

import { useState, useRef } from 'react';

function App() {
  const [transcript, setTranscript] = useState('');
  const [isTranscribing, setIsTranscribing] = useState(false);
  const socketRef = useRef(null);
  const mediaRecorderRef = useRef(null);
  const streamRef = useRef(null);

  return (
    <div className="app">
      <h1 className="header">Real-Time Transcription</h1>
  
      <button className="toggle-button">
        Start Transcription
      </button>
  
      <div className="transcript-box">
        Listening
      </div>
    </div>
  );  
}

export default App;

Step 3: Define the Transcription Event Handler

We want a situation where clicking the button starts transcription, and the UI reflects that transcription has begun. To achieve this, let's add an event handler that triggers when the button is clicked.

This event handler will be called handleTranscriptionToggle. Let's implement a simple version of this handler:

const handleTranscriptionToggle = async () => {
  if (isTranscribing) {
    console.log("End Transcription");
    setIsTranscribing(false);
  } else {
      setTranscript("This is the transcription!");  
      setIsTranscribing(true);
  }
};

This version of the event handler doesn't include any transcription logic; it simply demonstrates how the toggle will behave once that logic is in place.

When the user clicks the button for the first time, execution falls into the else branch, since isTranscribing is initially false. This sets the transcript and flips isTranscribing to true.

The next click hits the if branch and does the opposite. Let's update our JSX to reflect these changes:

import { useState, useRef } from 'react';

function App() {
  const [transcript, setTranscript] = useState('');
  const [isTranscribing, setIsTranscribing] = useState(false);
  const socketRef = useRef(null);
  const mediaRecorderRef = useRef(null);
  const streamRef = useRef(null);
  
  const handleTranscriptionToggle = async () => {
    if (isTranscribing) {
      console.log("End Transcription");
      setIsTranscribing(false);
    } else {
        setTranscript("This is the transcription!");  
        setIsTranscribing(true);
    }
  };
  
  return (
    <div className="app">
      <h1 className="header">Real-Time Transcription</h1>
  
      <button onClick={handleTranscriptionToggle} className="toggle-button">
        {isTranscribing ? 'Stop Transcription' : 'Start Transcription'}
      </button>
      
     <div className="transcript-box">
       {transcript || (isTranscribing ? 'Listening...' : 'Click the button to begin')}
     </div>
   </div>

  );  
}

export default App;

Now, let's implement the actual transcription event:

const handleTranscriptionToggle = async () => {
  if (isTranscribing) {
    mediaRecorderRef.current?.stop();
    streamRef.current?.getTracks().forEach((track) => track.stop());
    socketRef.current?.close();
    setIsTranscribing(false);
  } else {
    try {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      streamRef.current = stream;

      const mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
      mediaRecorderRef.current = mediaRecorder;

      const socket = new WebSocket('wss://api.deepgram.com/v1/listen?model=nova-3', [
        'token',
        import.meta.env.VITE_DEEPGRAM_API_KEY
      ]);
      socketRef.current = socket;

      socket.onopen = () => {
        mediaRecorder.addEventListener('dataavailable', (event) => {
          if (socket.readyState === WebSocket.OPEN) {
            socket.send(event.data);
          }
        });
        mediaRecorder.start(250);
      };

      socket.onmessage = (message) => {
        const received = JSON.parse(message.data);
        const result = received.channel.alternatives[0]?.transcript;
        if (result) {
          setTranscript((prev) => prev + ' ' + result);
        }
      };

      setIsTranscribing(true);
    } catch (err) {
      console.error('Failed to start transcription:', err);
    }
  }
};

This combines the logic from our simplified handleTranscriptionToggle function with the Web APIs. 

When the user clicks the button for the first time, the function establishes a connection to the Deepgram Live Audio endpoint. We store the Deepgram API key in a Vite environment variable named VITE_DEEPGRAM_API_KEY.
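
Vite loads this from a .env file at the project root and exposes it to client code through import.meta.env because of the VITE_ prefix. The value below is just a placeholder:

VITE_DEEPGRAM_API_KEY=your_deepgram_api_key_here

Keep in mind that anything bundled into client-side code is visible to end users, so for production you'd typically proxy the connection through your own server or issue short-lived credentials rather than shipping a long-lived key.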

The browser APIs are accessed via their refs, and the callback that runs when the socket opens is the same one we defined earlier.

The callback that handles incoming messages uses setTranscript to append each new result to the transcript state. Once everything is wired up, we call setIsTranscribing(true) so the UI reflects that transcription has started.
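
One thing the handler doesn't cover is the component unmounting mid-session. If you want to be thorough, a cleanup effect along these lines (a sketch that reuses the refs we defined earlier, with useEffect imported from 'react' alongside useState and useRef) stops the recorder, releases the microphone, and closes the socket on unmount:

useEffect(() => {
  return () => {
    // Stop recording (if active), release the microphone, and close the socket
    // when the component unmounts.
    if (mediaRecorderRef.current?.state === 'recording') {
      mediaRecorderRef.current.stop();
    }
    streamRef.current?.getTracks().forEach((track) => track.stop());
    socketRef.current?.close();
  };
}, []);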

Put everything together to have the complete component, as shown in the GitHub repo (https://github.com/Neurl-LLC/React_DeepGram_Transcription_App/tree/main/src). Note that the code in the repository adds styling through the App.css and index.css files. The styling makes the app look like this:

Conclusion: Build a Real-Time Transcription App with React and Deepgram

The application we’ve built is simple but powerful. With real-time transcription powered by Deepgram’s Nova-3 model, you can use it out of the box for things like interviews, meetings, or podcasts. And because Nova-3 supports real-time multilingual transcription, this app isn’t limited to just English; it can handle many languages on the fly.

This isn’t just an app. It’s a foundation. Whether you're building a voice AI agent, an accessibility tool, a live captioning service, or your twist on real-time audio experiences, this gives you a solid place to start.

Resources for Further Building

Here are several helpful references and resources for learning more about building with Deepgram:
