Article·Nov 18, 2025

How to Build a Highlight Read Aloud App with Aura-2 Text-to-Speech

Today’s tutorial demonstrates an app that uses Deepgram’s JavaScript SDK and the Aura-2 Text-to-Speech API to read out any highlighted text on a page.

12 min read

By Zian (Andy) Wang

AI Content Fellow

Audiobooks are great: they let you listen to literature on the go and interpret the story through an auditory experience, which can often bring new meaning to books you’ve already read.

But even when I’m not on the go, or when I’m simply tired of reading a long article, an “audiobook” tool can be extremely helpful: I can highlight a sentence, or an entire paragraph within a section, and have a voice read it out to me instantly.

Thankfully, with Deepgram’s Text-to-Speech API, such a feature can be implemented without hiring a professional voice-over artist for every long text you encounter on the internet.

Deepgram’s Text-to-Speech API is incredibly versatile: it can handle complex, industry-grade applications, yet it remains trivially simple to use. Today’s tutorial demonstrates an app that uses Deepgram’s JavaScript SDK and the Aura-2 Text-to-Speech API to read out any highlighted text on a page.
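To get a sense of just how simple, here is a minimal server-side sketch that generates speech with the SDK’s speak.request method and writes it to a file. It follows the SDK’s documented Node.js usage and is separate from the browser app we build below; option names and defaults may vary between SDK versions, so treat it as a sketch rather than a drop-in implementation.

// Minimal Node.js sketch (not part of the browser app built in this tutorial).
// Assumes the Deepgram JavaScript SDK's speak.request pattern; check the docs
// for your installed SDK version.
import { createClient } from "@deepgram/sdk";
import fs from "fs";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

const response = await deepgram.speak.request(
  { text: "Hello from Aura-2!" },
  { model: "aura-2-thalia-en" }
);

// The response exposes the generated audio as a web ReadableStream.
const stream = await response.getStream();
const audio = Buffer.from(await new Response(stream).arrayBuffer());
fs.writeFileSync("output.mp3", audio);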

Structuring the HTML Foundation

Every web application starts with a solid HTML structure. For this project, we’ll create a single index.html file that will contain all of our HTML, styling, and JavaScript.

Imports and Styling

First, let’s set up the <head> of our HTML file. This section is crucial for importing our dependencies and defining basic styles. We will:

  • Set up meta tags for responsive design.
  • Import the Tailwind CSS library via a CDN for styling.
  • Import the ‘Inter’ font from Google Fonts for a clean, modern look.
  • Add a small <style> block to customize the color of highlighted text, improving the user experience.

Copy the code below to begin your index.html file.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Deepgram Text-to-Speech Highlighter</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">
    <style>
        body {
            font-family: 'Inter', sans-serif;
        }
        /* Custom selection color */
        ::selection {
            background-color: #fde68a; /* A pleasant yellow */
            color: #1f2937;
        }
    </style>
</head>

The Body Section of the HTML

Next, we’ll build the user interface inside the <body> tag. The structure consists of a main container that holds all of our elements: an input for the Deepgram API key, a textarea for the user’s text, a button to load the text into a reading area, and a section for audio controls that will appear once audio is ready. At the end of the body, we include an empty <script type="module"> tag, which is where all of our application logic will go in the next steps.

Add the following <body> section to your index.html file, right after the closing </head> tag.

<body class="bg-gray-900 text-gray-200 flex items-center justify-center min-h-screen p-4">

    <div class="w-full max-w-3xl mx-auto bg-gray-800 rounded-2xl shadow-2xl p-6 md:p-8 space-y-6">
        
        <!-- Header -->
        <div class="text-center">
            <h1 class="text-3xl font-bold text-white">Deepgram TTS Highlighter</h1>
            <p class="text-gray-400 mt-2">Paste your text, highlight a selection, and listen.</p>
        </div>

        <!-- API Key Input -->
        <div class="space-y-2">
            <label for="apiKey" class="text-sm font-medium text-gray-300">Deepgram API Key</label>
            <input type="password" id="apiKey" placeholder="Enter your Deepgram API Key" class="w-full px-4 py-2 bg-gray-700 border border-gray-600 rounded-lg text-white placeholder-gray-500 focus:outline-none focus:ring-2 focus:ring-indigo-500 transition-all duration-200">
            <p class="text-xs text-amber-500">
                <span class="font-bold">Note:</span> Your API key is used only in your browser and is not saved. This is for demo purposes only.
            </p>
        </div>

        <!-- Text Input -->
        <div class="space-y-2">
             <label for="textInput" class="text-sm font-medium text-gray-300">Your Text</label>
            <textarea id="textInput" rows="6" class="w-full px-4 py-2 bg-gray-700 border border-gray-600 rounded-lg text-white placeholder-gray-500 focus:outline-none focus:ring-2 focus:ring-indigo-500 transition-all duration-200" placeholder="Paste your article, notes, or any text here..."></textarea>
            <button id="loadTextBtn" class="w-full bg-indigo-600 text-white font-semibold py-2 px-4 rounded-lg hover:bg-indigo-700 focus:outline-none focus:ring-2 focus:ring-offset-2 focus:ring-offset-gray-800 focus:ring-indigo-500 transition-all duration-200 transform active:scale-95">Load Text to Display</button>
        </div>
        
        <!-- Text Display & Controls -->
        <div class="space-y-4">
            <h2 class="text-lg font-semibold text-white border-b border-gray-700 pb-2">Reading Area</h2>
            
            <!-- Message/Status Area -->
            <div id="statusMessage" class="hidden p-3 rounded-lg text-center text-sm"></div>
            
            <div id="textDisplay" class="bg-gray-900/50 p-6 rounded-lg h-64 overflow-y-auto text-gray-300 leading-relaxed select-text transition-all duration-200">
                <p class="text-gray-500 italic">Your loaded text will appear here. Highlight any part of it to start listening.</p>
            </div>

            <!-- Audio Player Controls -->
            <div id="audioControls" class="hidden flex items-center justify-center gap-6 bg-gray-700/50 p-4 rounded-xl">
                <button id="playPauseBtn" class="p-3 bg-indigo-600 rounded-full text-white hover:bg-indigo-500 focus:outline-none focus:ring-2 focus:ring-offset-2 focus:ring-offset-gray-800 focus:ring-indigo-500 transition-all duration-200">
                    <!-- SVG icon will be injected here -->
                </button>
                <div class="flex items-center gap-3">
                    <span class="text-sm font-medium">Speed:</span>
                    <input type="range" id="speedControl" min="0.5" max="2" step="0.1" value="1" class="w-32 md:w-48 cursor-pointer accent-indigo-500">
                    <span id="speedValue" class="text-sm font-semibold w-8 text-center">1.0x</span>
                </div>
            </div>
            <audio id="audioPlayer" class="hidden"></audio> 
        </div>
    </div>

    <script type="module">
        // Our JavaScript will go here
    </script>
</body>
</html>

Setting Up the JavaScript Environment

The Deepgram JavaScript SDK is typically used with Node.js. However, it can be used just as easily in lightweight, browser-based, front-end-only applications.

In the browser, though, there are a couple of quirks to resolve before we can use it. We need to solve two key issues:

  1. The Buffer object: The SDK relies on the Buffer object for handling binary data, which is a built-in feature of Node.js but does not exist in browsers. We must provide a “polyfill”, which is a piece of code that implements this missing feature.
  2. Module Compatibility: We need to use the modern ES Module (+esm) version of the SDK, which is designed for browser-based import statements.

We’ll add our code inside the <script type="module"> tag at the end of the <body>. First, we import Buffer and attach it to the global window object. This crucial step must happen before we import the Deepgram SDK, ensuring the SDK can find and use our polyfill. Then, we can import the SDK, select our DOM elements, and declare variables for our Web Audio API setup.

Update your <script> tag to look like this:

   <script type="module">
        // First, import Buffer and polyfill the window object.
        // This makes the Buffer class available globally for the Deepgram SDK to use.
        import { Buffer } from 'https://cdn.jsdelivr.net/npm/buffer@6.0.3/+esm';
        window.Buffer = Buffer;

        // Now, import the Deepgram SDK. It will now find window.Buffer.
        import { createClient, LiveTTSEvents } from 'https://cdn.jsdelivr.net/npm/@deepgram/sdk/+esm';

        const apiKeyInput = document.getElementById('apiKey');
        const textInput = document.getElementById('textInput');
        const loadTextBtn = document.getElementById('loadTextBtn');
        const textDisplay = document.getElementById('textDisplay');
        const audioControls = document.getElementById('audioControls');
        const playPauseBtn = document.getElementById('playPauseBtn');
        const speedControl = document.getElementById('speedControl');
        const speedValue = document.getElementById('speedValue');
        const statusMessage = document.getElementById('statusMessage');

        // Web Audio API setup
        let audioContext;
        let audioSource;
        let isPlaying = false;

        // SVG Icons for play/pause button
        const playIcon = `<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><polygon points="5 3 19 12 5 21 5 3"></polygon></svg>`;
        const pauseIcon = `<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="6" y="4" width="4" height="16"></rect><rect x="14" y="4" width="4" height="16"></rect></svg>`;

        // Initialize button icon
        playPauseBtn.innerHTML = playIcon;

        function initializeAudioContext() {
            if (!audioContext) {
                audioContext = new (window.AudioContext || window.webkitAudioContext)();
            }
        }

        // --- Event Listeners and Functions will go here ---
    </script>

Connecting to Deepgram and Handling Events

Browsers enforce security restrictions known as CORS (Cross-Origin Resource Sharing) that block direct REST API calls from a web page to a different domain, so we use WebSockets instead, which are not subject to the same limitation. The Deepgram SDK provides the speak.live method for exactly this scenario, allowing for real-time, two-way communication.

Our core logic will live in an async function called getTextToSpeech. This function is triggered when a user finishes highlighting text. It establishes the WebSocket connection and sets up listeners for key events:

  • LiveTTSEvents.Open: Fires when the connection is established. We then send the highlighted text to Deepgram.
  • LiveTTSEvents.Audio: Fires repeatedly as audio data streams back from the server. We collect these chunks of data in an array.
  • LiveTTSEvents.Close: Fires when all audio has been sent. We then process the collected chunks and play them.
  • LiveTTSEvents.Error: Catches any connection errors.

Let’s add our event listeners and the main getTextToSpeech function to the script. This section of the script is a little long, but the logic is straightforward and there are plenty of comments if you get lost.

   <script type="module">
        // ... (previous code from Step 2) ...

        function initializeAudioContext() {
            if (!audioContext) {
                audioContext = new (window.AudioContext || window.webkitAudioContext)();
            }
        }

        // --- Event Listeners ---
        
        loadTextBtn.addEventListener('click', () => {
            const text = textInput.value;
            if (text.trim()) {
                const formattedText = text.split(/\n+/).map(p => `<p>${p}</p>`).join('');
                textDisplay.innerHTML = formattedText;
                showStatus('Text loaded. Highlight a sentence to begin.', 'success');
            } else {
                textDisplay.innerHTML = `<p class="text-gray-500 italic">Please enter some text in the box above first.</p>`;
            }
        });

        // NOTE: Playback control listeners will be added in a later step.
        
        // Attach event listener for text selection
        textDisplay.addEventListener('mouseup', handleSelection);

        // --- Core Functions ---
        
        function handleSelection() {
            const selectedText = window.getSelection().toString().trim();
            if (selectedText.length > 0) {
                getTextToSpeech(selectedText);
            }
        }

        async function getTextToSpeech(text) {
            const apiKey = apiKeyInput.value.trim();
            if (!apiKey) {
                showStatus('Please enter your Deepgram API Key.', 'error');
                return;
            }
            
            initializeAudioContext();
            
            if (audioSource) {
                audioSource.stop();
            }

            if (audioContext.state === 'suspended') {
                audioContext.resume();
            }

            showStatus('Generating audio...', 'loading');
            
            try {
                const deepgramClient = createClient(apiKey);
                const connection = deepgramClient.speak.live({ model: "aura-2-thalia-en" });

                const audioChunks = [];

                connection.on(LiveTTSEvents.Open, () => {
                    connection.sendText(text);
                    connection.flush();
                });

                connection.on(LiveTTSEvents.Audio, (audioData) => {
                    audioChunks.push(audioData);
                });

                connection.on(LiveTTSEvents.Close, () => {
                    if (audioChunks.length > 0) {
                        // We will define these functions in the next step
                        const audioBuffer = concatenateAndDecode(audioChunks);
                        playAudio(audioBuffer);
                        hideStatus();
                        audioControls.classList.remove('hidden');
                    } else {
                        showStatus('No audio data received.', 'error');
                    }
                });

                connection.on(LiveTTSEvents.Error, (error) => {
                    console.error('Deepgram WebSocket Error:', error);
                    showStatus(`Error: ${error.message || 'A WebSocket error occurred.'}`, 'error');
                });

            } catch (error) {
                console.error('Error setting up TTS audio:', error);
                showStatus(`Error: ${error.message || 'Failed to set up audio generation.'}`, 'error');
            }
        }

        // --- Helper functions for status UI ---
        
        function showStatus(message, type = 'info') {
            statusMessage.textContent = message;
            statusMessage.classList.remove('hidden', 'bg-blue-900', 'bg-green-900', 'bg-red-900');
            
            let bgColor = 'bg-blue-900';
            if (type === 'success') bgColor = 'bg-green-900';
            if (type === 'error') bgColor = 'bg-red-900';
            
            statusMessage.classList.add(bgColor);
        }
        
        function hideStatus() {
            statusMessage.classList.add('hidden');
        }

        // --- Audio Processing Functions will go here ---
    </script>

Processing Raw Audio with the Web Audio API

The data streaming from Deepgram is raw 16-bit PCM (Pulse-Code Modulation) audio. This is a high-quality, uncompressed format, but standard HTML <audio> tags can’t play it without a proper file container like .wav or .mp3. To solve this, we use the browser’s powerful Web Audio API, which is designed for this kind of low-level audio manipulation.

We’ll create two helper functions. The first, concatenateAndDecode, will take our array of audio chunks, merge them into a single data stream, and convert the 16-bit integer format into the 32-bit floating-point format that the Web Audio API requires. The second, playAudio, will take this processed data, load it into an AudioBuffer, and play it.

Add the following two functions to your script.

   <script type="module">
        // ... (previous code from Step 3) ...

        function concatenateAndDecode(chunks) {
            // Calculate total length of all chunks
            const totalLength = chunks.reduce((acc, chunk) => acc + chunk.length, 0);
            
            // Create a new Uint8Array to hold the combined data
            const combined = new Uint8Array(totalLength);
            
            // Copy each chunk into the combined array
            let offset = 0;
            for (const chunk of chunks) {
                combined.set(chunk, offset);
                offset += chunk.length;
            }

            // Deepgram's Aura models send 16-bit PCM audio, so we have 2 bytes per sample.
            const pcmData = new Int16Array(combined.buffer);
            
            // Create a Float32Array for the Web Audio API, normalizing samples to [-1, 1]
            const float32Data = new Float32Array(pcmData.length);
            for (let i = 0; i < pcmData.length; i++) {
                float32Data[i] = pcmData[i] / 32768.0;
            }
            
            return float32Data;
        }

        function playAudio(float32Data) {
            // Assume a sample rate of 24000, common for Aura models
            const sampleRate = 24000;
            const audioBuffer = audioContext.createBuffer(1, float32Data.length, sampleRate);
            audioBuffer.copyToChannel(float32Data, 0);

            audioSource = audioContext.createBufferSource();
            audioSource.buffer = audioBuffer;
            audioSource.playbackRate.value = parseFloat(speedControl.value);
            audioSource.connect(audioContext.destination);
            audioSource.start(0);

            audioSource.onended = () => {
                isPlaying = false;
                playPauseBtn.innerHTML = playIcon;
            };

            isPlaying = true;
            playPauseBtn.innerHTML = pauseIcon;
            
            audioContext.onstatechange = () => {
                 if(audioContext.state === 'running') {
                    isPlaying = true;
                    playPauseBtn.innerHTML = pauseIcon;
                } else if (audioContext.state === 'suspended') {
                    isPlaying = false;
                    playPauseBtn.innerHTML = playIcon;
                }
            };
        }
        
        function showStatus(message, type = 'info') {
            // ... (showStatus function from before) ...
        }
        
        function hideStatus() {
            // ... (hideStatus function from before) ...
        }

    </script>

Implementing Playback Controls

With our audio processing in place, the final step is to wire up the play/pause and speed controls. Instead of controlling an <audio> element, these controls will interact directly with the AudioContext and the AudioBufferSourceNode. The play/pause button will suspend() and resume() the AudioContext, and the speed slider will adjust the playbackRate property of our audio source. This gives us robust and responsive control over the audio.

Add the final event listeners to your script. This completes the application.

   <script type="module">
        // ... (imports, variable declarations, etc. from Step 2) ...

        function initializeAudioContext() {
            // ... (function from Step 2) ...
        }

        // --- Event Listeners ---
        
        loadTextBtn.addEventListener('click', () => {
             // ... (function from Step 3) ...
        });

        playPauseBtn.addEventListener('click', () => {
            if (!audioContext) return;
            if (audioContext.state === 'suspended') {
                audioContext.resume();
            } else if (audioContext.state === 'running') {
                audioContext.suspend();
            }
        });

        speedControl.addEventListener('input', () => {
            const speed = parseFloat(speedControl.value);
            if (audioSource) {
                audioSource.playbackRate.value = speed;
            }
            speedValue.textContent = `${speed.toFixed(1)}x`;
        });
        
        textDisplay.addEventListener('mouseup', handleSelection);

        // --- Core Functions ---
        
        function handleSelection() {
            // ... (function from Step 3) ...
        }

        async function getTextToSpeech(text) {
            // ... (function from Step 3) ...
        }

        function concatenateAndDecode(chunks) {
            // ... (function from Step 4) ...
        }

        function playAudio(float32Data) {
            // ... (function from Step 4) ...
        }
        
        function showStatus(message, type = 'info') {
            // ... (function from Step 3) ...
        }
        
        function hideStatus() {
            // ... (function from Step 3) ...
        }
    </script>

Conclusion

You have now built a fully functional, front-end only text-to-speech application using Deepgram’s SDK. By polyfilling the Buffer object, using the SDK’s WebSocket interface to bypass CORS limitations, and processing the raw audio with the Web Audio API, we overcame the common hurdles of using a Node.js-centric library in the browser. From here, you can experiment with different Deepgram voices, add more advanced features, or integrate this functionality into a larger project.
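If you want to try a different voice, for example, it is just a matter of changing the model string passed to speak.live. The voice name below is only illustrative; check Deepgram’s documentation for the current list of Aura-2 voices.

// Swap the voice by changing the model option.
// The voice name here is illustrative; see Deepgram's docs for available Aura-2 voices.
const connection = deepgramClient.speak.live({ model: "aura-2-andromeda-en" });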

One Step Further

The logic we’ve built can be adapted into a powerful Chrome Extension that reads highlighted text on any webpage, which is an entirely achievable and great next step. Here’s a high-level overview of how you’d approach it:

  • manifest.json: This is the core configuration file for any Chrome Extension. You would define the extension’s name, version, permissions, and what scripts to run. You would need permissions like activeTab and potentially storage to securely save the user’s API key.
  • Content Script: A content script is a JavaScript file that runs in the context of a webpage. You would use a content script to detect when a user highlights text (mouseup event) on any page and then send this selected text to your background script.
  • Background Script (Service Worker): This is where the core logic from our <script type="module"> would live. The background script would receive the highlighted text from the content script, establish the WebSocket connection to Deepgram, process the audio data, and then play it back.
  • User Interface: Instead of a full webpage, you might have a small popup for entering the API key or a simple “play” button that appears near the highlighted text.

To get started, the official Chrome Extension development documentation is the best resource.
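As a rough starting point, here is a minimal sketch of the message passing between a content script and the background service worker. The file names and message shape are assumptions for illustration; the Deepgram connection and audio playback would reuse the logic we wrote above.

// content-script.js (illustrative file name): runs on the page, detects a
// selection on mouseup, and forwards the text to the background service worker.
document.addEventListener('mouseup', () => {
    const selectedText = window.getSelection().toString().trim();
    if (selectedText.length > 0) {
        chrome.runtime.sendMessage({ type: 'READ_ALOUD', text: selectedText });
    }
});

// background.js (service worker, illustrative file name): receives the text and
// would run the Deepgram WebSocket + audio logic from this tutorial.
chrome.runtime.onMessage.addListener((message) => {
    if (message.type === 'READ_ALOUD') {
        // getTextToSpeech(message.text); // the core function we built above
    }
});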
