
Introduction

Deepgram helps you harness the potential of your voice data with intelligent speech models built to scale and continuously improve over time. The API is the gateway to Deepgram's Brain AI models and gives you customizable access to fast, high-accuracy transcription and phonetic search. Deepgram Brain can understand nearly every audio format available.

Hello World

Let's get started quickly! Grab your favorite HTTP client. In case you don't have an audio file lying around, we'll use an existing one on the internet. To authenticate, we'll use HTTP Basic Auth (over HTTPS) with our username and password (we can create API keys later on).

""" Submit a remote file to Deepgram Brain
"""
import base64
import urllib.request
import json

creds = ('USERNAME', 'PASSWORD')
request = urllib.request.Request(
  'https://brain.deepgram.com/v2/listen',
  method='POST',
  headers={
    'Content-type': 'application/json',
    'Authorization': 'Basic {}'.format(
      base64.b64encode('{}:{}'.format(*creds).encode('utf-8')).decode('utf-8')
    )
  },
  data=json.dumps({
    'url': 'https://www.deepgram.com/examples/interview_speech-analytics.wav'
  }).encode('utf-8')
)

with urllib.request.urlopen(request) as response:
  print(json.loads(response.read()))
# Submit a remote file to Deepgram Brain
curl \
  -X POST \
  -u USERNAME:PASSWORD \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.deepgram.com/examples/interview_speech-analytics.wav"}' \
  https://brain.deepgram.com/v2/listen

If you already have an audio file lying around on your machine, simply POST its binary contents! Deepgram Brain supports over 40 audio formats, including the popular WAV, MP3, M4A, FLAC, and Opus.

""" Submit a local file to Deepgram Brain
"""
import base64
import urllib.request
import json

creds = ('USERNAME', 'PASSWORD')
request = urllib.request.Request(
  'https://brain.deepgram.com/v2/listen',
  method='POST',
  headers={
    'Authorization': 'Basic {}'.format(
      base64.b64encode('{}:{}'.format(*creds).encode('utf-8')).decode('utf-8')
    )
  },
  data=open('AUDIO.WAV', 'rb').read()
)

with urllib.request.urlopen(request) as response:
  print(json.loads(response.read()))
# Submit a local file to Deepgram Brain
curl \
  -X POST \
  -u USERNAME:PASSWORD \
  --data-binary @AUDIO.WAV \
  https://brain.deepgram.com/v2/listen

Speech Recognition

Deepgram supports high-speed transcription of pre-recorded audio files. This feature is customizable using various query parameters, supports processing with tailored and custom-built AI models, and is extremely fast. Moreover, Deepgram supports over 40 audio codecs, including WAV, MP3, FLAC, and AAC.

All transcription requests should be sent to https://brain.deepgram.com/v2/listen using the POST method.

Usage

Example using raw binary audio and searching for "recorded line"

curl \
  -X POST \
  -u aladdin:opensesame \
  -H "Content-Type: audio/wav" \
  --data-binary @my-file.wav \
  "https://brain.deepgram.com/v2/listen?search=recorded%20line"
import base64
import urllib.request

url = 'https://brain.deepgram.com/v2/listen?search=recorded%20line'
username = 'aladdin'
password = 'opensesame'

headers = {}
headers['Authorization'] = 'Basic {}'.format(
  base64.b64encode('{}:{}'.format(username, password).encode('utf-8')).decode('utf-8')
)
headers['Content-Type'] = 'audio/wav'

data = open('/path/to/audio.wav', 'rb').read()

req = urllib.request.Request(
  url,
  method='POST',
  headers=headers,
  data=data
)
resp = urllib.request.urlopen(req)

# raw response data in resp.read()



const axios = require('axios');
const fs = require('fs');

let url = 'https://brain.deepgram.com/v2/listen';
let username = 'aladdin';
let password = 'opensesame';
let audio = fs.readFileSync('/path/to/audio.wav');

axios({
  method: 'post',
  url: url,
  auth: {
    username: username,
    password: password
  },
  headers: {
    'Content-Type': 'audio/wav'
  },
  params: {
    model: 'meeting',
    multichannel: true,
    punctuate: true,
    search: 'recorded line'
  },
  data: audio
})
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.log('Error happened!: ' + error);
  });

Example using a hosted URL for audio data:

curl \
  -X POST \
  -u aladdin:opensesame \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://www.myhostedaudio.com/path/to/myfile.wav" }' \
  "https://brain.deepgram.com/v2/listen?model=phonecall&keywords=thank&keywords=welcome"
import base64
import json
import urllib.request

url = 'https://brain.deepgram.com/v2/listen'
username = 'aladdin'
password = 'opensesame'

headers = {}
headers['Authorization'] = 'Basic {}'.format(
  base64.b64encode('{}:{}'.format(username, password).encode('utf-8')).decode('utf-8')
)
headers['Content-Type'] = 'application/json'

data = { 'url': 'https://www.myhostedaudio.com/path/to/your/audio' }

req = urllib.request.Request(
  url,
  method='POST',
  headers=headers,
  data=json.dumps(data).encode('utf-8')
)
resp = urllib.request.urlopen(req)

# raw response data in resp.read()
const axios = require('axios');

let url = 'https://brain.deepgram.com/v2/listen';
let username = 'aladdin';
let password = 'opensesame';
let audio = 'https://myhostedaudio.com/path/to/file.wav';

axios({
  method: 'post',
  url: url,
  auth: {
    username: username,
    password: password
  },
  headers: {
    'Content-Type': 'application/json'
  },
  data: {
    url: audio
  }
})
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.log('Error happened!: ' + error);
  });

See the code examples above for complete requests.

Configuration

The following query parameters may be used to configure how Brain processes a submitted audio file:

model (default: general)
The AI model to use for processing uploaded audio. It may be any standard Deepgram Brain model, or a custom model associated with your account. The list of standard Deepgram models is frequently updated, and currently consists of:
  • general: a good, general-purpose model for everyday audio processing. If you aren't sure which model to select, start with this one.
  • phonecall: optimized for low-bandwidth audio phone calls.
  • meeting: optimized for conference room settings: multiple speakers with a single microphone.
If multichannel is set to true, the specified model will be applied to all audio channels. However, it is possible to apply different models to the first two audio channels by specifying, e.g., general:phonecall, which will apply the general model to channel 0 and the phonecall model to channel 1.

language (default: en-US)
A BCP-47 language tag that hints at the primary language spoken. If absent, the primary language will be guessed. Language support is constantly improving in Deepgram Brain, and it is currently optimized for the following languages:
  • English (en-US, en-GB, en-NZ)
  • Spanish (es)
  • Korean (ko)
  • French (fr)
  • Portuguese (pt, pt-BR)
  • Russian (ru)

diarize (beta; default: false)
Whether or not to recognize speaker changes at the word level. If set to true, each word of the resulting transcript will be assigned a speaker number starting at 0. The related max_speakers setting is optional: the maximum number of speakers that can be predicted is 10, and it is not necessary to change this number to predict fewer speakers (e.g., a single-channel phone call with only two speakers is expected to be detected as 2 even if max_speakers=10; the maximum is just that, a maximum). If you know that there will be fewer speakers, you can experiment with setting max_speakers=N where N is 2-10. This reduces the number of possible speakers and may (but is not guaranteed to) improve performance; it is not required.

multichannel (default: false)
Whether or not to treat the audio channels as independent. If your audio has isolated speakers on each channel (e.g., a phone call with one speaker on each audio channel), setting this to true will give you per-channel transcripts. If set to true, it is possible to apply different models to the first two audio channels; see the model description for details.

punctuate (default: false)
Whether or not to add punctuation to the resulting transcript.

alternatives (default: 1)
The maximum number of transcript alternatives to return. Just as human listeners will sometimes hold more than one possible interpretation of the words they hear, Brain is capable of providing multiple alternative interpretations of the audio it listens to. By passing an alternatives parameter greater than 1, you set an upper limit on the number of alternatives you wish to see.

search or search[] (no default)
Ask Brain to search for certain terms in the submitted audio. Note that Deepgram Brain does not search for text patterns in the output transcript, but rather for phonetic patterns in the audio itself; we have found phonetic search of the audio to be more performant than searching the ASR transcription. This parameter may be passed multiple times in a single request to /listen to search for multiple terms.

callback (no default)
You may request that processing of your audio be done asynchronously by specifying the callback parameter. When this parameter is passed, the call to /listen returns immediately with just a request_id. When Deepgram Brain has finished analyzing the audio, the typical response is sent by POST request to the HTTP/HTTPS callback URL, with an appropriate status code. Basic authentication credentials can be embedded in the callback URL. Note that only ports 80, 443, 8080, and 8443 can be used for callbacks.

keywords or keywords[] (no default)
A list of important keywords to which Brain should pay particular attention. When humans are listening to hard-to-decipher speech, knowledge of the context of the conversation helps them recognize certain words that may be mumbled or otherwise distorted. Think of the keywords parameter as a way to give Brain a little context for what it's hearing in a submitted audio file. This parameter may be passed more than once to specify multiple keywords.
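
For example, here is a minimal sketch (in Python, with placeholder credentials and a placeholder audio file) that combines several of these parameters in a single request: it selects the phonecall model, enables punctuation, and searches for two terms.

""" Sketch: combine several /listen query parameters in one request.
    USERNAME, PASSWORD, and AUDIO.WAV are placeholders.
"""
import base64
import json
import urllib.parse
import urllib.request

creds = ('USERNAME', 'PASSWORD')

# Repeated parameters (like search) are simply passed more than once.
params = urllib.parse.urlencode([
  ('model', 'phonecall'),
  ('punctuate', 'true'),
  ('search', 'recorded line'),
  ('search', 'welcome'),
])

request = urllib.request.Request(
  'https://brain.deepgram.com/v2/listen?' + params,
  method='POST',
  headers={
    'Content-Type': 'audio/wav',
    'Authorization': 'Basic {}'.format(
      base64.b64encode('{}:{}'.format(*creds).encode('utf-8')).decode('utf-8')
    )
  },
  data=open('AUDIO.WAV', 'rb').read()
)

with urllib.request.urlopen(request) as response:
  print(json.loads(response.read()))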

Response

Brain's response from the /listen endpoint will be a JSON-formatted ListenResponse object.

ListenResponse

Attribute Type Description
metadata obj JSON-formatted ListenMetadata object.
results obj JSON-formatted ListenResults object.

ListenMetadata

Attribute Type Description
request_id string A unique identifier for the submitted audio and for the derived data returned by Brain.
transaction_key string A blob of text that is useful for Deepgram engineers to debug any problems you encounter. If you need help getting an API call to work correctly, send this key to us so that we can use it as a starting point for investigating the issue.
sha256 string A SHA-256 hash of the audio data submitted.
created timestamp An ISO-8601 timestamp of when the audio was submitted to Brain.
duration float Duration in seconds of the submitted audio.
channels integer Number of channels detected by Brain in the submitted audio.

ListenResults

Attribute Type Description
channels obj[] Array of JSON-formatted ChannelResult objects.

ChannelResult

Object representation of a single channel in the submitted audio.

Attribute Type Description
search obj[] An array of JSON-formatted SearchResult objects.
alternatives obj[] An array of JSON-formatted ResultAlternative objects. This array will have length n, where n is the number passed to /listen with ?alternatives=n.

SearchResult

Attribute Type Description
query string The term searched for.
hits obj[] An array of JSON-formatted Hit objects.

Hit

Attribute Type Description
confidence float A value between 0 and 1 indicating Brain's relative confidence in this hit.
start float The offset from the start of the audio, in seconds, where the hit occurs.
end float The offset from the start of the audio, in seconds, where the hit ends.
snippet string The ASR transcript that corresponds to the time between start and end.

ResultAlternative

Attribute Type Description
transcript string A single string transcript of what Brain hears in this channel of audio.
confidence float A value between 0 and 1 indicating Brain's relative confidence in this transcript.
words obj[] An array of JSON-formatted Word objects.

Word

Attribute Type Description
word string A distinct word heard by Brain.
start float Offset from the start of the audio, in seconds, at which the spoken word starts.
end float Offset from the start of the audio, in seconds, at which the spoken word ends.
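
To make these objects concrete, here is a short sketch that walks a parsed ListenResponse (for example, the dictionary produced by json.loads in the examples above) and prints the transcript and any search hits for each channel. The summarize helper is purely illustrative.

# Sketch: walk a parsed ListenResponse dictionary (the `summarize` helper is illustrative).
def summarize(response):
  print('Request ID:', response['metadata']['request_id'])
  print('Duration (s):', response['metadata']['duration'])

  for index, channel in enumerate(response['results']['channels']):
    # The first alternative is the most likely transcript for this channel.
    best = channel['alternatives'][0]
    print('Channel {} transcript: {}'.format(index, best['transcript']))

    # Each search result corresponds to one ?search=... term.
    for result in channel.get('search', []):
      for hit in result['hits']:
        print('  "{}" at {:.2f}-{:.2f}s (confidence {:.2f}): {}'.format(
          result['query'], hit['start'], hit['end'], hit['confidence'], hit['snippet']))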

Real-time Speech Recognition

Deepgram provides its customers with real-time, streaming transcription via its streaming endpoints. These endpoints are high-performance, full-duplex services running over the tried-and-true WebSocket protocol, making integration with customer pipelines simple thanks to the wide array of client libraries available.

Usage

{
  "channel": [CHANNEL, NUM_CHANNELS],
  "duration": DURATION_IN_SECONDS,
  "start": OFFSET_IN_SECONDS,
  "is_final": IS_FINAL,
  "alternatives": [
    {
      "transcript": TRANSCRIPT,
      "confidence": CONFIDENCE,
      "words": [
        {
          "word": WORD,
          "start": START_TIME,
          "end": END_TIME,
          "confidence": WORD_CONFIDENCE
        },
        ...
      ]
    },
    ...
  ]
}
    
In this example, we use Python 3.6 with the `websockets` package (available on PyPI and installable using `pip`). We use asynchronous programming for performance.
    
import base64
import json
import websockets
import asyncio

def connect(user, pwd):
    """ Connect to the websocket endpoint.
    """
    return websockets.connect(
        ## Connect to the secure websocket endpoint.
        'wss://brain.deepgram.com/v2/listen/stream',

        ## Add the basic authorization header.
        extra_headers={
            'Authorization' : 'Basic {}'.format(
                base64.b64encode('{}:{}'.format(user, pwd).encode()).decode()
            )
        },
    )

async def run(user, pwd, inbox, outbox):
    """ Streams data to/from the service.

        We will use `user` and `pwd` to authenticate.
        We will read audio data from the `outbox` queue and send it to the server.
        When we receive responses from the server, we will put the latest transcript in the `inbox` queue.
    """

    ## Connect to the streaming endpoint.
    async with connect(user, pwd) as ws:

        ## This example function will upload data.
        async def sender(ws):
            while True:
                audio = await outbox.get()
                if audio is None:
                    break

                await ws.send(audio)

            ## Close the connection cleanly.
            await ws.send(b'')

        ## This example function will handle responses.
        async def receiver(ws):
            async for msg in ws:
                ## Deserialize the JSON message.
                msg = json.loads(msg)

                ## Get the transcript and put it in the queue.
                if 'alternatives' in msg:
                    transcript = msg['alternatives'][0]['transcript']
                    print('Transcript received:', msg)
                    await inbox.put(transcript)
                elif 'channel' in msg and msg['is_final']:
                    transcript = msg['channel']['alternatives'][0]['transcript']
                    print('Transcript received:', msg)
                    await inbox.put(transcript)
                else:
                    print('Metadata received:', msg)

        await asyncio.wait([
            asyncio.ensure_future(sender(ws)),
            asyncio.ensure_future(receiver(ws))
        ])

if __name__ == '__main__':
    ## Create the queues.
    inbox = asyncio.Queue()
    outbox = asyncio.Queue()

    ## Fake some audio data. This example interprets None as the shutdown signal.
    outbox.put_nowait(b'\x00' * 1000)
    ## Alternatively, send raw audio from your machine:
    # with open('/PATH/TO/SOME/audio.wav', 'rb') as f:
    #     while True:
    #         piece = f.read(4096)
    #         if piece == b'':
    #             break
    #         outbox.put_nowait(piece)
    outbox.put_nowait(None)

    ## Run the event loop.
    asyncio.get_event_loop().run_until_complete(run('my_username', 'my_password', inbox, outbox))

    ## Print out the results.
    print('All done. Replaying all the transcript messages...')
    while True:
        try:
            msg = inbox.get_nowait()
        except asyncio.QueueEmpty:
            break

        print('  Received transcript:', msg)
    
This example can be run using Node.js and requires the ws package (install it with npm i ws).
    
// Connect to the streaming endpoint.
var establishConnection = function() {
    console.log("Establishing connection.");

    // Configure the websocket connection.
    // This requires ws installed using 'npm i ws'.
    const WebSocket = require('ws');
    socket = new WebSocket(
        'wss://brain.deepgram.com/v2/listen/stream',
        // Pass your base64-encoded username:password (here: aladdin:opensesame).
        // If the base64-encoded value has padding ('=' signs at the end), you must strip it.
        ['Basic', 'YWxhZGRpbjpvcGVuc2VzYW1l']
    );
    socket.onopen = (m) => {
        console.log("Socket opened!");

        // Grab an audio file.
        const fs = require('fs');
        const contents = fs.readFileSync('/path/to/audio.wav');

        // Send the audio to the brain api all at once (works if audio is relatively short).
        // socket.send(contents);

        // Send the audio to the brain api in chunks of 1000 bytes.
        const chunk_size = 1000;
        for (let i = 0; i < contents.length; i += chunk_size) {
            const slice = contents.slice(i, i + chunk_size);
            socket.send(slice);
        }

        // Send the empty message to close the connection.
        socket.send(new Uint8Array(0));
    };
    socket.onclose = (m) => {
        console.log("Socket closed.");
    };

    socket.onmessage = (m) => {
        m = JSON.parse(m.data);
        // Log the received message.
        console.log(m);

        // Log just the words from the received message.
        if (m.hasOwnProperty('channel')) {
            let words = m.channel.alternatives[0].words;
            console.log(words);
        }
    };
};

var socket = null;
establishConnection();

Note: Code examples are available in Python and JavaScript.

The endpoint exists on the same server as your regular API endpoint, but uses the WebSocket protocol. So if you usually connect to https://brain.deepgram.com/v2 for API access, instead connect to wss://brain.deepgram.com/v2. The wss protocol indicates that standard SSL encryption is used to protect your connection and data. Just as regular batch processing listens at the /listen path, streaming listens at /listen/stream, so the full URI will look like wss://brain.deepgram.com/v2/listen/stream.

You will need to authenticate to access the streaming service. This is accomplished with an HTTP Basic Authentication header containing your username/password, in exactly the same fashion as regular API access. Examples of this are shown above.

Configuration

All configuration/settings for your WebSocket session are specified in the URL as query parameters, just as with the regular /listen endpoint. The streaming endpoint accepts all of the standard arguments that you can pass to /listen, plus streaming-specific options such as interim_results (described under Responses below).

The callback parameter can also be used to redirect streaming responses to a different server. If the callback URL begins with http:// or https://, then POST requests are sent to the callback server for each streaming response. If the callback URL begins with ws:// or wss://, then a WebSocket connection is established with the callback server and WebSocket text messages are sent containing the streaming responses. If a WebSocket callback connection is disconnected at any point, the entire real-time transcription stream is killed; this maintains the strong guarantee of a one-to-one relationship between incoming real-time connections and outgoing WebSocket callback connections.
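
For instance, here is a minimal sketch of building a streaming URL that redirects responses to a WebSocket callback server; the callback URL is a placeholder, and the connection itself is opened exactly as in the streaming example above.

## Sketch: redirect streaming responses to a callback server (placeholder URL).
import urllib.parse

params = urllib.parse.urlencode({
    'callback': 'wss://example.com/transcripts'  ## or an http(s):// URL to receive POSTs
})
streaming_url = 'wss://brain.deepgram.com/v2/listen/stream?' + params
## Open the WebSocket connection to streaming_url as shown in the example above.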

Data

All data is sent to the streaming endpoint as binary-type WebSocket messages, whose payloads are simply the raw audio data. You can stream this to Deepgram in real time, and since the protocol is full-duplex, you will still receive transcription responses while uploading data.

Deepgram's streaming endpoint interprets an empty (length zero) binary message in a special way: it is treated as a shutdown command to the server. The server will finish processing whatever data it still has cached, send the response to the client, send a summary metadata object, and then terminate the WebSocket connection.

Responses

Once Deepgram has started to receive audio data, you will begin receiving transcription responses at frequent intervals. These responses are text-type WebSocket messages containing a JSON payload in the format shown at the start of this section.

These are rolling updates over recent sections of audio. For example, the first response may cover the first 3 seconds of audio. The second response may cover an even longer interval, say the first 5 seconds, overlapping with the first response's three seconds. While subsequent messages have overlapping timespans, the transcript may be corrected, and IS_FINAL will be false. Eventually, the transcript will stabilize (usually after 5-7 seconds), and IS_FINAL will be true. The very next message will no longer overlap with the previous message, and will, for example, cover seconds 7 through 10 of the audio stream. If you only want these "final" messages, set interim_results to false in the connection URL (but note that they will still only arrive every 5-7 seconds).

TRANSCRIPT is the transcript for the currently processing audio segment.

CONFIDENCE is a floating-point confidence value (between 0 and 1) indicating overall transcript quality/reliability (0 is used to indicate that no confidence is available).

CHANNEL indicates which channel this response corresponds to (and is only present if multichannel is not false in the configuration).

Each word in the transcript also has its own WORD, START, END, and WORD_CONFIDENCE entries. START and END are floating-point times, in seconds, at which the word occurred since the beginning of the audio stream.

The client may terminate the WebSocket connection at any time. However, if you want to ensure that everything that has been submitted has also been processed (i.e., "flush" the connection), you can send an empty (zero-length) binary-type WebSocket message. No further audio may be submitted after this point. You will continue to receive text-type responses for any flushing audio, and the very final message you receive will be a ListenMetadata object describing the overall session. This metadata object is exactly the same as the one returned by a regular /listen query.

Authorization

It is incredibly simple to authenticate to Deepgram Brain using HTTP Basic Authentication (RFC 7617). All requests to the API should include an Authorization header whose value is the type "Basic" followed by a Base64-encoded username and password. For example, for user aladdin with password opensesame, aladdin:opensesame base64-encoded is YWxhZGRpbjpvcGVuc2VzYW1l, so aladdin's requests to the Deepgram Brain API should all include a header like:

Authorization: Basic YWxhZGRpbjpvcGVuc2VzYW1l
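
For example, you can compute this header value yourself in Python:

# Compute the Basic auth header value for aladdin:opensesame.
import base64

token = base64.b64encode('aladdin:opensesame'.encode('utf-8')).decode('utf-8')
print('Authorization: Basic ' + token)
# Prints: Authorization: Basic YWxhZGRpbjpvcGVuc2VzYW1l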

API Keys

Don't want to re-use your username and password in your requests? Don't want to share credentials within your team? Want to have separate credentials for your staging and production systems? No problem: you can generate all the API keys you need using Deepgram Brain.

All of the following requests must be authenticated using your username/password.

Create a new key

You may now use NEW_API_KEY and NEW_API_SECRET just as you would your username and password to authenticate requests. Note that this is the only opportunity to retrieve the API secret, so be sure to record it someplace safe!
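
The key-creation request itself is not spelled out in this section. As a rough sketch only, assuming keys are managed at a /v2/keys resource (this path and the request body are assumptions, not documented here), creating a key might look like:

""" Sketch only: the /v2/keys path and the 'label' field are assumptions,
    not part of this documentation.
"""
import base64
import json
import urllib.request

creds = ('USERNAME', 'PASSWORD')
request = urllib.request.Request(
  'https://brain.deepgram.com/v2/keys',  # assumed endpoint path
  method='POST',
  headers={
    'Content-Type': 'application/json',
    'Authorization': 'Basic {}'.format(
      base64.b64encode('{}:{}'.format(*creds).encode('utf-8')).decode('utf-8')
    )
  },
  data=json.dumps({'label': 'staging'}).encode('utf-8')  # assumed request body
)

with urllib.request.urlopen(request) as response:
  # Record the secret from the response immediately; it cannot be retrieved later.
  print(json.loads(response.read()))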

List all keys

This returns the list of keys associated with your account.
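
As a sketch under the same assumption (a /v2/keys resource, which this section does not specify), listing keys might look like:

# Sketch only: the /v2/keys path is an assumption.
import base64
import json
import urllib.request

creds = ('USERNAME', 'PASSWORD')
request = urllib.request.Request(
  'https://brain.deepgram.com/v2/keys',  # assumed endpoint path
  headers={
    'Authorization': 'Basic {}'.format(
      base64.b64encode('{}:{}'.format(*creds).encode('utf-8')).decode('utf-8')
    )
  },
)

with urllib.request.urlopen(request) as response:
  print(json.loads(response.read()))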

Delete a key

This will delete the specified key.
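
And, under the same assumption, deleting a key might look like the sketch below (API_KEY_TO_DELETE is a placeholder):

# Sketch only: the /v2/keys/<key> path is an assumption.
import base64
import urllib.request

creds = ('USERNAME', 'PASSWORD')
request = urllib.request.Request(
  'https://brain.deepgram.com/v2/keys/API_KEY_TO_DELETE',  # assumed endpoint path
  method='DELETE',
  headers={
    'Authorization': 'Basic {}'.format(
      base64.b64encode('{}:{}'.format(*creds).encode('utf-8')).decode('utf-8')
    )
  },
)

urllib.request.urlopen(request)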