Building a live transcription badge with Deepgram
TRANSCRIPT
Hello there. I hope you're having a wonderful day. My name is Kevin Lewis, and I'm a developer advocate here at Deepgram. Now, a couple of weeks ago, I built a small project and published it on Twitter thinking a couple of my friends would like it, and it turns out a whole bunch of you were super interested in this live transcription and translation badge that I built. So of course, that made me want to build more and more features into it. And this video is where I'm gonna run through what the parts are, how the software works, and how you can build your own and use it in your day-to-day. If you have any questions at all, please feel free to reach out to us. We love helping you build cool projects with voice. And with that, let's get started.
These are all the parts you'll need for this project. At the heart of it is the Raspberry Pi. This is a fully featured desktop computer, similar to the larger desktops you may be familiar with, except it's tiny, and it's quite inexpensive too. This specific model is the Raspberry Pi Zero 2 W. It's quite important that you get the W model because it has WiFi and Bluetooth on board, and as this project will require an internet connection, it's preferable to have that built into the board. The "Zero" denotes the size of this board, which is really light and small. And because this is a wearable device, this is the one I would recommend. This is a fully featured desktop computer, but it has no onboard storage, so you will also need a micro SD card. This micro SD card has Raspberry Pi OS version 10, Buster, on it. At the time of recording at least, because of the screen's driver compatibility, Buster is the version I need, even though there is a newer version of Raspberry Pi OS available. That just pops in there like so.
The next part of this project is the HyperPixel 4. This is a four-inch touchscreen by Pimoroni. It's really nice, again quite light, and big enough that you have enough of a touch surface. And what you'll notice here are these holes on my Raspberry Pi. You have these pins here. Not every Raspberry Pi comes with the pins pre-installed; it may come with them as a separate piece that you have to solder. I'm not good at soldering, so I bought a Raspberry Pi with pre-soldered header pins. But what you do is you just marry these up and, while being mindful of the screen and not wanting to put too much pressure on it, you just kind of push it down like so until it doesn't go any further. And that's now a fully featured computer with an operating system and a touchscreen.
Now, just as a note, Pimoroni have an amazing setup guide for this screen, but you will have to install some drivers. And in order to do that, you're gonna need to plug the Raspberry Pi into a more traditional screen, like a TV or a monitor, for that first-time setup. But once you've installed the drivers, it's basically plug and play. To plug it into a more traditional screen, you'll use this mini HDMI port here, and ideally you'll get a mini HDMI to HDMI cable or adapter. This also does require power, so I have my battery pack here and a cable for that battery pack. I do also have a smaller one that's sized like a credit card, about the size of this device, but while developing I want a bigger battery, so I've been using this one. The final part, really, is a microphone. This is a little lapel mic, because the Raspberry Pi does not have a microphone on board. I'll link it in the description, and then I have a little USB-C to micro USB adapter on it, because this mic is actually a USB-C mic. So you would put power into one of the micro USB ports, the mic into the other, and then use ideally a wireless keyboard (though you could use a wired keyboard, I guess) in order to type into the device. But for most things, the touchscreen is just fine.
So now we're gonna get on to talk about the software that runs on this device. The application that actually runs on the screen on the Raspberry Pi is actually a web application running in a full-screen browser. And that gives us the advantage of being able to develop for it, test it, and run it on any device. So here I am on my desktop computer, and I'm gonna show you how the application works. Then, for developers out there, I'm going to talk you through the code and how it's put together. And then, when we regroup after the run-through, I will tell you how to deploy your own version of this project. So here it is in this emulator. So these are the kind of dimensions of the screen. We have this badge mode here, which is meant to just be static information that you can wear on you. Then we have transcribe mode. There's two variants here. The first one is wearer-only transcription; this one, even if it detects a second voice, will only show you the first. Group transcription is a little harder to demonstrate with just me here, but if a second voice was detected, that would be in a different color; a third voice would be in a different color as well, and so on. Then we have the translation mode. You pick a target language from this list, and it will go and transcribe your voice, then translate it and show you the translated version of your voice. There we go. And then, yes, we have the badge mode.
So now I'm gonna talk you through how this is put together. Before I do that, I wanna talk to you about the third-party services that we're gonna use for this project. The first is Deepgram. Deepgram is a speech recognition API that can return fast and accurate transcripts in real time, and I'll show you how that works in a moment. Then we use iTranslate for the translation API. They have a whole set of supported languages, which is where the list on the badge comes from, and reasonable documentation that makes it clear how you can go ahead and make API calls. So what you need to do ahead of time is get a Deepgram API key. You can sign up for a Deepgram account, and you get quite a lot of free credit. You create a new API key which has admin rights and doesn't expire, and you'll need that key. You'll also need your project ID here.
I forgot to say this when I was originally recording, but this next big chunk of the video is going to be talking through the code that runs this software. So if you're not a software developer and you're just interested in getting this project up and running on your own device, then skip to the timestamp shown on screen.
So let's talk about this Glitch app here. It's a Node.js application on the back end and a Vue.js application on the front end. And, you know, there were several ways I could have put together this video; we could have built it together, but I actually think the easiest way to do this is to just talk you through the finished code and then provide you with the code in the description so you can go and take a further look. But of course, if you have any questions, you can reach out to us on Twitter. You can reach out to us via email, and we are more than happy to help out and clarify any further questions. So the first thing I wanna talk to you about is the back-end application, and this is the entire thing. This is an Express.js web application, where we have required and initialized Express, Deepgram, and the Axios HTTP library. And all this exists to do is two things. The first is to generate brand-new Deepgram API keys that have minimal permissions and that only work for ten seconds. That's enough time to initially connect with Deepgram, and if someone gets hold of the key, it's useless after ten seconds. The second thing that this server-side application does is translate phrases. You can make an HTTP request to it, specifying the text and the target language, and it will return the translated text. And that's all this exists to do.
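To make that concrete, here's a rough sketch of what that server file can look like. The iTranslate endpoint and payload shape, the environment variable names, and the exact Deepgram SDK signature here are assumptions based on the versions around at the time of recording, so treat this as a sketch and check the code linked in the description for the real thing.

```js
// server.js — minimal sketch, not the canonical project code
const express = require("express");
const axios = require("axios");
const { Deepgram } = require("@deepgram/sdk");

const app = express();
app.use(express.json());
app.use(express.static("public"));

const deepgram = new Deepgram(process.env.DEEPGRAM_API_KEY);

// 1. Mint a short-lived, minimal-permission Deepgram key for the browser
app.get("/deepgram-token", async (req, res) => {
  const newKey = await deepgram.keys.create(
    process.env.DEEPGRAM_PROJECT_ID,
    "Short-lived browser key",
    ["usage:write"], // minimal scope
    { timeToLive: 10 } // useless after ten seconds
  );
  res.json(newKey);
});

// 2. Translate a phrase via iTranslate (endpoint and payload assumed)
app.post("/translate", async (req, res) => {
  const { text, language } = req.body;
  const { data } = await axios.post(
    "https://dev-api.itranslate.com/translation/v2/",
    { source: { dialect: "en", text }, target: { dialect: language } },
    { headers: { Authorization: `Bearer ${process.env.ITRANSLATE_API_KEY}` } }
  );
  res.json({ text: data.target.text });
});

app.listen(process.env.PORT || 3000);
```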
Most of the heavy lifting is done on the front end. As I mentioned, the front end is a Vue.js application, and this is how it works. At a high level, we have this quite short web page which has two parts: a main and an aside. Now, to show you how that translates: this is the aside here, and this is the main here, this kind of darker piece. The aside is used just for navigation. Now, what I want you to note here is that in the URL it says mode=transcribe. When you click translate (because it was less code), it basically just refreshes the page with a new query parameter of translate. And if I hit badge, it'll do the same with badge. And that's important because, based on that URL query parameter, we will display a different section: either transcribe, translate, or badge.
Okay, let's talk you through the Vue.js code here. Again, it's not terribly long, but we will take our time and work through it all. When the application is started, we do two things. The first thing we do is set the mode based on that URL query parameter. So if it is provided, we will go ahead and set settings.mode to whatever it was; and if it's missing, if you just go to the URL without the mode, we'll just default you to transcribe. The next kind of method that matters here is navigateTo. Without going line by line, what this will do is replace that query parameter, or add it if it doesn't exist, and then refresh the page with that new query parameter. So that's basically a very low-rent router for this project. The other thing I wanna show you in this first section is getting the user's mic. This is supported in most browsers: you can ask the user for access to their microphone (you've probably seen those prompts before), and then it creates a new MediaRecorder, which in turn lets us get raw data from the mic. In some browsers this isn't supported; at the time of recording, Safari doesn't support it without toggling it on, which you can't, you know, depend on users to do. So if it isn't supported, we just pop up a message to the user that it isn't supported in their browser.
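Here's a rough sketch of that startup logic, using the standard getUserMedia and MediaRecorder browser APIs; the exact names are illustrative rather than lifted from the project.

```js
// Sketch of the startup logic: mode from the URL, a low-rent router, and mic access
const params = new URLSearchParams(window.location.search);

// Default to transcribe if no ?mode= is provided
const settings = { mode: params.get("mode") || "transcribe" };

// Replace (or add) the query parameter, then refresh the page with it
function navigateTo(mode) {
  params.set("mode", mode);
  window.location.search = params.toString();
}

// Ask for the mic and wrap it in a MediaRecorder so we can get raw audio chunks
async function getMicrophone() {
  if (!window.MediaRecorder) {
    alert("Sorry, your browser doesn't support the audio features we need.");
    return;
  }
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  return new MediaRecorder(stream);
}
```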
Next, we're gonna talk about how transcription works. To start off, I want to show you the transcribe section of the HTML. Straight away when that section loads, we present the user with two buttons: wearer-only transcription or group transcription. When you click those, it runs the beginTranscription method with a different argument, either single or group. And in turn, based on that, the results will be shown in this div or this div, and there is a slight difference: for example, we need to add some indicator of who the speaker is so we can style it. So when we click the button, the beginTranscription method begins, and beginTranscription does probably the most heavy lifting in this application. The first thing we'll do is just store the type of transcription so we can change which part of the HTML is rendering. We go and get that brand-new Deepgram token from our server side and extract the key from it. We'll talk about this line in a moment, but then what we do is establish a WebSocket connection with Deepgram using our key. Every quarter of a second, we make data available from our mic, and when that happens, we send the data to Deepgram. In turn, when data comes back, we hand it off to another method, which we'll talk about in a moment. This line of code here actually does something quite simple, but I didn't initially realize I needed to do it. In fact, let me give you a demonstration here. As more and more words are said, obviously the amount of text displaying on the page will be taller than the page itself, and by default it won't automatically scroll. It will stay at the very top of the page, and you can't see the new words being said. So all that line of code does is constantly scroll to the bottom of the page, and as the page gets longer, we move with it. It does it every hundredth of a second.
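Put together, beginTranscription can look something like this sketch. The WebSocket subprotocol authentication and the dataavailable pattern are Deepgram's standard browser approach; the diarize query parameter (which tells Deepgram to distinguish speakers) and the variable names are my own additions here.

```js
// Sketch of beginTranscription(type), where type is 'single' or 'group'
async function beginTranscription(type) {
  settings.transcriptionType = type; // controls which div renders the results

  // Get a fresh ten-second key from our own server
  const { key } = await fetch("/deepgram-token").then((r) => r.json());

  // Keep the newest words in view: scroll to the bottom every hundredth of a second
  setInterval(() => window.scrollTo(0, document.body.scrollHeight), 10);

  // Open the live transcription WebSocket; diarize=true labels each word's speaker
  const socket = new WebSocket(
    "wss://api.deepgram.com/v1/listen?diarize=true",
    ["token", key]
  );

  const microphone = await getMicrophone();
  socket.onopen = () => {
    // Every quarter of a second, hand the latest mic audio to Deepgram
    microphone.addEventListener("dataavailable", (e) => socket.send(e.data));
    microphone.start(250);
  };

  // When results come back, hand them off to the method covered next
  socket.onmessage = (msg) => transcriptionResults(JSON.parse(msg.data));
}
```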
So now let's talk about what happens when the transcription comes back from Deepgram. What we do is send data into the phrases.pending array. And you may ask yourself: what is pending? What is this is_final? And what is this phrases.final? Why are we talking about final? Well, Deepgram, when in live transcription mode, will actually send data back to us quite rapidly with an interpretation of the words that were said, and it will continuously do that for any given phrase until such a time as it becomes confident in its interpretation, at which point it will move on to the next phrase. We want to show data to users as quickly as we can, but while a phrase is still pending, before it's marked as final, we do still want to keep updating it. And this allows us to navigate that kind of data that is returned. So we have our pending data, and if it's final, we push it into the final array, meaning that phrase is no longer going to be altered by Deepgram, and we empty out the pending array. Additionally, if we are in translate mode, this is the point where we will also go off to iTranslate and begin translating, but we'll talk about translation in just a moment.
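As a sketch, the handler can look like this. The response shape (channel.alternatives[0].words and the is_final flag) is what Deepgram's live endpoint returns; the phrases object and translatePhrase are illustrative names.

```js
// Sketch of transcriptionResults(data), where data is one live response from Deepgram
const phrases = { pending: [], final: [] };

function transcriptionResults(data) {
  // Each word object includes .word and, with diarize on, a .speaker number
  const words = data.channel.alternatives[0].words;

  // Show Deepgram's latest interpretation straight away
  phrases.pending = words;

  if (data.is_final) {
    // This phrase won't be altered again: lock it in and reset pending
    phrases.final.push(words);
    phrases.pending = [];
  }

  // In translate mode, also send this phrase off to iTranslate (covered next)
  if (settings.mode === "translate") {
    translatePhrase(words.map((w) => w.word).join(" "), data.is_final);
  }
}
```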
The other things I just want to draw your attention to here are a couple of computed properties. The group transcript just adds together the final and the pending arrays. The single transcript only returns words spoken by the initial speaker. So if I'm speaking with someone else, Deepgram will pick up their words and return them too, and what this computed property does is say, hey, if this isn't the first speaker, don't bother returning it. Right, let's talk about translation then. So here, whenever data comes back from Deepgram, if we're in translate mode, which is denoted by translate here in the URL, we also go off and translate the phrase with iTranslate. iTranslate supports just a string, so we take the array that comes back from Deepgram and turn it into a space-separated string, and we also indicate whether or not this is the final utterance. So here we have translatePhrase. We're going off to our server-side translate route handler, we're specifying the language that is chosen when a user clicks a button (more about that in a moment), and when data comes back, we push it into the array and, in turn, display it to users.
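Here's roughly what those two computed properties and translatePhrase can look like; again, the names are illustrative, and the speaker numbering assumes Deepgram's diarization, which labels the first voice it hears as speaker 0.

```js
// Sketch of the transcript computed properties and the translation call
const computed = {
  // Everything said so far: all locked-in phrases plus the one still in flux
  groupTranscript() {
    return [...phrases.final.flat(), ...phrases.pending];
  },
  // Only words attributed to the first voice Deepgram heard
  singleTranscript() {
    return this.groupTranscript().filter((word) => word.speaker === 0);
  },
};

const translated = { pending: "", final: [] };

async function translatePhrase(text, isFinal) {
  // Our own /translate route proxies the iTranslate call server-side
  const result = await fetch("/translate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, language: settings.languageCode }),
  }).then((r) => r.json());

  // Pending translations keep being replaced; final ones are locked in
  if (isFinal) {
    translated.final.push(result.text);
    translated.pending = "";
  } else {
    translated.pending = result.text;
  }
}
```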
So let's talk a little bit about language selection then. The first thing I wanna show you is this languages.js file that I created. iTranslate is wonderful; they have this lovely long list of languages they support, and you specify the language you want by providing a short two- or five-character code. But nowhere do they give you a structured document that pairs a human-readable label, like Bosnian, with its two- or five-character code. So I've done that for my application. I did it manually, and you're welcome to take it away as well. Here we have the codes that we need to provide when we make an API call, and the label that a user would understand and want to click on. I also manually added this RTL property to Hebrew. That means right-to-left: Hebrew is a language read right to left, and we need to factor that in when we display it to users. Over in our index.html, or rather over in our script.js (just as a note in case you missed it), we're loading in the languages here, so we can refer to them with Vue directives.
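The structure of that file looks something like this excerpt; the codes shown are real iTranslate-style dialect codes, but see the linked code for the full hand-assembled list.

```js
// languages.js — a short excerpt; the real file lists every supported language
const languages = [
  { label: "Bosnian", code: "bs" },
  { label: "German", code: "de" },
  { label: "Hebrew", code: "he", rtl: true }, // flagged so the UI renders right-to-left
  { label: "Portuguese (Brazil)", code: "pt-BR" }, // an example five-character code
  // ...and so on, one entry per language iTranslate supports
];
```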
Now, over in the translate mode, we put a different button in for every language in that array. We show the label of that language, and when you click the button, we begin the translation with the code: the code that the user doesn't need to see or care about, but that we need as developers. You'll also see here that we have this direction styling being applied with textDirection; we'll talk about that again in just a moment. So we begin translation. Beginning translation is setting the code in a place where we can access it later, and then it's just beginning normal transcription, because that's the first step, right? We need to get the transcription; once it's returned, we then want to go and translate it. So then we go through this whole process of, you know, beginning transcription and handling transcription results, except this time translatePhrase will be called as well, and then we'll go ahead and translate the phrase like normal. Final things to note here are these computed properties. We have translatedTranscript, which adds together all of the final phrases and the pending phrase. And then, finally, there is this textDirection computed property. If the language in the languages array has an RTL value, then we go ahead and set the return value to RTL; otherwise it will be LTR, left to right, which is actually the default. So if there were other languages supported that read right to left, we could just add that flag in languages.js, and in the UI it would be automatically applied. The final mode, of course, is badge. This one's really brief: these are just hard-coded values. You can be more fancy if you want and source these values from elsewhere, but I've just hard-coded them directly in here.
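In sketch form, the translation-mode pieces described above can look like this, with illustrative names throughout:

```js
// Sketch of beginTranslation and the translation-mode computed properties
function beginTranslation(code) {
  settings.languageCode = code; // stash the code so translatePhrase can use it later
  beginTranscription("single"); // translation starts with an ordinary transcription
}

const translateComputed = {
  // All finished translations plus the one still being revised
  translatedTranscript() {
    return [...translated.final, translated.pending].join(" ");
  },
  // 'rtl' for languages flagged in languages.js; 'ltr' is the browser default
  textDirection() {
    const current = languages.find((l) => l.code === settings.languageCode);
    return current && current.rtl ? "rtl" : "ltr";
  },
};
```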
So that's a summary of the application. Once again, it's a Node.js back end with a Vue.js front end. We're using Deepgram for transcription and iTranslate for translation. When we're translating, we first send our voice data to Deepgram, then we take that returned text and send it off to iTranslate. Right, now it's time to talk about how you deploy this project for yourself. If you are a developer and you want to host this somewhere other than Glitch, please do refer to the code link in the description; you can take the code and deploy it wherever you wish. But by default, we're gonna deploy on Glitch, because it's free and it's easy to take the code from where it is and deploy it for ourselves. So you want to go to the description, find the Glitch remix URL, and click it. Once your project is remixed, you will want to go to your .env file and update your Deepgram API key. Once again, you can get that from the Deepgram console: create a new key with admin permissions and set it to never expire. You'll also need your Deepgram project ID and your iTranslate API key. So you will want to put those values in this .env file. Then you will want to come to public/index.html and update the values on lines 38 through 40. I may well add to this project, but it will be around this line count. And just to clarify what this looks like so you know what you're updating: it's the badge mode, and you can go ahead and update those values there, inside of these angle brackets. Then you want to hit preview, open preview in a new window, and I want you to take note of that URL. It will be different for you; once you've remixed, you'll get a new URL. The other thing I want you to bear in mind is that if you sign up for a Glitch account, this will be hosted for free. If you do not sign up for a Glitch account, this will also be hosted for free, but only for a limited period of time, and then it will be destroyed; you won't be able to access it on that URL anymore. So this may be the opportunity to sign up for a Glitch account, again for free, and it will host your application for you. But I want you to take note of this URL. Once you've done all of those steps, jump back to your Pi and join me in the next section, where I will show you how to set it up.
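For reference, the .env file you're filling in will look something like this; the exact variable names depend on the remixed project (these match the server sketch earlier), so mirror whatever names the .env already contains.

```
# .env — never commit or share these values
DEEPGRAM_API_KEY=your-deepgram-api-key
DEEPGRAM_PROJECT_ID=your-deepgram-project-id
ITRANSLATE_API_KEY=your-itranslate-api-key
```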
So you can launch the browser on the Raspberry Pi and type in your full Glitch URL. Don't forget the HTTPS at the beginning of your URL, not just HTTP. You'll see a prompt to allow access to the microphone, which you can confirm using the touchscreen, and then you can go ahead and use the application. Now, you'll notice, of course, there's all this extra stuff up the top. You can go in here and hit the full-screen button, and now we've just got our live transcription badge working, not a problem. Now, again, that's a few extra steps, so what I'm gonna encourage you to do is look in the description, where there is a link to a blog post published on our blog on how to automatically launch Chromium in kiosk mode on the Raspberry Pi. So I encourage you to go and run through that setup.
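If you just want a quick manual version before setting up the autostart from that blog post, Chromium has a kiosk flag you can use from a terminal on the Pi; the URL here is a placeholder for your own remixed project.

```
chromium-browser --kiosk https://your-project.glitch.me/?mode=transcribe
```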
And yeah, that is basically it for this project. So now you have a fully working badge that will transcribe your voice or a group's voices, or translate, or, of course, there is the badge mode here. Hopefully you found that interesting, and hopefully you can go away and put this together for yourself. There aren't that many parts; once again, there's the Raspberry Pi with the SD card, the screen, a power bank, and a mic with an adapter so it actually plugs in. All of that will be in the description. The way I attach this to my body is actually by wearing thick denim dungarees, and it just kind of pops into the bib there, though I'm in the process of having a case designed for this device. If you have any questions, do feel free to reach out. And hopefully you can be walking around, being live transcribed for those who need it.