Deepgram is a leader in high-performance, scalable, and accurate speech understanding. We design AI models for performance, deploy them on a massively-optimized engine, and accelerate on state-of-the-art hardware (like GPUs). We build automatic speech recognition (ASR; also called “speech-to-text”), natural language processing (NLP), and natural language understanding (NLU) products.
So it probably comes as a huge surprise to learn that Deepgram’s neural speech engine—the part of our platform that drives the inference process at runtime and services customer requests—was built in Rust.
Well, it wasn’t always this way. In fact, this is the fourth iteration of our speech engine. The first three iterations were built in Python: you know, the lingua franca of data science and machine learning.
With each iteration, we learned a lot more about what we really needed to build, always treating our latest iteration as an MVP that paved the way to a brighter future. Our first iteration was, to be honest, a throwaway we used to learn what we actually needed. Our second iteration served real production traffic for a year. By the third iteration (which we never actually deployed to production), we had a pretty good system running. So why did we make the switch?
Because computers are complicated, and your choice of programming language influences how you must manage that complexity. Low-level languages like assembly, C, and C++ give you exquisite control over that complexity, often at the cost of productivity. High-level languages like Python enable incredible levels of productivity, at the cost of dealing with a higher level of abstraction around the underlying operating system and hardware. That doesn’t mean you can’t be productive in C, nor that you can’t make low-level calls in Python; it simply means that there are tradeoffs, and your language of choice provides a backdrop that influences how you prioritize those tradeoffs.
What kinds of tradeoffs do we face at Deepgram when designing a machine learning inference platform for speech?
First, we have memory allocations. Speech data requires a lot of memory. Think about it: without compression, an hour of stereo audio at “CD quality” (44.1 kHz sample rate) is 605 MiB. If users are streaming audio data to you, you are making many smaller allocations, too. And on top of it all, if you want to handle many parallel requests, you have that many more allocations to make! And depending on your programming language, you may allocate more memory due to unnecessary copies of data being made. Worse, allocating is a notoriously slow process, relatively speaking, sometimes requiring special allocators to avoid the frequent overhead.
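To see where the 605 MiB figure comes from, here is a small sketch of the arithmetic. It assumes 16-bit (2-byte) samples, which is what "CD quality" normally implies but which the text above does not state explicitly; the function names are made up for illustration.

```rust
/// Bytes needed for uncompressed PCM audio.
/// `bytes_per_sample = 2` corresponds to 16-bit samples (an assumption here).
fn pcm_bytes(sample_rate_hz: u64, channels: u64, bytes_per_sample: u64, seconds: u64) -> u64 {
    sample_rate_hz * channels * bytes_per_sample * seconds
}

/// Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

Calling `pcm_bytes(44_100, 2, 2, 3_600)` gives 635,040,000 bytes for one hour of stereo audio, and `to_mib` turns that into roughly 605.6 MiB.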
Electronic Arts actually maintains an entirely separate implementation of the C++ standard template library in order to be more efficient. If you are working in a garbage-collected language (e.g., Java, Python, Go), you will also spend additional CPU cycles on freeing memory. Whatever programming language you choose, you’re faced with these difficult issues, but some languages allow more control.
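One common way to exercise that control is to reuse a buffer instead of allocating a fresh one per chunk. The sketch below is purely illustrative (it is not Deepgram's engine code, and `process_chunks` is a hypothetical name): it copies each incoming chunk into a single scratch `Vec` whose capacity survives between iterations.

```rust
/// Copies each incoming chunk into one reusable scratch buffer and returns
/// the sum of all samples (a stand-in for real per-chunk work) together with
/// the final capacity of the scratch buffer.
fn process_chunks(chunks: &[Vec<i16>]) -> (i64, usize) {
    let mut scratch: Vec<i16> = Vec::new();
    let mut total = 0i64;
    for chunk in chunks {
        // `clear` resets the length to zero but keeps the allocated capacity.
        scratch.clear();
        // Reallocates only when a chunk is larger than the current capacity.
        scratch.extend_from_slice(chunk);
        total += scratch.iter().map(|&s| i64::from(s)).sum::<i64>();
    }
    (total, scratch.capacity())
}
```

After the largest chunk has been seen once, later iterations do not need to touch the allocator at all.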
Another problem is bottlenecking. Every program is going to be bound by some resource constraint, typically CPU, GPU (or other accelerated compute, if applicable), network, or disk.
Knowing your constraints can provide valuable insight into how to best design and architect your solution; too many problems these days are solved by horizontally scaling. Don’t get me wrong: horizontal scaling is awesome. But even horizontally-scaled solutions have bottlenecks, and if you ignore those bottlenecks during the design phase, then at best you are driving costs up, and at worst you are scaling a solution that won’t horizontally scale!
At Deepgram, we know that network and disk aren’t our bottlenecks—we had already built several iterations, after all, that proved it. That leaves the CPU and GPU. And here’s the funny insight: you want the GPU to be the bottleneck. Why? Because if the GPU is not the bottleneck, then it is not fully utilized; and if it is not fully utilized, you are wasting a valuable (and costly) piece of equipment.
But here’s the kicker: GPUs are really, really fast. So in order to be GPU-bound, you need an incredibly fast and efficient implementation, one whose CPU-bound work finishes faster than the GPU finishes its own, so that new work is always ready to keep the GPU busy. In terms of programming language, this means that you need to select a language that produces a very efficient application.
The final major consideration that drives a choice of programming language is parallelism: multi-threading and multi-processing.
If you try to keep a GPU busy doing speech recognition using a single thread, you’ll be sorely disappointed to discover that your CPU probably can’t keep up with the GPU: you won’t be able to create work fast enough to remain GPU-bound. So you need to create multiple threads at best, or fork multiple processes at worst.
Either way, complexity quickly increases the moment you invoke parallelism or concurrency: it is very easy to create data races or deadlocks if you don’t think about your design and your concurrency primitives. And the moment you throw GPU processing into the mix (which can be driven synchronously or asynchronously), things are that much more complex.
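As a rough illustration of the multi-threaded shape described above, here is a minimal sketch using only the Rust standard library: several producer threads do CPU-side preparation and push work into a bounded queue, while a single consumer stands in for the accelerator. The `Batch` type, the batch size of 16, and `run_pipeline` are all hypothetical, and nothing here talks to a real GPU.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical unit of prepared work, e.g. features ready for the accelerator.
struct Batch {
    samples: Vec<f32>,
}

/// Spawns `workers` producer threads and consumes everything they send.
/// Returns (number of batches consumed, total samples consumed).
fn run_pipeline(workers: usize, batches_per_worker: usize) -> (usize, usize) {
    // Bounded queue: producers block when the consumer falls behind.
    let (tx, rx) = sync_channel::<Batch>(workers * 2);

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                for i in 0..batches_per_worker {
                    // Stand-in for CPU-side preprocessing.
                    let samples = vec![i as f32; 16];
                    tx.send(Batch { samples }).expect("consumer hung up");
                }
            })
        })
        .collect();
    // Drop the original sender so the receive loop ends once all producers finish.
    drop(tx);

    let mut batches = 0;
    let mut samples = 0;
    for batch in rx {
        // Stand-in for submitting the batch to the accelerator.
        batches += 1;
        samples += batch.samples.len();
    }
    for handle in handles {
        handle.join().expect("producer panicked");
    }
    (batches, samples)
}
```

With four producers and ten batches each, `run_pipeline(4, 10)` consumes 40 batches. The bounded queue is a deliberate choice in this sketch: it applies back-pressure so producers cannot run arbitrarily far ahead of the consumer.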
Oh, and there’s a curveball: GPUs are weird. It isn’t uncommon for them to drop off the PCIe bus entirely. Just gone. In the middle of being used. Usually, a system reboot is the only thing that will recover the device. And this is a reality you need to design around, because one of two things usually happens when the GPU vanishes: the thread/process crashes (that’s the “good” outcome) or the thread/process blocks indefinitely. That’s…pretty bad.
Your software needs to detect this condition and be prepared to work around it: recover state, requeue the job, etc. And if your programming language doesn’t have a really good concurrency model with good concurrency primitives, it gets really hard to juggle all of this.
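One way to detect both failure modes (a crash or an indefinite block) is to run the device call on its own thread and wait for the result with a deadline. The sketch below shows that general pattern with the standard library only; `Outcome` and `run_with_deadline` are hypothetical names, the closure stands in for a real device call, and this is not a description of how Deepgram's engine actually does it.

```rust
use std::sync::mpsc::{channel, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// What the scheduler should do with a job after attempting it.
#[derive(Debug, PartialEq)]
enum Outcome {
    Done(u64),
    Requeue(&'static str),
}

/// Runs `device_call` on its own thread and waits at most `deadline` for it.
fn run_with_deadline<F>(device_call: F, deadline: Duration) -> Outcome
where
    F: FnOnce() -> u64 + Send + 'static,
{
    let (tx, rx) = channel();
    thread::spawn(move || {
        // If the caller has already given up, the send fails and the result is dropped.
        let _ = tx.send(device_call());
    });
    match rx.recv_timeout(deadline) {
        Ok(value) => Outcome::Done(value),
        // The call is still blocked: treat the device as gone.
        Err(RecvTimeoutError::Timeout) => Outcome::Requeue("device call timed out"),
        // The worker thread died (for example, it panicked) before sending a result.
        Err(RecvTimeoutError::Disconnected) => Outcome::Requeue("device thread crashed"),
    }
}
```

Note that a thread that is truly stuck is simply abandoned in this sketch; a real system would also need to mark the device as unhealthy and route work elsewhere.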
What About Python?
We built three iterations of our speech engine in Python. We tried really, really hard to make it work. Here’s what we learned.
First, Python was memory hungry. It would have limited request concurrency on a given machine, but it wasn’t an immediate showstopper out of the gate. The garbage collector didn’t create huge slowdowns in any noticeable way. So we probably could have used Python and written off the memory inefficiency as a cost to horizontal scaling, doubling our costs if we had to. But this would be a shame, since a platform that is too expensive is ultimately one that cannot be used to scale.
Second, Python tended to bottleneck around the CPU. This is bad. This means we are wasting GPU time. It also means that the turn-around time (TAT) that customers measure for our requests will be notably higher than it needs to be. But at Deepgram, speed is part of our DNA. We pride ourselves on absurdly low latencies (compare us to other providers: we are usually several orders of magnitude faster). And it isn’t just a bragging point: it is necessary to build our vision of powering the voice market of the future. Optimizations did lead to noticeable performance gains, but it became more burdensome to continue these optimizations as new GPU hardware and software were released: the “performance gained per hour of effort” metric wasn’t high enough to keep up with hardware gains unless we spent significantly more team time on optimization than we wanted. So at the end of the day, Python’s performance made it difficult to keep the GPU busy.
Our final consideration was around parallelism. Python is notorious for its global interpreter lock (GIL), which makes native Python code effectively single-threaded. Obviously, this places a huge damper on the ability to scale across CPU cores. Now, there are workarounds for the GIL, which usually involve writing lower-level code that sidesteps the single-threaded limitation. Some of those packages already exist, but not all of them do. And if we start writing code to handle parallelism in a language other than Python, then we suddenly lose the very benefits that we chose Python for in the first place! The alternative to Python multithreading that lets us continue to write code in Python is multiprocessing, wherein we fork copies of our application. But multiprocessing has its own huge set of considerations: do you pay the cost of IPC, do you use shared memory, etc., all of which just add complexity to the code without solving the business problem. Again, Python fell short here.
What About Developers?
Aside from the limitations we ran into with Python, there is another very serious consideration: the cognitive state of the developer. Think about all the challenges we face, like those listed above: streamlining memory, always keeping the GPU busy, juggling parallelism, recovering from GPU errors, etc. Those are facts of life, things we need to deal with. But they aren’t the things we want to deal with. We want to solve hard problems! We want to make an incredible dev experience! We want to build the scalable future of speech understanding! And herein lies the constant tension in our developer mindset: we want to focus on business logic, but we have to focus on state-keeping and upholding the program’s invariants. At one point, we realized that 80% of a developer’s cognitive burden was dealing with keeping state, and only 20% was actually solving business logic.
This is actually a well-known problem. Anyone who’s spent any time thinking deeply about implementing solutions to technical problems (or who spends time around such people) realizes that there’s a different mind state you need to get into. Paul Graham called these people “makers,” and pointed out that they need to manage time very differently than managers. The reason is that context switches are incredibly expensive because of all the state-keeping that needs to happen in the programmer’s head. As the complexity of the problem increases, so does the amount of state a developer needs to keep in working memory in order to implement a solution. There’s a related problem, too: the more invariants that need to be juggled in order to solve a problem, the more likely it is that mistakes will be made. This is why Gerard Holzmann rolled out his famous “ten rules”: to minimize the amount of state necessary and reduce cognitive load, allowing developers to focus on the correctness of the business logic.
Solving scale problems—like building a scalable speech understanding platform at Deepgram—is hard. Derailing a programmer’s train of thought is dangerously disruptive. And when your developers are allocating 80% of their cognitive bandwidth to addressing state problems rather than business problems, it’s time to rethink your tools.
Okay. So at this point, we were pretty convinced that Python wasn’t going to be the solution of the future. “Just one more” Python iteration wouldn’t cut it. Our developers were wasting far too much time solving problems that weren’t business critical. So what were the options?
Well, given the memory pressures we wanted to scale past, it felt like we needed to go closer to a systems language than Python. That also made sense given our latency targets and just how lean we needed to be on CPU utilization. C and C++ immediately come to mind as tried-and-true industry standards. And as draconian as those languages can be at times, they probably would have solved the memory and GPU-bound problems. But they didn’t really solve the problem of cognitive burden brought on by complexities like parallelism—you still needed to solve those “manually” and keep track of what is safe to do and when it is safe to do it.
Go-ing the wrong way
Go was the popular kid on the block: supported by Google, a dead-simple syntax, and a robust concurrency model. We felt confident that we could hire great Go developers, or even train developers to use Go, given how simple the syntax is. And for a compiled language, boy, oh boy, was Go’s compiler speedy!
The appeal of Go was so strong, so “obvious,” that we actually started implementing parts of our platform in Go. We treated it like an experiment: could we express our ideas in Go and avoid the pitfalls we ran into with Python? The experiment didn’t last long. The language was too simple: not having generics made us repeat ourselves a lot (nowadays, thankfully, Go has generics). The error handling was sophomoric. There were no sum types.
The idea of implicitly implementing an interface seemed…yucky: how long would it be before I accidentally implemented an interface and used it without meaning to? And the “empty interface” seemed everywhere: downcasting incurs a runtime hit, and we were trying to avoid unnecessary runtime costs to keep CPU usage below GPU usage; worse, the compiler can’t help prove that your downcasts are correct, leading eventually to runtime errors. Having a runtime, including the garbage collector, also seemed like an anti-pattern for what we wanted to do, where having tight control over memory was important.
Let me take a moment to point out that I am not saying that Go is always the wrong choice. There are undeniably many successful products and companies using Go. It may be the right answer for you. But it was not the right choice for our high-performance computing speech recognition product. It was the wrong choice for Deepgram.
So, we were left with Rust, the new, quiet kid on the block. Rust had been in development by a small community since 2010. By 2015, it finally had its 1.0 release. It was still a relatively immature language, without an asynchronous programming model and with lots of little issues in its borrow checker, but it satisfied us on several points where Go fell short. We had good generics, a simple and sane interface concept, no runtime, and explicit error handling without exceptions. It was a systems-level language like C and would enable us to tightly define our control flow. It was also fast, typically compiling down to binaries as runtime-efficient as C, which would help with our CPU bottlenecking problems. The community seemed deeply concerned with ensuring good runtime performance and a minimal memory footprint; after all, Rust (like C) can be used in embedded use cases where resources are heavily constrained.
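To make those points concrete, here is a small sketch of what sum types, explicitly implemented interfaces (traits), generics, and `Result`-based error handling look like in Rust. Every name in it (`AudioError`, `Validate`, `Clip`, and the list of accepted sample rates) is invented for illustration and does not come from Deepgram's codebase.

```rust
/// A sum type: every way validation can fail is listed in one place.
#[derive(Debug, PartialEq)]
enum AudioError {
    Empty,
    UnsupportedSampleRate(u32),
}

/// An interface. A type only implements it via an explicit `impl Validate for ...`.
trait Validate {
    fn validate(&self) -> Result<(), AudioError>;
}

struct Clip {
    sample_rate_hz: u32,
    samples: Vec<i16>,
}

impl Validate for Clip {
    fn validate(&self) -> Result<(), AudioError> {
        if self.samples.is_empty() {
            return Err(AudioError::Empty);
        }
        match self.sample_rate_hz {
            // An arbitrary illustrative set of accepted rates.
            8_000 | 16_000 | 44_100 | 48_000 => Ok(()),
            other => Err(AudioError::UnsupportedSampleRate(other)),
        }
    }
}

/// Generic over any `Validate` type; resolved at compile time, no downcasting.
fn count_valid<T: Validate>(items: &[T]) -> usize {
    items.iter().filter(|item| item.validate().is_ok()).count()
}

/// `?` hands any error straight back to the caller; there are no exceptions.
fn duration_secs(clip: &Clip) -> Result<f64, AudioError> {
    clip.validate()?;
    Ok(clip.samples.len() as f64 / f64::from(clip.sample_rate_hz))
}
```

A caller that matches on `AudioError` must handle every variant, or the compiler rejects the `match` as non-exhaustive.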
This alone wasn’t enough to sell us on Rust: everything we listed so far could be had from C or C++. But Rust went one step further. Rust promised a compiler with an incredibly strict static analyzer in it: the borrow checker. The borrow checker is the part of the compiler that enforces Rust’s most important invariant, ownership: any value may have either any number of immutable references or at most one mutable reference, and references must always be valid. This invariant leads to some rich emergent corollaries. For example, it prohibits the billion-dollar mistake of null pointers, since all references must be valid. And since two mutable references cannot simultaneously exist, you instantly avoid data races in multithreaded programming.
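Two of those corollaries can be shown in a few lines. In the sketch below (function names are hypothetical), a missing value is an explicit `Option` rather than a null pointer, and state shared across threads has to be wrapped in `Arc<Mutex<_>>` before the compiler will accept it.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Absence is an explicit `Option`; callers must handle `None` before
/// they can touch the sample.
fn first_loud_sample(samples: &[i16], threshold: i16) -> Option<&i16> {
    samples.iter().find(|&&s| s >= threshold)
}

/// Increments one shared counter from several threads.
/// Handing each thread a plain `&mut usize` instead would be rejected at
/// compile time, so the shared state is wrapped in `Arc<Mutex<_>>`.
fn count_in_parallel(threads: usize, increments_per_thread: usize) -> usize {
    let counter = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..increments_per_thread {
                    *counter.lock().expect("mutex poisoned") += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("thread panicked");
    }
    let total = *counter.lock().expect("mutex poisoned");
    total
}
```

Because every increment happens under the lock, `count_in_parallel(8, 1_000)` always returns 8,000, with no lost updates.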
And most importantly, since the borrow checker is constantly upholding these Rust invariants across your program, no matter how complex it gets, the compiler suddenly becomes a friend and tool that’s helping you develop applications. It’s catching so many situations where you might have slipped up. Or put another way, it is relieving much of the cognitive burden facing developers as they juggle state-keeping. That means that your developers can now spend only 20% (or less) of their time dealing with state-keeping and checking invariants and the vast majority of their time solving business problems. That’s a huge human performance gain!
This doesn’t magically make programming easy, though. On the contrary, Rust has one of the steepest learning curves of any language I’ve encountered. As a first-time Rust developer, you’ll trip up constantly, you’ll get frustrated, you’ll fight with the compiler, you’ll give up, you’ll come back. It’s hard. It forces you to reason through and design your software differently. But here’s the important reality: it is making you a better programmer. You’ll approach software design in other languages completely differently after programming in Rust. It’s a long road to proficiency, but the payouts are huge.
The other problem with Rust is that the job market is so much smaller than for other languages. This has been improving over time, but when Deepgram switched to Rust back in 2017, there were very few experienced Rust programmers looking for jobs. That meant that we needed to train our in-house developers in using Rust and to convince new hires that they’d enjoy Rust. Programming languages have never been a barrier for experienced engineers, yet because Rust’s learning curve is so steep, we would experience a medium-term drop in productivity in the hope that the future gains would pay off. Hopefully in six months, we’d have a huge and sustained productivity spike driven by a language that not only solves our fundamental problems, but helps unload the cognitive burden of state-keeping.
Just like with Go, we decided to start implementing our platform in Rust as an experiment. It was going to be hard. We’d have to write a lot of the underlying tensor framework ourselves, since these libraries didn’t really exist back then. We didn’t have the benefit of Python’s rich package repository (nowadays, Rust’s package manager, Cargo, gives access to a great set of packages for most use cases through the crates.io registry). What would happen?
This was the Wild West again where you couldn’t just search StackOverflow to understand an error message or how to work around the borrow checker—the Rust community was just too small! We were on our own. We almost abandoned the experiment at one point, as we struggled with depressed productivity due to the difficulties in mastering Rust. We tried to go back to Go. That lasted about a week before we reminded ourselves that Go was definitely not going to work. But that’s how desperate we were. But after about a month of developing in Rust, the Rust mindset was starting to make sense. We could see the Matrix. We kept going.
It took about six months to complete. By the end, we had definitely incurred technical debt: there was lots to fix up in the coming years. But the results were amazing. Our memory footprint was non-existent compared to the Python implementation. CPU utilization was way down while still keeping the GPU constantly working. There were almost no bugs related to concurrency issues. We didn’t run into segfaults (we’d all seen our share of segfaults coming out of C programs in the day).
But best of all, it was a product success. Our Python implementation was already heavily optimized for high-performance workloads. But switching to Rust gave us a 30% to 80% performance gain, depending on the workload: lower latencies, higher throughputs, better request concurrency. We transitioned our workloads over to our Rust implementation as quickly as possible and never looked back.
Several years later, Discord announced that they were switching from Go to Rust. They were running into issues with performance, specifically related to the Go runtime and garbage collection. When we saw it, it was the nail in the coffin for Go: we were confident that we had made the right decision.
At Deepgram, we run a high-performance compute platform to power our speech engine. This means holding tight reins over control flow to ensure that you can keep accelerated hardware (like GPUs) busy at all times, while still being able to serve many requests in parallel. This isn’t easy. There is a lot that developers need to retain in memory to make this happen without bugs cropping up everywhere. When we switched to Rust, we saw cognitive burdens on developers drop significantly as they were able to spend more time focusing on business problems and simply rely on the compiler to uphold programming invariants. And to top it off, we made serious performance gains over our existing Python implementation.
Today, engineering uses Rust for everything. We even have some frontends written in Rust (compiled to WASM). We hire Rust developers; we hire and train non-Rust developers. Productivity remains high, as we had hoped. And our product experience has never been better.
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.