
Podcast·Feb 28, 2025

AI Minds #056 | Jordan Dearsley, Founder & CEO at Vapi

Demetrios Brinkmann
Episode Description
In this episode, Jordan Dearsley shares his journey, AI voice tech challenges, YC pivots, and how he’s scaling innovations in speech technology.

About this episode

Jordan Dearsley, Founder & CEO at Vapi. With Vapi, enterprises deploy human-like voice agents in minutes. Whether you're building a voice product or trying to handle millions of calls, Vapi's reliable infrastructure and flexible APIs make it easy.

Listen to the episode on Spotify, Apple Podcasts, Podcast Addict, or Castbox. You can also watch this episode on YouTube.

In this episode of the AI Minds Podcast, Jordan Dearsley shares his journey of building an AI-first company and the evolution of Vapi.

Jordan discusses his experience as a Canadian entrepreneur, his time at Y Combinator, and the multiple pivots his team made—from an investing app to a note-taking tool and AI productivity software—before ultimately burning out and relocating to San Francisco for a fresh start.

Now, with Vapi, he and his team are pioneering AI-driven speech technology, enabling seamless real-time voice interactions through a modular API. He explains the technical challenges of speech-to-speech AI, the importance of speed and reliability, and the innovations that make Vapi's platform adaptable for various applications.

Throughout the discussion, Jordan reflects on the resilience required in the startup world, the complexities of voice AI, and the broader vision for integrating AI-driven speech into everyday experiences. The episode concludes with insights into Vapi’s recent Series A funding and ongoing hiring efforts.

Show Notes:

00:00 Startup Journey: Success and Burnout

04:52 Tech Startup Retreat Experiment

08:54 AI Models Near Human Performance

11:18 AI-Powered Speech-to-Text Innovation

14:48 Building Supermodular, Configurable Platforms

17:44 "Developer Platform for Voice Agents"


Transcript:

Demetrios:

Welcome back, folks. We are doing another AI Minds podcast. And this is a podcast where we explore the companies of tomorrow being built AI-first. I am your host, Demetrios, and this episode is brought to you by Deepgram, the number one speech-to-text and text-to-speech API on the Internet today, trusted by the world's top enterprises, conversational AI leaders, and startups, some of which you may have heard of: Spotify, Twilio, NASA, and even Citibank. In this episode I have the pleasure of being joined by the founder of Vapi, Jordan.

Demetrios:

How you doing today, man?

Jordan Dearsley:

I'm doing good, man. That was a great intro.

Demetrios:

Appreciate that. So I've almost got it memorized by heart.

Jordan Dearsley:

That's great. Well, thanks so much for reaching out and having me on the podcast. I've been very excited and I've heard a couple episodes, so it's nice to be a part of it.

Demetrios:

Well, let's start with a bit of your story. I know you are Canadian. I won't hold it against you in the next 20-minute conversation, but you have been transplanted to the US. You started a company in Canada and then went through YC. Can you talk to me about that company that you started?

Jordan Dearsley:

It actually started as five co-founders, five people working on a school project. We applied to YC, we actually got into YC, and then it was time to drop out of school and figure out what we were going to do with our lives. We pivoted maybe 12 times during the batch.

Demetrios:

What does that look like? What did you go in with and then how did you pivot?

Jordan Dearsley:

We applied with an investing app idea, and then by the time we had the interview, it became a lecture platform for professors. And then it was just months of that. Wow. And then it became a button to join meetings really quickly in your menu bar. And that actually went pretty viral. We got it to like 10k weekly actives. I still run into people who are like, you were Superpowered? I miss it.

Jordan Dearsley:

I'm sorry for shutting it down.

Demetrios:

I was just going to say, I want that.

Jordan Dearsley:

I wish I had it, honestly. But yeah, those were the good days. And then we worked on that for maybe three years, and when ChatGPT came out, it became an AI note-taker. And so we were in the scribe space for a little while, and we used Deepgram for that at the time. And we grew that to maybe half a million in revenue, and then we just kind of burnt out, honestly. We'd done three, four years of calendar apps and productivity tools, not having a clear user and just kind of building for everyone and not a specific person. And so I don't think we ever had any ideas for how to create more value for people than we already were.

Jordan Dearsley:

So it kind of flatlined. Then we decided, let's just move to SF. We'll start from scratch, we'll throw everything out. The team shrunk as well because some didn't want to move. And me and my co-founder Nikhil just went into the abyss. And then we were trying to figure out what we were going to do with our lives moving forward. Nikhil almost ran off to India to go help kids and then... I've done that.

Demetrios:

Yeah.

Jordan Dearsley:

Well, I need him for now, so.

Demetrios:

Yeah, don't let him talk to me. I'll tell him how magical it was. Life changing.

Jordan Dearsley:

He almost went out to Kenya to do crypto for tribes. Haven't done that one, though.

Demetrios:

Yeah, that was fun.

Jordan Dearsley:

It was a little scary, honestly, to have your co-founder...

Demetrios:

So it was like one foot in, one foot out, as you were trying to pivot your way into something that you felt had more interest, more traction. And also, I gotta commend the effort, because what a cutthroat space productivity tools are. It sucks.

Jordan Dearsley:

It.

Demetrios:

Wow. Note-taker apps. Wow.

Jordan Dearsley:

Very competitive. Nobody wants to spend more than $10. And so you're just fighting for all these people who want to pay eight bucks a month. It's really tough. There is a move where we could have gone to enterprise and done that in healthcare or something. I think that would have been successful, but I just don't think we cared enough to keep going.

Demetrios:

So were you just going out and trying to diversify your life experiences and interview people, doing it the YC way, trying to figure out where the pain is, where people's pains are, who's using what?

Jordan Dearsley:

I did that the first time around, but this time it was actually, let's just lock ourselves in an Airbnb in SF for three months and just see what pops out. And we were just throwing a bunch of stuff at the wall. We were studying the Macintosh 1984 videos and stuff: how did they do it? How did they make something that changed the world? I mean, probably the most significant thing was, I think, making a big sacrifice just pushed us to think larger, and we didn't want to work on anything small anymore. And so one day I had a very hard day and was like, I need something to talk to. And at the time I didn't have anything to talk to. There wasn't ChatGPT voice or whatever.

Jordan Dearsley:

Everything was shit. And so I was like, well, what if I just build something? I'll stitch Deepgram and ElevenLabs and a whole bunch of stuff together and see if it sounds good. And of course it sounded like garbage, because it was really slow. The models at the time took like six seconds to reply. They weren't built for real-time streaming. Even then, I think the streaming API was still pretty early. And so we kept trying, and I went on walks for like two hours a day talking to this thing and spent the rest of the day building.

Jordan Dearsley:

And then eventually it got kind of okay. And I was like, we're building an AI therapist company, and we're going to go onto school campuses and hand out little business cards with AI therapist phone numbers, and people are going to call them, and that's our business.
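For readers who want to picture the stack Jordan is describing here, below is a minimal, hypothetical TypeScript sketch of one turn of a cascaded voice loop: speech-to-text, then an LLM, then text-to-speech. The transcribe, complete, and synthesize helpers are placeholder stand-ins for real provider SDK calls, not any specific API.

```typescript
// A hypothetical cascaded voice-agent turn: speech-to-text -> LLM -> text-to-speech.
// The three helpers below are stand-ins for real provider SDK calls, not actual APIs.

type AudioChunk = Uint8Array;

async function transcribe(audio: AudioChunk): Promise<string> {
  // Placeholder: stream the caller's audio to a speech-to-text provider.
  return "hello, can you hear me?";
}

async function complete(history: string[], userText: string): Promise<string> {
  // Placeholder: send the running conversation to an LLM and get its reply.
  return "Loud and clear. What's on your mind?";
}

async function synthesize(text: string): Promise<AudioChunk> {
  // Placeholder: turn the reply text back into audio with a text-to-speech provider.
  return new Uint8Array();
}

// One conversational turn. In the early prototypes Jordan mentions, each stage
// waited for the previous one to finish completely, which is why replies took
// seconds; production systems stream every stage to cut that latency down.
async function handleTurn(history: string[], userAudio: AudioChunk): Promise<AudioChunk> {
  const userText = await transcribe(userAudio);    // 1. speech -> text
  const reply = await complete(history, userText); // 2. text -> text
  history.push(userText, reply);
  return synthesize(reply);                        // 3. text -> speech
}
```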

Demetrios:

That scares me just thinking about it, because of this: did it not scare you at that moment? Like, people who are trying to... there's a spectrum of therapy needs.

Jordan Dearsley:

Right?

Demetrios:

And so if you leave some therapy needs that are pretty severe to an AI that can hallucinate and, as you said, sounds like crap, when maybe you need to be talking to an actual human, that could just get dicey quick.

Jordan Dearsley:

However, Uber was dicey. It means there's something that could be built there that's significant and hard. But I think, again, no particular affinity for AI therapy. And I asked around and was like, does anyone need AI therapy? And there was nobody. There was like one guy who talked to it six times a day, but there wasn't much there. And literally 12 times a day he would ring this thing up and talk for hours. I was like, what is with this guy? He was actually a good friend of mine, so maybe he was doing it to be nice. And eventually, I forget how this happened, but I think I was talking to another startup, they were working on voice stuff, and I was like, man, your voice stack kind of sucks compared to ours.

Jordan Dearsley:

I know, ours sucks too, but ours sucks less. So why don't I just turn this thing into an API and we'll just do a whole speech-in, speech-out thing. You can configure the prompt in the API, and then maybe this will just solve your problem. And it kind of didn't, but it kind of did. And so it was just enough for them to be like, let's just use Vapi. And then it went down like 24/7, et cetera. But at least we were the team making sure it stayed up. Yeah.

Jordan Dearsley:

And it got faster and actually got more natural. Their company name was Hyperbound; they were doing AI roleplay training for salespeople. And so we just worked with them, trying to find where the line stops, where our platform ends and their platform starts, and trying to figure out: if this company needs this, we can't build a sales roleplay training module or platform, it's probably a little early for that, so let's just decide what's generic and then try to find others. And then we found a few others, and the whole thing started from there.

Demetrios:

Wow, what perseverance you had. And I really love hearing the stories about just trying and trying and seeing what sticks. And you get things that do stick, like the one-click meeting button or the therapist, and you do see there are signals there. But it seems like you're okay with having a bit of success and then scrapping it to go for the bigger success that you know is possible.

Jordan Dearsley:

And I think at the time, what was different this time around was these models across the board, they're getting cheaper, they're getting faster, they're getting closer to human performance across the whole transcription, LLM, text-to-speech stack. If this continues, well, theoretically we'll be at human performance in the next year and a half. So it's kind of like, this time around we were like, let's look to a projected future, based on first principles, that this is going to happen. If these models exist, it's probably going to be still really hard to get them turned into voice agents and then get those voice agents to production, actual production, customers talking on the phone or in websites or whatever it is. So why don't we just start building that stuff now, and then hopefully eventually the models are good enough that we can take a larger variety of use cases, not just lower-risk roleplay training. And that is eventually what happened.

Jordan Dearsley:

So I think we also made a bet that Apple would release something like Apple Intelligence or Siri, and that would change the world in terms of voice. I thought it would have happened by now, but it still hasn't. So actually I'm still waiting for that.

Demetrios:

It's a big moment everybody is waiting for. I also think a Siri that can do stuff reliably would be incredible. And it is beyond my understanding why we do not have that.

Jordan Dearsley:

100%. And I think that's the big blocker to voice being adopted as widely as it should be, even with the current state of the models right now: it's just distribution. Consumers just don't have access to these. It's in ChatGPT, like advanced voice mode, but it's not in the room with me right now, like, on my... here.

Demetrios:

Yeah. And especially, the classic one that you see is Jarvis, and we don't talk to our computers, and it is really weird that we don't, because it's almost like we should be able to, at least to launch an app.

Jordan Dearsley:

Yep.

Demetrios:

Or toggle to an app: open, bring me the window that has Slack on it, or show me the window. Yeah, to type even. And I know that we've had somebody on here from Super Whisper before, and Super Whisper, I think... I haven't heard of that.

Jordan Dearsley:

Okay.

Demetrios:

Their whole thing is just: hotkey, start talking, and it will type, and it will use AI to almost clean up your talking while you're speaking, and then type it in. But the one thing that I think about too from your side, and I gotta imagine you have gamed this out in various different scenarios, and I would love to hear what your opinions are: speech-to-speech models feel like they are not something that is going to totally just disappear. If anything, they're probably going to grow, grow, grow. How do you see Vapi in a world where now, as you said, there's more and more progress and speech-to-speech models get good?

Jordan Dearsley:

Yeah.

Demetrios:

Right now they're probably not.

Jordan Dearsley:

They're not yet, but they definitely will be. And so, I mean, up until OpenAI announced it in like May of last year, we were actually training, or starting to train, the speech parts of speech-to-speech models. We had no idea what we were doing. But we were like, this is obviously what we're going to do; we're just building the proxy to it right now with all the other models. Of course that was naive at the time. But what we did decide is, why don't we just make our platform as modular as possible. So modular that you can't even...

Jordan Dearsley:

Not just can you switch out an LLM or switch out a text-to-speech provider or a transcription provider, but what if you could switch out the entire architecture and it all works the same?

Demetrios:

What do you mean by architecture?

Jordan Dearsley:

Like moving from a transcription, LLM, text-to-speech stack to a speech-to-speech stack. You can just toggle that like a switch, right?

Demetrios:

Oh, nice. Yeah.

Jordan Dearsley:

You still have access to all the same tooling, all the same integrations, all the same conversation workflow builders, infrastructure, integrations with telephony, et cetera. So that's how I think about us: we're everything in between the models, regardless of the state of them, and actually talking to customers on the phone, or on websites or whatever, and then everything that comes after that. But I would say primarily over the phone. And so we're kind of the last mile, if that makes sense. Or the second-last mile, because other people build stuff on top of us.
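To make the "swap the whole architecture" idea concrete, here is a hypothetical TypeScript sketch of what such a configuration shape could look like. This is not Vapi's actual API; the field names, providers, and models are illustrative assumptions only.

```typescript
// Hypothetical illustration of swapping the whole architecture: the same agent
// definition can point at either a cascaded pipeline or a speech-to-speech model,
// while tools and telephony settings stay untouched. Not Vapi's real API.

type CascadedPipeline = {
  kind: "cascaded";
  transcriber: { provider: string; model: string };
  llm: { provider: string; model: string };
  voice: { provider: string; voiceId: string };
};

type SpeechToSpeechPipeline = {
  kind: "speech-to-speech";
  model: { provider: string; model: string };
};

type AgentConfig = {
  pipeline: CascadedPipeline | SpeechToSpeechPipeline; // the "toggle"
  tools: { name: string; url: string }[];              // shared regardless of pipeline
  telephony: { phoneNumber: string };                  // shared regardless of pipeline
};

const agent: AgentConfig = {
  pipeline: {
    kind: "cascaded",
    transcriber: { provider: "deepgram", model: "nova-3" },
    llm: { provider: "openai", model: "gpt-4o" },
    voice: { provider: "elevenlabs", voiceId: "example-voice" },
  },
  tools: [{ name: "lookupOrder", url: "https://example.com/tools/lookup-order" }],
  telephony: { phoneNumber: "+15550100000" },
};

// Switching architectures is a one-field change; nothing else about the agent moves.
const speechToSpeechAgent: AgentConfig = {
  ...agent,
  pipeline: { kind: "speech-to-speech", model: { provider: "openai", model: "realtime" } },
};
```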

Demetrios:

It makes a ton of sense. And it is obvious now that you say it, because there are going to be scenarios where I want speech to speech, because I'm okay with a little bit more looseness, we could say. But then there's other scenarios where the person that is building on top of Vapi wants true speed and precision and accuracy and no hallucinations. And there's going to be those constraints that each person has.

Jordan Dearsley:

And there will be this big shift. Like, right now I think traffic on our platform is like 99.9 on the cascaded system, the three-model system, and 0.01 on speech to speech. But over time, I think by the end of this year, and maybe it's February right now, maybe we'll get to like 20, 30% speech to speech, I would imagine. But the rest will still be cascaded. It's mostly because of the speech-to-speech models: one, they're still not at the point of reliability that we need, and they're not at the point of configuration that we need either. So, for example, I can't use a custom voice with a realtime API yet.

Jordan Dearsley:

They'll come out with it piece by piece. But usually there's one thing that sticks out that is missing or that needs to be customized for this specific person's use case. Like, I need postal codes to be picked up in this particular format, or just some weird shit like that, that makes it so you have to use a cascaded system. And there are probably four or five things in a real deployment that are custom like that. And so that's the way we approach things: let's just build a supermodular platform, let's expose as much config as possible, and then let them do whatever they want with it. And then we try to do the hard work of mapping that config to these models as soon as it becomes available.
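As a small, hedged illustration of the kind of per-deployment customization Jordan is pointing at, here is a hypothetical TypeScript hook that normalizes a spoken Canadian postal code in a transcript fragment before the LLM sees it. The function name and formatting rules are assumptions made for the example, not part of any real platform.

```typescript
// Hypothetical post-processing hook applied to the bit of transcript the agent
// expects to contain a postal code, e.g. "m five v one j one" -> "M5V 1J1".

const SPOKEN_DIGITS: Record<string, string> = {
  zero: "0", one: "1", two: "2", three: "3", four: "4",
  five: "5", six: "6", seven: "7", eight: "8", nine: "9",
};

function normalizePostalCode(transcriptFragment: string): string {
  // Map spoken digits to numerals and uppercase the letters.
  const tokens = transcriptFragment
    .toLowerCase()
    .split(/\s+/)
    .map((t) => SPOKEN_DIGITS[t] ?? t.toUpperCase());

  // Canadian postal codes alternate letter/digit: A1A 1A1.
  const joined = tokens.join("");
  const match = joined.match(/([A-Z]\d[A-Z])(\d[A-Z]\d)/);
  return match ? `${match[1]} ${match[2]}` : transcriptFragment;
}

// normalizePostalCode("m five v one j one") === "M5V 1J1"
```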

Demetrios:

Have you seen a lot of people using these open source text to speech models? Because they are really good.

Jordan Dearsley:

Like the.

Demetrios:

What is it, Kokoro? Or... I can't remember the name now.

Jordan Dearsley:

I've seen a few of them. We try to be GPU-less ourselves, just because managing the infrastructure, as you guys know, is a lot, and it's much lighter weight if we don't require GPUs. So we don't run any open source models ourselves. We don't see a ton of pull for open source models either, because honestly the ElevenLabs, et cetera, they're best in class, and people want best in class on the phone.

Jordan Dearsley:

They don't want... exactly. I would say the minimum bar is the providers whose job it is, and who are paid, to run them and train them.

Demetrios:

Okay, so you gave me a bit of a breakdown on what Vapi is. There's the models and the whole voice part in the middle, but there's so much else that's around it. Can you explain that a little bit more?

Jordan Dearsley:

The short of it is, once you have all these models, you need to orchestrate them so they all kind of sound and feel human: interruptions, making sure the kid in the back of the car isn't fucking with the transcription model, all those little things. Once you have it talking like a person, it's like, now I need to integrate this agent with my data. And so hooking up these tool-calling endpoints, et cetera, we make that super easy in the platform, handling the various states of tool calls when they come in: if it fails, if it takes too long, et cetera. That all needs to be considered as well, so we have configs for that as well.

Jordan Dearsley:

On top of that, it's like, okay, I have an agent, it talks, it works with my data, now I need to make sure it does what it's supposed to do every time. And so that's where we talk about workflows, or determinism: helping you build step-by-step flows, so it's not just a prompt freestyling and hallucinating, but it's get the first name, now use the tool call. On top of all that, once you have it working every time, you then need to scale that with infrastructure to tens of thousands of calls. That's really hard.

Jordan Dearsley:

And the infrastructure for real-time audio is quite exotic, as I'm sure you guys know, especially for long-running phone calls that need to have sub-500-millisecond latency.
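As a rough sketch of the tool-call state handling and deterministic, step-by-step flows Jordan describes above, here is a hypothetical TypeScript example: a tool call with a timeout, and a scripted step that says something sensible in every outcome. The endpoint URL, helper names, and timings are illustrative assumptions, not Vapi's API.

```typescript
// Hypothetical sketch: a tool call with explicit failure/timeout states, used
// inside a deterministic step ("get the first name, now use the tool call").

type ToolResult =
  | { status: "ok"; data: unknown }
  | { status: "failed" }
  | { status: "timed_out" };

async function callToolWithTimeout(
  url: string,
  payload: unknown,
  timeoutMs: number
): Promise<ToolResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    if (!res.ok) return { status: "failed" };
    return { status: "ok", data: await res.json() };
  } catch (err) {
    // An abort means we hit the timeout; anything else is a plain failure.
    return (err as Error).name === "AbortError"
      ? { status: "timed_out" }
      : { status: "failed" };
  } finally {
    clearTimeout(timer);
  }
}

// A scripted step: collect the first name, look the caller up, and keep talking
// sensibly no matter which state the tool call lands in.
async function lookupStep(firstName: string, say: (text: string) => Promise<void>) {
  await say("One moment while I pull up your account.");
  const result = await callToolWithTimeout(
    "https://example.com/tools/lookup-customer", // hypothetical endpoint
    { firstName },
    3000
  );
  if (result.status === "ok") {
    await say(`Thanks, ${firstName}, I found your account.`);
  } else if (result.status === "timed_out") {
    await say("Sorry, that's taking longer than expected. Bear with me.");
  } else {
    await say("I couldn't pull that up just now, but I can still help.");
  }
}
```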

Demetrios:

Oh man.

Jordan Dearsley:

And then beyond that it's like, how do we actually integrate with telephony infrastructure and everything to make it go live? Once it's live, it's how do we actually observe these calls in production, and then make sure that we learn from our mistakes, build tests out of the failed calls, and then improve the calls over time. So that's the whole loop of it. But we're a platform mostly built for developers and engineering teams, API-native, so it makes it so people can build whatever they want on top of us, just like you guys. And right now we are, I believe, the leading platform for developers to build voice agents specifically. We're looking forward to this year and what speech to speech is going to bring to the platform.

Demetrios:

Awesome. And you're based in San Francisco. Are you guys hiring?

Jordan Dearsley:

Yes, we are. We are hiring every role across the stack. We are hiring aggressively. So please come join us. We just raised our Series A. We're a team of like 24. You can get in early, after product-market fit. The growth is insane.

Jordan Dearsley:

We just need help managing it. So please come join us.