
The story behind Deepgram Saga: Work at the Speed of Voice - Sharon Yeh, Product Manager at Deepgram | AI Minds #071

Deepgram is a Voice AI platform for enterprise use cases – speech-to-text, text-to-speech, and speech-to-speech APIs for developers. Control your workflow with just your voice: tell Saga what you want done and see it happen in real time, powered by MCP.
Listen to the episode on Spotify, Apple Podcasts, Podcast Addict, or Castbox. You can also watch this episode on YouTube.
In this episode of the AI Minds Podcast, Demetrios is joined by Sharon Yeh, Lead Product Manager at Deepgram, to explore Saga, a voice-first AI tool that reimagines how developers interact with machines.
Sharon reflects on her early journey with Dover, where she helped automate recruiting workflows using AI—reducing manual sourcing and outreach through smart, ML-driven systems.
She then shares how those learnings translated into her work at Deepgram, where an internal dictation tool evolved into Saga: a voice-native OS and full MCP client for developers.
The conversation dives deep into the Model Context Protocol (MCP), and how prompt engineering is key to bridging the gap between natural speech and agent understanding.
Sharon highlights the challenges of aligning LLMs with human intuition—comparing today's AI to an intern on day one—and how Saga is designed to close that comprehension gap.
Listeners will come away with insights on building intuitive voice interfaces, building on MCP, and imagining a future where speaking replaces typing as the default way to work with machines.
Show Notes:
00:00 Automated Candidate Sourcing System
03:58 Dictation AI: Future of Speech Recognition
06:53 Evolving Communication Practices
10:46 Prompting Challenges in AI Communication
16:09 Prompt Optimization Insights
17:15 Optimize MCP Requests for LLM
Demetrios:
Welcome back to the AI Minds Podcast, a podcast where we explore the companies of tomorrow being built AI-first. I'm your host, Demetrios. And this episode, like every episode, is brought to you by Deepgram, the number one text-to-speech and speech-to-text API on the Internet today, trusted by the world's top enterprises, conversational AI leaders, and startups, some of which you may have heard of, like Spotify, Twilio, NASA, and Citibank. This episode we are joined by the lead product manager of Saga, Sharon. How you doing today?
Sharon Yeh:
Good, good. Thanks for having me. Really excited to chat more.
Demetrios:
So I know that you have been doing some stuff with AI companies for a while now. Can you give us the breakdown on how you first got into artificial intelligence companies?
Sharon Yeh:
I think around four years ago, maybe four and a half years ago, I joined a pretty small startup called Dover, which was trying to disrupt the recruiting space with AI and an automated sourcing type of product. So that was my first introduction into, I would say, all things AI, honestly. And while I was there, we used a lot of machine learning back in the day, before AI really reached its heyday. Then as AI became more of a thing, we incorporated a lot of generative AI, and we actually built a product using Deepgram at the time, which was our recruiting interview note-taker type of thing. That's how I learned about Deepgram. And then, fast forward a few years later, here I am at Deepgram, building Saga now.
Demetrios:
So I imagine there's a lot of stuff you can do in the product area in the AI space for recruiting. What were you focusing on?
Sharon Yeh:
So our product was mostly focused on automated sourcing. Sourcing is basically finding the candidates that you are interested in and then emailing them. So you can imagine this saves sourcers and recruiters hundreds of hours of time, because without AI, you have to sit there, figure out the criteria you want, go on LinkedIn, and then have a person reviewing all the candidates, reading their profiles, figuring out which ones fit. And then you have to do the whole emailing part: drafting a nice email, making it personalized, and sending that email. So what Dover did initially is we automated all of that. You basically inputted your criteria, we would pull the LinkedIn profile data that exists publicly on the Internet, and then we would find the candidates that match what you're looking for in a more intelligent way.
Sharon Yeh:
So even if someone listed software engineer as their role title, we would try to use descriptions and keywords to figure out: is this person a backend engineer or a frontend engineer? We'd look at their GitHub profiles, that sort of thing. So we did that, and then automated the emailing part as well. We generated personalized emails based off of their profile data, and that way companies were able to send emails to hundreds of candidates per week versus having to hire recruiters or sourcers who would spend hours doing that type of work.
Demetrios:
Wow. Incredible. So then you jumped in at Deepgram, and were you thrown right off the diving board into the deep end on Saga, or what did you sink your teeth into first?
Sharon Yeh:
So when I first joined, the product that the team had been building was actually a dictation AI tool. It was called Shortcut at the time, and it was meant to showcase Deepgram's obviously amazing voice AI technology in a way that people could use day to day. The team had a brainstorming workshop, and people at Deepgram mentioned how much they use dictation and how there's a lot of room for improvement in that space. If you think about when you dictate using your phone, you get that generic experience of saying, "Hi, period. Please do this for me, exclamation point." That's dictation of the past. So we wanted to build a product that was dictation of the future.
Sharon Yeh:
You don't have to say that type of punctuation. And even if you say ums and filler words, it cleans it up for you automatically and rewrites it in whatever style you want. So that was our original product and original tool, and we launched it in, I would say, December of last year as a product in itself. We also added the voice assistant as an additional feature when we launched it. And then it was kind of like, what should we do now? What should we build next? How's the reception? Things like that.
Sharon Yeh:
So at that time, vibe coding and MCP were becoming a huge thing, something people were really starting to talk about. And so we were like, how can we capitalize on those trends?
Demetrios:
It probably became very obvious that, wait a minute, MCP plus voice, it feels like there's a good mix here to do something.
Sharon Yeh:
Yes, a hundred percent. It kind of clicked for us, because when I first joined and was talking to the team, I think everybody was pretty aligned on our vision for the future: we want to disrupt how people interact with their computers. At the end of the day, if you think about it, typing, clicking, all of those behaviors are learned behaviors that we adapted to. When you're talking to people, you would rather use your voice. When you're thinking out loud, you're using your voice. So we always wanted the future to be: you just talk to your computer and get things done. That seems like the natural way to work.
Sharon Yeh:
But MCP made it all very possible and very easy right now. Even though it's still experimental, it's super cool to see what you can already do with it. And we decided to build the future now with Saga.
Demetrios:
It's so funny that you're talking about all these things, and as you're mentioning them, I'm remembering my dad, for example, who would talk into a voice recorder. It wasn't a phone or anything; it was a specific voice recorder with a tape. And then he would go and play that back to himself and type it up, or he would have somebody else type it up for him. And then the other thing that I think is interesting here is how you were talking about typing as a learned skill. I remember back when computers were getting popular, I used to play games to see how fast I could type. And you think about all that, and then you think: this is so unnatural for us. How can we make it more natural? I really like that you're asking, what's the next step? What is the natural way of doing it? Obviously, we as a society really want this, because we make movies about it all the time.
Demetrios:
When we think about the future, there's no way that we think about typing away on a keyboard. So now's probably a good moment to talk about what exactly Saga is.
Sharon Yeh:
So Saga, our tagline is Voice OS for developers. We wanted to brand ourselves as a voice OS because we wanted to be future oriented. So from a product perspective, what does that mean? Right now we are a full MCP client, which means you can hook up multiple MCP servers, connect them to Saga, and execute those actions via voice. I think we're one of the first voice-native MCP clients out there. So if you want to live out your Jarvis dreams, your Iron Man dreams, you can do that with Saga. And we also preserved our original dictation AI feature as well.
Sharon Yeh:
So you're able to dictate whatever you want and rewrite it in any style. We also have a vibe coding prompt, called the cursor prompt, that you can use with AI coding assistants like Cursor, Replit, or Windsurf to get better prompts and generate higher-quality code.
Demetrios:
I saw that. Can you explain how that works a little bit more?
Sharon Yeh:
So with vibe coding as a concept, a lot of people started using dictation AI tools to basically say what they want to say to the Cursor agent and not need to touch the keyboard. So there are people who are already thinking the same way we're thinking, which is: I don't want to type anymore, I type slower than I speak. So we're taking that a level further, I guess. Not only can you dictate whatever you want into the chat with your Cursor agent, you can also select the cursor prompt, and it will rewrite something vague into a very elaborate prompt. So, for example, if I just say, "I want to build a voice AI app," the prompt will rewrite it to say, "Build a voice AI app that includes the following features," and it'll list them out as a detailed, elaborate prompt.
Sharon Yeh:
And what that does is the Cursor agent will then take that information and produce higher-quality code, and think of things that you might not have thought about when you were saying something very vague. So it really helps you build an app or a product feature, whatever you want, a lot faster, so you don't have to keep going back and forth with the agent.
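To make the cursor prompt idea concrete, here is a minimal sketch of a prompt-expansion step like the one Sharon describes. The rewrite instructions and function name below are illustrative assumptions, not Saga's actual implementation.

```python
# Illustrative sketch only: Saga's real "cursor prompt" is not public, so the
# rewrite instructions below are an assumption about how such a step could work.

EXPANSION_INSTRUCTIONS = (
    "Rewrite the user's rough, dictated request into a detailed prompt for an "
    "AI coding assistant. Spell out likely features, tech choices, and edge "
    "cases the user did not mention, as a numbered list of requirements."
)

def build_expansion_prompt(spoken_request: str) -> str:
    """Wrap a vague dictated request in rewrite instructions for an LLM."""
    return f"{EXPANSION_INSTRUCTIONS}\n\nUser request: {spoken_request}"

if __name__ == "__main__":
    # A vague request like the one in Sharon's example becomes the input to
    # whatever LLM does the rewriting; the expanded prompt is what the coding
    # agent ultimately receives.
    print(build_expansion_prompt("I want to build a voice AI app"))
```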
Demetrios:
It fleshes out those ideas for you. I love that. Now, what are some things along the way while you were building this that surprised you, or some gotchas that you didn't expect?
Sharon Yeh:
So I think with MCP, a lot of things are very cool when they do work. But when people joke about the hardest thing about AI being prompting, I feel like that is actually very key in most of the features that we have in Saga. There are certain ways that you need to phrase things for LLMs to understand what you're saying and understand the request that you're putting in. The thing that I find really interesting with the start of MCP, and people really using MCP, is that it's an agent-to-agent interaction; your end user is almost an LLM. So when you're building an MCP server, you need to make sure that the LLM understands all the tools that you're listing out. When you're building an MCP client, you have to make sure your LLM is able to understand what people are requesting. And that bridge, I feel like, is actually the thing that could really 10x MCP in the future, because we're not used to speaking in the way that a ChatGPT or a Claude would really fully understand immediately.
Sharon Yeh:
You know what I mean?
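Here is a library-free sketch of the point Sharon is making: an MCP server advertises a list of tools, and the literal names and descriptions are all the client's LLM has to go on when deciding what to call. The tool names below are hypothetical, not Saga's or any real server's.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str         # short, literal name the client's LLM will try to match
    description: str  # written for an LLM reader, not a human teammate

# Hypothetical tools; a real MCP server would expose these over the protocol.
TOOLS = [
    Tool("find_email", "Search the user's inbox and return matching emails."),
    Tool("draft_reply", "Draft a reply to a given email using provided context."),
]

def list_tools() -> list[dict]:
    """Roughly what a client's LLM sees when it lists the server's tools."""
    return [{"name": t.name, "description": t.description} for t in TOOLS]

if __name__ == "__main__":
    for tool in list_tools():
        print(f"{tool['name']}: {tool['description']}")
```

The clearer those names and descriptions read to an LLM, the smaller the comprehension gap between what a person says and what the agent does.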
Demetrios:
It's funny you mention that, because a few friends of mine built a Jira agent, and it worked wonderfully when they were hacking on it. They built it for a hackathon and plugged it into Jira, and it was a very clean Jira they were working with; they had some test sprints and it was amazing. They thought, oh my God, we need to put this into production right now. And they plugged it into their real Jira and nothing worked. Because if you think about how you work with Jira, there's the least amount of context ever that you put on all of the different stories and sprints and epics and whatever it may be, because you know that another human is going to be reading it and they have all of that context already, because you're probably meeting with them various times per week and syncing with them when they get stuck or you get stuck, et cetera. So when you're entering information, you don't need to give the whole story, but because of that, the agent just failed wonderfully.
Sharon Yeh:
Yeah.
Demetrios:
And so the flip side of that is: okay, what if we are only trying to give information to agents, as opposed to only trying to give information to humans?
Sharon Yeh:
It's been honestly very fascinating. I've never put myself in the shoes of an LLM before. But when I'm testing out MCP things, sometimes I'm like, how come it's not working? And then I realize I didn't give it the context, so I have to first say, find this, and then link those together. As humans, we tend to assume that the other person has context, or we think maybe five steps ahead. With where we're at with LLMs right now, and I'm sure it's going to change at a rapid pace, at least for now we've got to take a step back and be very prescriptive with everything we're saying.
Demetrios:
That's such a great point. And it is so funny to think about how things potentially could change a lot. You can't give these fuzzy explanations or fuzzy ways of doing things, especially when it comes to one or two words, because potentially everything's fine and the LLM is understanding it great, but then you say a word that is a little too fuzzy for it to understand and it doesn't know how to interpret it. And it's really hard to prompt the agent to ask you for more information; if it thinks it knows, it's just going to go and do it and then come back with the wrong answer. So trying to make sure that it comes back and gets the maximum amount of context is also something that I've found very difficult.
Demetrios:
I don't know if you encountered that while building.
Sharon Yeh:
No, like a hundred percent. It happens all the time. My favorite analogy to use right now is that an AI assistant is kind of like an intern in the first week of their internship. You're kind of like, here's how you do everything, do this exactly how I want it to be done. Sometimes it still goes off and does something weird, and you're like, okay, let's get right back on track.
Sharon Yeh:
And I feel like the ideal, ultimate vision is that you want an intern in the last month of their internship. You want someone who already knows exactly what to do, so you can say something vague and they're able to take that and execute it perfectly, in the exact way that you want it done. I think that's the bridge that needs to be built or crossed right now.
Demetrios:
But you want that mind meld. It would be so nice if it could be that easy, where they fully understand you even if you are being vague. Another thing that I was going to ask is: did you find any surprises when it comes to the latent space? You were mentioning how sometimes you really need to be specific about the prompts and do some hand-holding. Were there words or key phrases that surprised you in that regard, where you were like, I got much better accuracy or task completion when I just used this one word? I remember at the beginning, when we were all exploring LLMs, it was like, "think through this carefully," and those almost chain-of-thought, think-step-by-step prompts became very popular to use. Now it's almost an inherent trait that we have in LLMs, especially the reasoning models. But maybe you found a few prompt tricks while playing around.
Sharon Yeh:
I would say I've mostly been playing around in the context of the MCP space in general. I think the main thing is, if you are using MCP servers, the easiest thing is to make your request sound the most similar to the tools that you're trying to execute. Which is maybe not an advanced prompting thing, but if the tool is named find email, try to use that in your request. That makes it the easiest for the LLM to understand what's going on.
Demetrios:
But I will say, exploring...
Sharon Yeh:
No, "exploring"... "search" maybe will work. But if you're very specific, it usually executes the right thing. I will say, with MCP, what's cool is you can chain multiple things. So I've been doing that a lot. I'll say, find this email and then draft a response based off of the context, that sort of thing. And then it can do multiple actions with just one sentence that you say. So that's been working, I would say, most of the time, obviously.
Sharon Yeh:
I think the step-by-step thing is still a thing. With MCP actions, you kind of have to say, do this, then this, then this, and then it's like, okay, got it, and it will do the three things that you said. So my advice, I guess, is to say something as close to the tools as possible and make sure to break things down into steps.
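A toy illustration of those two tips: keep the request close to the tool names, and break chained actions into explicit steps. The keyword matching here is deliberately naive and is not how an MCP client's LLM actually routes requests; the tool names are the same hypothetical ones used in the earlier sketch.

```python
# Hypothetical tool names, reused from the sketch above.
TOOL_NAMES = ["find_email", "draft_reply"]

def plan_steps(spoken_request: str) -> list[str]:
    """Split a chained request on 'and then' and match each step to a tool name."""
    steps = [s.strip() for s in spoken_request.lower().split("and then")]
    plan = []
    for step in steps:
        # "Find the email from Alex" maps to find_email because the wording
        # stays close to the tool's literal name.
        matches = [t for t in TOOL_NAMES if any(w in step for w in t.split("_"))]
        plan.append(matches[0] if matches else f"unmatched: {step}")
    return plan

if __name__ == "__main__":
    print(plan_steps("Find the email from Alex and then draft a reply"))
    # -> ['find_email', 'draft_reply']
```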
Demetrios:
Incredible. I think that's it. Is there anything else that you wanted to hit on that we didn't hit on already?
Sharon Yeh:
No, I would just say, if you're someone who's interested in MCPs or has used MCPs before, Saga is a great way to level that up, if you want to use your voice and not touch your keyboard anymore.
Hosted by

Demetrios Brinkmann
Host, AI Minds
Demetrios founded the largest community dealing with productionizing AI and ML models.
In April 2020, he fell into leading the MLOps community (more than 75k ML practitioners come together to learn and share experiences), which aims to bring clarity around the operational side of Machine Learning and AI. Since diving into the ML/AI world, he has become fascinated by Voice AI agents and is exploring the technical challenges that come with creating them.