
AI Minds #073 | Brooke Hopkins, Founder at Coval

Brooke Hopkins, Founder at Coval. Coval accelerates AI agent development with automated testing for chat, voice, and other objective-oriented systems.
Many engineering teams are racing to market with AI agents, but slow manual testing processes are holding them back. Teams currently play whack-a-mole just to discover that fixing one issue introduces another.
At Coval, they use automated simulation and evaluation techniques inspired by the autonomous vehicle industry to boost test coverage, speed up development, and validate consistent performance.
Listen to the episode on Spotify, Apple Podcasts, Podcast Addict, or Castbox. You can also watch this episode on YouTube.
In this episode of the AI Minds Podcast, Demetrios welcomes Brooke Hopkins, founder of Coval, to explore how simulation is reshaping the development of AI voice agents.
Brooke shares her journey from leading simulation infrastructure at Waymo to founding Coval, where she’s applying lessons from autonomous vehicles to conversational AI.
They dive into the challenges of building realistic, real-time synthetic environments to test AI agents—addressing non-determinism, edge cases, and probabilistic behaviors.
The conversation contrasts real-time and cascading model architectures, unpacking how simulation enables better decision-making, personalization, and agent reliability.
Brooke highlights how Coval’s simulation agents test voice systems at scale, helping teams identify failures, monitor performance, and iterate with precision.
She also discusses Coval’s benchmarks, which guide enterprises through selecting and evaluating speech models—covering latency, consistency, price, and naturalness.
The episode emphasizes the importance of treating evaluations (evals) as a core product component, integrated into CI/CD workflows and supported by human-in-the-loop reviews.
Listeners will gain a clear view into building resilient, high-quality AI agents and how simulation-driven evals are critical to scaling voice AI in the enterprise.
Show Notes:
00:00 Simulation Infrastructure for AI Challenges
05:54 Challenges in Customer Service Simulations
08:59 Challenges in Voice Activity Detection
11:24 Simplifying TTS for Technical Founders
16:25 Challenges in Voice Model Controllability
20:14 Effective Testing: Focus on Scenarios
21:57 Effective Software Testing Strategies
24:44 Cross-Functional Product Simulation Trend
27:55 Continuous Evaluation Importance
Demetrios:
Welcome back to the AI Minds podcast. This is a podcast where we explore the companies of tomorrow being built AI first. I'm your host, Demetrios. And this episode, like every episode, is brought to you by Deepgram, the number one text to speech and speech to text API on the Internet today, trusted by some of the world's top conversational AI leaders, enterprises and startups. You may have heard of a few of them like Spotify, Twilio, NASA. And in this episode we are joined by the founder of Coval, Brooke. How are you doing today?
Brooke Hopkins:
I'm doing well. Excited to be here.
Demetrios:
I am very excited to talk all about simulation and especially where and how it pertains to voice agents. I know that you started your simulation journey with autonomous vehicles of all places. Can you talk to me about that?
Brooke Hopkins:
I was previously at Waymo, where I led our eval job infrastructure team. We were responsible for all of our developer tools for launching and running simulations, which is very similar to voice agents. I spent my career building out simulation tooling: how do we create simulations in the first place? Simulations are very complex. Knowing how to configure them and making sure you're testing the right things, as well as then running those on distributed compute, is a really hard infrastructure problem, but also a usability problem. Very complex developer tools. And so when I left Waymo, I was looking at what was happening in voice agents and saw that a lot of the same problems existed for AI around non-determinism. How do you run probabilistic evals, instead of saying I expect this exact test to have this exact output? And then also being able to run these large-scale simulations to show how often your agent is going to do what you expect.
Brooke Hopkins:
So that's kind of how we stumbled into voice: voice is actually very similar to self-driving in that you have a lot of chained models that are trying to autonomously navigate a situation.
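To make the probabilistic-eval idea concrete, here is a minimal sketch, assuming hypothetical `simulate` and `judge` functions you would wire to your own harness; this is illustrative, not Coval's actual API:

```python
# A minimal sketch of a probabilistic eval: run the same scenario many times
# and measure how often the agent meets the expectation, instead of asserting
# one exact output. `simulate` and `judge` are hypothetical stand-ins for a
# simulation harness and an LLM-as-judge or rule-based check.
from typing import Callable

def pass_rate(
    scenario: dict,
    simulate: Callable[[dict], str],     # runs one simulated conversation, returns a transcript
    judge: Callable[[str, dict], bool],  # decides whether the transcript satisfies the scenario
    runs: int = 50,
) -> float:
    """Fraction of simulated conversations that satisfy the scenario's goal."""
    passes = sum(judge(simulate(scenario), scenario) for _ in range(runs))
    return passes / runs

# Usage: gate on a success-rate threshold rather than on an exact transcript.
# assert pass_rate({"goal": "book an appointment"}, simulate, judge, runs=200) >= 0.95
```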
Demetrios:
Now when you say simulation, one thing that I think about obviously is games. And is that along the lines of how I should be kind of adding the two in my mind or what in your eyes is actually simulating these environments for the LLMs or the voice agents or the autonomous vehicles to play in?
Brooke Hopkins:
I think simulation is essentially just real-time synthetic data. It's saying I want to create new environments where an agent, or whatever you're testing (the system under test), needs to be able to interact with all of the things that it assumes are in the world, in a way that's synthetic. So in robotics this might have been creating synthetic environments of a factory floor or a room to clean up. For self-driving, this is how do you create synthetic environments for roads, and being able to create pedestrians, cyclists, vehicles. And then for voice agents, this is creating another voice that's talking to the agent and going back and forth. I think there are many other types of simulation, for example in science or biology, chemistry, et cetera, or for other types of agents in the AI landscape, for example SRE agents or more function-calling agents.
Demetrios:
And one thing that this reminds me of is some friends who, as soon as some voice agents came out, put two voice agents together and started having them talk to each other with different prompts, trying to get to a certain outcome. Is that something that you're thinking about too? Because it's basically bringing this synthetic aspect to it. You're involving one voice agent with one desire, trying to get to a certain goal, and then you're testing your voice agent, the one that's maybe for the company, to see how it fares against that.
Brooke Hopkins:
That's exactly what we do, actually: we build voice agents at Coval that are there to test your agents. They're agents for eval, where our agents are on a mission to test different flows. You're able to say, these are all of the possible scenarios that I want to test, and then we test them in lots of different combinations so that you can see how well your agent operates under those conditions. I think where simulation becomes hard is, how realistic does a simulation need to be, and how much do you need to simulate about the world in order to sufficiently understand how your system behaves?
Demetrios:
Something short of the Matrix, I was thinking. Yeah. How angry do you need to make this simulated customer to be realistic? Or, I can imagine, real life is stranger than fiction, so a lot of times the simulation doesn't go far enough.
Brooke Hopkins:
I think there are two types of realism in simulation. We had this at Waymo as well. There's realism in the sense of this would never happen in the real world: you're defying physics, or someone is just way too polite on the phone. Maybe your agent under test swears at you, and the other agent, the simulated user, just says, I'm sorry to hear that. Right.
Brooke Hopkins:
If you were a customer, you're never going to be okay with the fact that an agent swears at you, or these extreme examples. So these are obviously non-realistic in the sense that the simulation was creating an example that would never happen in the real world. And then there's what happens in the real world that's hard to create in simulation: people just saying things that you totally don't expect, real anomalies in that long tail of edge cases. And then there's everything in between, like background noises, or accents, or all of a sudden speaking in a different language, use cases that you didn't imagine, but once you see them, you think, that makes a lot of sense that my users would do that. Realism is a really hard piece of simulation.
Demetrios:
It reminds me of how my friends were saying that when they simulated a user with a German accent, the voice agent would automatically flip to German. But when they simulated a user with an Italian accent, the voice agent would flip to Spanish and do this mix of Spanish and Italian that they could not understand. Why?
Brooke Hopkins:
We've also heard of two voice agents speaking in English with each other and then all of a sudden switching languages randomly. So you'll have a voice agent that's speaking in English to someone who's speaking in English back, and then it will randomly hallucinate in Spanish or Japanese or something like this.
Demetrios:
The other agent just goes along because it can.
Brooke Hopkins:
In the real world, customers are extremely confused, but in simulation, the other agent just flips into the other language and it's, great, now we're speaking in Spanish.
Demetrios:
Now we're testing the simulation in Spanish. Great.
Brooke Hopkins:
Another really hard problem is that the agents, LLMs, really want to be helpful, and so getting them to hold a certain personality is very difficult: keeping them from flipping into another personality and continuing to have the context of this is who I am, this is my personality.
Demetrios:
Maintaining that for a five-minute conversation has got to be difficult, but for a 15-minute conversation it sounds nearly impossible.
Brooke Hopkins:
Yeah.
Demetrios:
The one thing that I have heard folks really have difficulty with is that everyone speaks in their own cadence and their own way. Every human has almost like a fingerprint of how they talk. And to simulate that so that the voice agent that you are creating can pick up and understand that, oh, Demetrios likes to pause, but that doesn't mean that he's finished with his thought, versus Brooke likes to pause, and that does mean that she's finished and we can go ahead and change turns. Have you seen that the simulations will help along those aspects to make the voice agent almost, like, more robust?
Brooke Hopkins:
Something we do is let you configure the personality of your agents: you can configure things like you're frustrated, or you're a customer, or you're a patient, but you can also do things like changing the interruptivity, changing the different voices, et cetera. And so that can be helpful for adding pauses, making the agent have a higher propensity for speaking slowly or speaking with more pauses. But this is definitely a hard piece. I think VAD, voice activity detection, is still a very unsolved problem. There are more and more models coming out around VAD; Pipecat and LiveKit both have really strong VAD models. But I think it's still a very hard problem that people are working on, especially because it varies even across the country.
Brooke Hopkins:
So people on one side of the country are going to speak differently than on another side of the country. People in countries other than the US are going to speak in different cadences, et cetera. And then you add different languages in there and everything changes. So something we do a lot of, too, is testing in different languages and being able to show how well you perform in those.
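For a flavor of what configuring a simulated caller can look like, here is a hypothetical sketch; the field names are illustrative rather than Coval's actual schema:

```python
# Hypothetical configuration for a simulated caller. The fields are illustrative,
# not Coval's actual schema; the point is that persona, pacing, interruptions,
# language, and background noise are all dimensions you can vary.
from dataclasses import dataclass

@dataclass
class SimulatedCaller:
    persona: str = "frustrated customer"   # e.g. "patient", "new customer"
    language: str = "en-US"
    interruptivity: float = 0.2            # probability of interrupting the agent mid-turn
    pause_seconds: float = 0.8             # average pause before responding
    background_noise: str | None = None    # e.g. "street", "call center"

# Sweep a few dimensions to cover more of the long tail of real callers.
personas = [
    SimulatedCaller(persona="polite patient", language="de-DE", pause_seconds=1.5),
    SimulatedCaller(persona="impatient caller", interruptivity=0.6, background_noise="street"),
]
```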
Demetrios:
Well, speaking of testing, I know that you've set up a bunch of different benchmarks, and I wanted to get into that because I think it's a great service that you're doing for the whole voice communication community. Can you explain a bit of what the inspiration was behind the benchmarks and how you went about doing them?
Brooke Hopkins:
So we started doing these benchmarks because lots of people are always asking us, which model should I use? Which is state of the art? I think every single week there's a new model coming out with new capabilities, and the question is always, should I change to a new model or should I stick with my existing one? And then on top of that, when you're just starting off, which model should you start with is always a big question. I think even what to consider when you're choosing between models can be hard, because in voice AI, something that's even more complex than LLM applications is that you're choosing not just one model, you're actually choosing many different models. You're choosing the speech-to-text, the VAD, the LLM, the text-to-speech. And then sometimes there are even more models on top of that, like additional endpointing and whatnot.
Brooke Hopkins:
I think even for the most technical founders or engineers, this can be a really intimidating, high-learning-curve place to step into. And so what we've done is create benchmarks. We started off with TTS benchmarks: being able to show what you should even be thinking about with your models, and then which are the fastest. For example, what's the consistency of those models? Consistency is really important because it's one thing to say that your model is the fastest, but with voice AI it's super important that it doesn't have latency spikes, because it's better to have a consistently, maybe slightly slower response versus no response, where someone might think, did they hang up? Where did they go? And get very confused. We also include audio snippets, because I think TTS is one of the areas that's more up to taste and what you're looking for in the experience of the voice. So being able to go through and see how different voices perform in different areas is going to be really important for choosing those. And what we're looking to do now is actually revamp this website to bring in even more areas.
Brooke Hopkins:
So this is a sneak peek into what we're doing here around what you should even be considering in speech-to-text: time to first token, time to first response, speed factor, the price for different models, and then the word error rates across the board. One of the big reasons why we wanted to do this is that being able to really clearly compare all these models, tell you what you should be considering, and have a go-to place is something that's just really missing right now. And so we're just doing this for fun, I guess.
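As a concrete reference for one of the metrics mentioned, word error rate is just a word-level edit distance; a minimal sketch:

```python
# Word error rate (WER): substitutions + insertions + deletions, divided by the
# number of reference words, computed as a word-level edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# word_error_rate("book an appointment for tuesday",
#                 "book appointment for a tuesday")  # -> 0.4
```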
Demetrios:
Nice. Well, your fun is benefiting a whole lot of people. And if I can just add one ask for the next version when you create it: I don't know if you ever played video games like FIFA or NBA 2K where they have the player statistics in this circle, with their skills laid out in almost like a web. You know, this person's better at dribbling, but they can't shoot. What I would love to see is the models and what their strong points are versus where their weak points are, and you can see it in almost this web-like fashion.
Demetrios:
This one's really cheap and it's also really fast. But this one, you know, the voices aren't as robust, whatever it may be. And then at a glance, folks can just look at that and say, oh, here's my dream team.
Brooke Hopkins:
Totally. It should also include them standing there posing as you go around the circle, like in the beginning of games where the avatars are doing their little poses.
Demetrios:
That's true. We could make these little avatars for each model. I like that.
Brooke Hopkins:
So that will take a bit more work, but hopefully the AI 3D avatars are getting there so we can just spin one of those up for Deepgram. Curious to hear your input on what Deepgram's avatar is.
Demetrios:
That would be great. Yeah, we'll work on that and I'll send you something over. Now, one other piece that I wanted to get into with you was how you relate all of this simulation to evals and what the connection is there. What are the things that you specifically need to be thinking about when you're dealing with voice and evals? Because as you said before, voice is such a rich medium. It's almost like this double-edged sword: you can get so much information from someone, but because of that it makes things very difficult, on top of the fact that you're using more models and you have more input and more data coming in.
Brooke Hopkins:
So the trade-offs between cascading and real-time models is kind of what you're asking about. With real-time models, where you're able to go voice to voice, you're going to get a lot more sensory input. You're making decisions about whether or not this person is done speaking using all of the inputs from that audio: what's their tone of voice, do they kind of quiet down, in addition to what they're saying. With cascading architectures, for example, endpointing is going to be harder in that respect.
Brooke Hopkins:
You can hear the sentiment in the voice, and so you can have a lot better predicted responses of what you should be saying. At the same time, I think the controllability of these models is still really hard. And so if you're at the frontier of what you're trying to push these models to do, I think for a long time these voice-to-voice models are going to be very hard to use for those cases. It mirrors, though, a lot of what's happening with LLMs, where originally people were saying that one model for everything was going to be very bad because you wouldn't have this controllability and it was going to be inherently mediocre at everything. But once you get to this step function where it's 10 times better at everything, then it doesn't matter that you don't have as much controllability, because you can correct the model afterwards or you can just trust the model's output. I think we're still a ways away from voice-to-voice being 10 times better than cascading models. And there's still this problem that as soon as the voice comes out in real time, it's very hard to correct. So you either still have to do transcription and then analysis on top of that transcription, which kind of defeats the point, or you have to just trust that whatever is coming out makes sense and make sure that your function calling or inputs to that model are really dialed in.
Brooke Hopkins:
So evals will be very important in those cases.
Demetrios:
Talk to me a little bit more about these evals and how you're seeing the best folks do them.
Brooke Hopkins:
I think evals should be thought of as a product. Too many people are thinking of them as unit testing, instead of saying this is what defines our product, almost like a PRD: these are the things that we think our agent should be doing. And let's back into that and use it to shape our product. So for example, being able to say what the happy path is: should my agent be able to book an appointment, cancel an appointment, et cetera. One of the traps we see a lot of people fall into is saying my agent can handle everything. And ultimately, if your agent handles everything, it's going to be pretty mediocre at everything, because that's essentially just ChatGPT out of the box. Instead, I think what you should be doing is really dialing in what the features within your agent are.
Brooke Hopkins:
So features should be thought of as the capabilities my agent should have. And then you're really running simulations of those and being able to show at scale how often those things are working. For some agents, you'll want higher reliability than others. For financial services or healthcare, it's going to be a lot more crucial that all of these agents are adhering to compliance and following regulatory standards, and so you might want to run far more simulations of those agents. For agents that are customer-service facing, where you could lose customers over it but it's not a compliance risk, you might have fewer simulations, but still have simulations. So that's kind of how we think about how you approach simulations.
Demetrios:
So with the simulations, having more of them almost feels like you can get to a place where you really understand your agent's capabilities, and you can say with confidence, okay, we've run so many simulations that we're within, whatever, a 1% error rate or something like that. Do you find that the more simulations you throw at it, the more confidence folks have?
Brooke Hopkins:
What you're saying is, you're really trying to get to a probability of your agent succeeding. So I don't think you necessarily need to run an insane number of simulations, like 10,000, in order to get meaningful results. But I do think the important piece is that you're not just saying, for this exact scenario I expect this exact output, and trying to force it in there, because then your tests are going to be really brittle and very noisy. Whereas if you say, I expect these types of scenarios to succeed, and for the conversation to be resolved, and for my latency to be low, and for these types of things to not happen throughout the conversation, then you're going to get much higher signal. For example, we saw an issue where an agent was reading out a numbered list over the phone, which sounds very weird, but ChatGPT loves to do this when you're interfacing over text, and it makes a lot of sense in that situation. That's an example of something where you run across lots of scenarios and then look for the occurrence of this type of bug: what's the occurrence of my agent repeating itself? Being able to run hundreds of simulations so that you can detect that is going to be a lot more fruitful than trying to recreate that exact thing in one single test.
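A rough sketch of that kind of aggregate check over simulated transcripts; the detectors below are naive, purely illustrative heuristics:

```python
# Aggregate occurrence rates of known failure modes across many simulated
# conversations, e.g. the "reads a numbered list over the phone" bug above.
# The detectors are naive regex/heuristic checks, purely illustrative.
import re
from collections import Counter

def detect_issues(transcript: str) -> set[str]:
    issues = set()
    if re.search(r"\b1\.\s.+?\b2\.\s", transcript, re.DOTALL):
        issues.add("numbered_list_spoken")     # lists read aloud sound unnatural
    turns = [t.strip() for t in transcript.splitlines() if t.strip()]
    if len(turns) != len(set(turns)):
        issues.add("agent_repeated_itself")    # the same turn appeared twice
    return issues

def issue_rates(transcripts: list[str]) -> dict[str, float]:
    counts = Counter(issue for t in transcripts for issue in detect_issues(t))
    return {issue: n / len(transcripts) for issue, n in counts.items()}

# issue_rates(hundreds_of_transcripts) -> {"numbered_list_spoken": 0.07, ...}
```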
Demetrios:
You have these simulations going. How are you seeing the best teams incorporate the simulations into their deployment strategies? Do you see canary deployments, or is it something like A/B testing or champion/challenger, a little bit of everything, or just kind of yoloing it into production?
Brooke Hopkins:
I think the best teams are doing local iteration: being able to reproduce their issue, iterating on that to get rid of the issue, and then running a series of larger tests to ask, okay, does the rest still work? So you might run one scenario to reproduce the issue, and then 10 scenarios to say, okay, do these other basic use cases still work? And then setting up CI/CD so that when you submit your code or make a change, it automatically runs those evals. And then once it's in production, being able to run regression sets. This is very similar to how self-driving release processes work: you'll have CI/CD, you'll have local runs, and then once you're in production, you run these larger scale regression tests periodically. And once you're ready to release to the road, or in this case to your customers, you run a larger release set and maybe do human evals at that point. So the point is not to get rid of human evals, but to actually incorporate them into your eval process and make sure that you're leveraging humans for the cases that are the most useful for them to look at, not just every case.
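A rough sketch of what such a CI eval gate might look like, assuming a hypothetical `run_scenario` hook into your own simulation harness (not Coval's actual interface): a quick smoke tier on every change, and a larger regression tier on a schedule or before release.

```python
# Hypothetical CI eval gate. `run_scenario` is a placeholder for your own
# simulation harness (e.g. the pass_rate sketch above); a non-zero exit fails the build.
import sys

SMOKE = ["book_appointment", "cancel_appointment"]                            # fast, every change
REGRESSION = SMOKE + ["reschedule", "billing_question", "escalate_to_human"]  # periodic / pre-release

def run_scenario(name: str, runs: int) -> float:
    """Placeholder: return the pass rate for one scenario; wire this to your harness."""
    raise NotImplementedError

def gate(scenarios: list[str], runs: int, threshold: float) -> int:
    exit_code = 0
    for name in scenarios:
        rate = run_scenario(name, runs)
        if rate < threshold:
            print(f"FAIL {name}: {rate:.0%} < {threshold:.0%}")
            exit_code = 1
    return exit_code

if __name__ == "__main__":
    tier = sys.argv[1] if len(sys.argv) > 1 else "smoke"
    if tier == "smoke":
        sys.exit(gate(SMOKE, runs=10, threshold=0.90))        # every commit or prompt change
    else:
        sys.exit(gate(REGRESSION, runs=100, threshold=0.95))  # nightly or pre-release
```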
Demetrios:
Is it being flagged within the simulation when certain things don't look right or seem a little bit fishy, where you're saying, yeah, these top 10% should be seen by a human? Or is it kind of random, where you're just letting the human pick and choose and click through?
Brooke Hopkins:
So we're flagging them. We help you define metrics or flags for what should go to human review, and then have someone, either yourself or a teammate, go through those simulations and provide feedback on them. What you can then do is use that to tune your audio metrics over time, making sure that your LLM-as-a-judge is really aligned with your human judgment. But you can also use that for RLHF, or to see the aggregate analysis of your human reviews for things like naturalness and prosody, which are notoriously very hard to test for with automated metrics. I think there's definitely still a place for human reviews. That's super important.
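A small sketch of that review routing and judge-alignment check, with illustrative field names rather than any actual data model:

```python
# Route borderline simulations to human review and measure how often the LLM
# judge agrees with human labels. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class SimResult:
    scenario: str
    llm_judge_pass: bool
    latency_p95_ms: float
    human_pass: bool | None = None     # filled in after a human review

def needs_human_review(r: SimResult) -> bool:
    # Flag failed or slow runs rather than sampling purely at random.
    return (not r.llm_judge_pass) or r.latency_p95_ms > 1500

def judge_alignment(results: list[SimResult]) -> float:
    """Agreement rate between the LLM judge and the human reviewers."""
    labeled = [r for r in results if r.human_pass is not None]
    if not labeled:
        return float("nan")
    return sum(r.llm_judge_pass == r.human_pass for r in labeled) / len(labeled)
```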
Demetrios:
I really appreciate this idea that evals should be seen as products or PRDs. I've heard Raza from Humanloop say there should be no difference between product metrics and your evals.
Brooke Hopkins:
I think something we're seeing a lot is that people across the organization are using these simulations to show impact on the product and to show their customers how well their product is doing and how it compares to others. So sales teams are using this, customer deployment teams, product managers deciding where to allocate engineering resources, as well as engineers, who are really the drivers of these simulations, using them to consistently improve the agents. But I think this is a trend: in the same way product and engineering are melding, and all sorts of other roles are melding and shifting, I think the evaluation of how well a product is working is very technical and is increasingly cross-functional, where everyone has to have that technical capability to understand how you compare agents across the board.
Demetrios:
And how you can do the labeling after the fact, how you know what good looks like, what's good enough. All of that is not the engineer's job. That's more of the subject matter expert's job.
Brooke Hopkins:
So I think it will be really exciting to see where product and engineering teams go, and where leadership and product teams go, within this more technical world where you can't just say we want the agent to do X, Y, or Z. You actually need to be able to identify problem areas and know where to allocate those engineering resources.
Demetrios:
When folks are rolling out new updates to their voice agents, is it just that they put the API out there and then Coval goes and pings it, almost swats it with a whole lot of different simulations, and then it comes back with a score on how they did and you can read through: okay, here are the thousand agents that just hit our new voice agent, let's see how it did?
Brooke Hopkins:
So there are several setups, depending on the platform; there are a lot of platforms out there. You have fully hosted solutions like Vapi, or Deepgram's voice-to-voice API, or OpenAI's real-time API. Or you have self-hosted solutions like LiveKit or Pipecat. The deployment environments for these look different. If you're with LiveKit or Pipecat, you'll actually have code that you're deploying to a system, and so in those cases you use CI/CD to deploy your agent, and therefore you can run Coval simulations in your CI/CD. For Vapi or other hosted solutions, where it's just a prompt or you're hosting it on another application, what you can do is set up scheduled evals or pings, so that you can run these simulations continuously or just trigger them whenever you launch to production.
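For hosted, prompt-only deployments with no CI pipeline to hook into, a scheduled trigger can stand in; a rough sketch reusing the hypothetical `gate()` and `REGRESSION` helpers from the CI sketch above (in practice a cron job or scheduler service would replace the sleep loop):

```python
# Hypothetical scheduled eval loop for a hosted, prompt-only agent, reusing the
# gate() and REGRESSION placeholders sketched above.
import time

EVAL_INTERVAL_SECONDS = 6 * 60 * 60   # every six hours

def scheduled_evals() -> None:
    while True:
        if gate(REGRESSION, runs=50, threshold=0.95) != 0:
            print("regression detected: alert the team")   # e.g. page or post to a channel
        time.sleep(EVAL_INTERVAL_SECONDS)
```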
Demetrios:
I do like that idea to make sure that there's no drifting happening because as we know, as soon as you release that model, it's already stale.
Brooke Hopkins:
We get a lot of questions like, well, I only change the prompt every so often, so I'll just run it every time I do that. And you can totally do that. But being able to catch things as they drift or as different systems change, there are so many pieces that I think running continuous evals is really helpful.
Demetrios:
Don't let that drift. That's something that probably irks all the data scientists when they hear someone talking about that: no, you can't do that, the data is not fresh. As soon as it goes out, it's not fresh.
Hosted by

Demetrios Brinkmann
Host, AI Minds
Demetrios founded the largest community dealing with productionizing AI and ML models.
In April 2020, he fell into leading the MLOps community (more than 75k ML practitioners come together to learn and share experiences), which aims to bring clarity around the operational side of Machine Learning and AI. Since diving into the ML/AI world, he has become fascinated by Voice AI agents and is exploring the technical challenges that come with creating them.