Podcast·Jan 17, 2025

AI Minds #050 | Vaibhav Saxena, Co-Founder at Infer (Part 2)

Demetrios Brinkmann
Episode Description
In this episode, Vaibhav Saxena delves into the fascinating world of personalized voice agents.

About this episode

Vaibhav Saxena, Co-Founder at Infer, is our first guest to make a second appearance on the AI Minds Podcast. Infer is a company that accelerates revenue growth with AI voice bots that automate phone conversations and execute tasks like humans for insurance and lending.

Listen to the episode on Spotify, Apple Podcasts, Podcast Addict, or Castbox. You can also watch this episode on YouTube.

In this episode of AI Minds, Vaibhav Saxena returns to discuss the innovations and future of personalized AI voice agents.

Vaibhav, who is actively involved in developing AI solutions, discusses the psychological impact and engagement that personalized voice agents have on users. He highlights how subtle variations in voice, including tonality and speech patterns, significantly affect user interaction and conversion rates in business applications like lead qualification.

During the conversation, Vaibhav emphasizes the importance of voice personalization in enhancing user experience and engagement, drawing parallels with the evolution of personalization in other media such as email and SMS.

He explains the intricate process of experimenting with different voices and scripts to optimize performance and user satisfaction. This involves controlled experiments where only one variable is altered to determine its efficacy, thus gradually refining the AI’s approach.

Vaibhav also touches on the impact of regional accents and the need for AI to adapt to various demographic factors such as age and location, which can influence user preferences and responses. The discussion also explores potential future developments where AI voice agents could personalize interactions based on accumulated user data, akin to digital cookies, enhancing the efficiency and satisfaction of interactions across different platforms and services.

This episode provides insightful reflections on the current achievements and future potential of AI technology in creating more intuitive and personalized user interactions through voice agents.

Fun Fact: Vaibhav is exploring the creation of personalized voice agents that can cater to different customer segments by varying not just the content of what they say, but how they say it. This includes aspects of speech like tonality and prosody, intricately tuning the responses to enhance user engagement.

Show Notes:

00:00 Personalized AI Voices Revolution

05:11 Optimizing Voice AI: Conversion Focus

08:26 Evaluating Voice Speed and Impact

10:13 Voice Agent Segmentation Strategies

14:23 Personalized Cold Calling Strategies

19:29 Dynamic Personalization through Variable Tuning

21:19 Effortless Conversation Connection

Transcript:

Demetrios:

Welcome back to the AI Minds podcast. This is a podcast where we explore the companies of tomorrow being built AI first. I'm your host Demetrios and this episode, like all other episodes, is brought to you by Deepgram, the number one speech-to-text and text-to-speech API on the Internet today, trusted by the world's top enterprises, conversational AI leaders and startups, some that you may have heard of like Spotify, Twilio, NASA and even Citibank. We are joined for the first time ever by a repeat guest because at the end of our first conversation, Vaibhav, you said some things that made me have to say, hold on, let's run it back. What are you talking about there? Let's have another conversation on this because it is not worth putting an extra 25 minutes on the end of this podcast. Let's have a whole 20 minute conversation about it. So what did you say to me?

Vaibhav Saxena:

I said that during our lifetime we have come across a certain set of people. I mean, we all love humanity, but we have come across a certain set of people whose voices, the way they speak, the way they articulate, we have paid more attention to compared to other people. And there's a psychological effect and impact where we tend to become much more engaged with certain voices which feel more familiar and relatable to us, right? Within the whole AI voice agent world, I like to think what we have seen in the last year is the wave 1.0. For the wave 2.0, yes, the infrastructure would improve, the latencies would go down; those are unquestionable, undeniable facts. But where I would put my money, also because I'm building my own company, is how we can have personalized voice agents. It's what we saw happening in email as a channel and SMS as a channel, where first they impacted the whole mass media and then it slowly became more personalized and more catered to different kinds of customer segments.

Vaibhav Saxena:

That's the bit about personalization. And within voice you have so many different variables. You have how they're speaking, what they're speaking, the tonality, the prosody, the way they are asking the questions. Are the questions highly concise? Are they more open-ended questions? And again, like I mentioned, there's a psychological impact to it. You tend to relate to it more. That's the unknown territory that we are trying to explore: what does Demetrios like, or what does Vaibhav like, and what do people like me and people like you like? And the whole reason being engagement, right? I mean, we love Morgan Freeman, but probably we know so many other people's voices which we don't probably know.

Vaibhav Saxena:

Right. So that is the journey where we are in, and it's a highly experimental journey where we run a lot of experiments and we try to find out, for different sets of customer segments, what kinds of voices, tone, questions, and ways of handling objections work out. Because it's so different for different kinds of age groups and where people come from. Someone from New York might just prefer a New Yorker accent compared to when someone from Texas speaks to them. And it's such an exciting field to dig deeper into. And obviously we would see a lot of work going in this direction.

Demetrios:

So I could see how we want different voices personalized to our tastes. How do you, as a builder of voice agents, know what signals customers are giving you, that strong signal for a propensity to enjoy a certain voice versus "I don't like that voice"? Is it just that I stay on the line longer or I do what the voice is asking me to do? Because I doubt you're going to ask the customer, "So did you like that voice? Should we change it? Should we make it a little more high pitched? What do you think, would you like to have a more intriguing voice?"

Vaibhav Saxena:

Because we are in the business of lead qualification, which is almost like what happens in the sales domain, for us what becomes really important is just the conversion rates. They are our North Star metric. And we try to see, and rather understand through a lot of our evaluations, how long the customer has stayed and what the engagement is looking like. And even today, where we can say we are slightly maturing from the early phase to the intermediate phase of voice agents, when we talk to customers, some customer came to me and they were like, your voices are really robotic. And this has happened to so many voice AI companies. Well, probably that's just missing the human part, which is, I a lot of times say, that when we talk we have so many imperfections.

Vaibhav Saxena:

We use ahs and ums, and when you see, or rather hear, any of these AI voices, they are just flawless, which just tells you that this is not human. And hence these imperfections also give rise to engagement. We have seen, when we ran a lot of experiments, that certain types of voices resulted in higher conversion versus certain other kinds of voices. And all these analytics and experimentation help us, to my previous point, understand what set of voices work and result in a much higher conversion rate for us.

Demetrios:

So you were mentioning how many different variables there are, right, between the tonality and then you also talked about how for you a very strong signal is do they actually get through the qualification phase? And that's a huge starting point. I wonder if you are seeing false positives because no matter what the voice, the person would still get through the qualification phase. But that doesn't necessarily mean that they like that voice and that that voice should be their personal voice.

Vaibhav Saxena:

So we always keep, within this whole discovery of finding the right set of voices, the way of asking questions, and how much personalization we are doing, something called a control. And the control here, in the world of experiments, is something which we are not touching. That essentially sets a kind of threshold or benchmark that we are competing with, where we are not changing anything. And then there are other experiments we are running, which are V1, V2, V3 up to VN, where we are changing the parameters and seeing what happens. I always rely on the law of large numbers, where I'm running more and more experiments and each experiment is given some thousands of calls. The false positives then have a far lesser impact.

Vaibhav Saxena:

And again, my control is also telling me whether I've seen a significant improvement or not. So that's a good way for us to figure it out. A few other thoughts also come into this picture. We have seen so many different companies come in with different sets of voices. Some voices are a bit fast. Now they're also giving us certain sliders to increase the speed or decrease the speed. And that also makes me wonder whether certain people from certain backgrounds do prefer when people speak a bit fast, right? Certain people have far less patience, where they're like, oh, can you just come to the point? And as we think more deeply about it, human decision making is kind of correlated, right? Otherwise we would have probably, in a board meeting, taken everyone's ideas with the same grain or same pinch of salt. When certain people speak, it impacts us more, and we do really need to come up with some more metrics which have a strong correlation. We have seen larger correlations, but not a very fine-grained correlation, such that if I just change the type of questions, that's the kind of impact that will happen. Or if I change how the objections are being handled.

Vaibhav Saxena:

For two different age groups, one age group might be 25 to 34, another might be 45 to 55. It has a huge impact. We do see that, but the number isn't in the billions right now in terms of the number of calls that we have done.
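
The control-versus-variant comparison described here can be sketched as a two-proportion z-test on conversion counts. This is only a minimal illustration under stated assumptions, not Infer's actual method: the function name, the call counts, and the choice of a pooled z-test are all hypothetical and do not come from the episode.

```python
import math


def two_proportion_z_test(control_conversions, control_calls,
                          variant_conversions, variant_calls):
    """Compare a variant's conversion rate against the untouched control.

    Returns (z, p_value) for a two-sided pooled two-proportion z-test.
    """
    p_control = control_conversions / control_calls
    p_variant = variant_conversions / variant_calls

    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (control_conversions + variant_conversions) / (control_calls + variant_calls)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_calls + 1 / variant_calls))
    if std_err == 0:
        raise ValueError("conversion rate is 0% or 100% in both arms; test is undefined")

    z = (p_variant - p_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 1,000 calls per arm.
z, p = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With only a few hundred calls per arm, the same three-point difference in conversion rate would not be distinguishable from noise, which is why the volume of calls per experiment matters.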

Demetrios:

Yeah. It just reminds me when you say that how much is not necessarily on the voice. And it is when you're looking at the segment, just as if you would segment out a marketing campaign, an email marketing campaign, for example, you want to think about segmenting out for the voice agents in the way that you're looking at where somebody's from, what age group they're in, how they've interacted in the past with a voice agent and what they've said. So if they said, yeah, hurry up, get to the point, that's a very strong signal that, with this person, we can try to prompt the agent to say things in less words. And it's just all of that is not related to the actual voice that the voice agent is using. Then you get on top of it, you layer on top of it something of a voice and that's part of it. But the cadence, the ability to get to the point, all of that stuff doesn't necessarily have anything to do with if the voice is high pitched or if it's male or female or any of that.

Vaibhav Saxena:

At times we also sort of see that the way of asking questions matters, because it's a lead qualification call, and certain voices have a certain way of saying things that does have an impact. We have seen female voices have a much higher qualification rate. I don't know, for a lot of people, this might come as a surprise. For us, it was like the difference really matters to the second most voice, which had the highest conversion. But there is a difference that we see.

Vaibhav Saxena:

And obviously the other set of parameters have more impact at times. And like you mentioned, this is not completely novel, I would say. The novelty part lies in, how can you tie up with a type of voice? We have always been doing this for marketing campaigns, how marketers have segmented that. If this is my target segment, I'm going to have this kind of a messaging. We are trying to do the same thing, but we are also saying, voices do have an impact. And we are seeing that because we run an experiment where the set of questions are the same, but it's just we are using different voices.

Demetrios:

Yeah, okay.

Vaibhav Saxena:

And then we are saying, that's weird. Three voices are performing way better compared to the other three voices that we are using. So just getting the inbound traffic, dividing into different segments of how much traffic each version gets and just seeing it, keeping everything just constant, which is telling us something that people might prefer certain voice or certain other voices. And I'm so open to the fact that I hope there's a universal voice which just works with everyone.
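
Dividing inbound traffic between voice versions while holding everything else constant can be sketched with deterministic, hash-based assignment. This is a minimal sketch under assumptions: the variant names, the traffic shares, and the use of a caller ID as the hashing key are hypothetical, not details given in the episode.

```python
import hashlib


def assign_variant(caller_id, experiment, traffic_shares):
    """Map a caller to a variant in proportion to the given traffic shares.

    traffic_shares: non-empty dict of variant name -> share, summing to 1.0.
    The same caller always lands in the same variant for a given experiment.
    """
    key = f"{experiment}:{caller_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Turn the first 8 bytes of the hash into a point in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64

    cumulative = 0.0
    for name, share in traffic_shares.items():
        cumulative += share
        if point < cumulative:
            return name
    # Guard against floating-point rounding in the shares.
    return list(traffic_shares)[-1]


# Hypothetical split: the control gets 40% of calls, three voices get 20% each.
shares = {"control": 0.4, "voice_a": 0.2, "voice_b": 0.2, "voice_c": 0.2}
print(assign_variant("caller-0001", "voice-test", shares))
```

Hashing on the experiment name as well as the caller means a caller's bucket in one experiment does not determine their bucket in the next one.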

Demetrios:

Would make life easier.

Vaibhav Saxena:

So much more easier. Yeah.

Demetrios:

It is fascinating to think about too. What features in the end user are going to give a little more insight into what voice they would like or how they would like to interact with their voice agent. Like is it age or is it location or is it where they grew up versus where they actually are in this moment in time? So thinking about all those different features is another piece that maybe you can make generalizations around, maybe you can't. Maybe you think that, wow, it's really important if it's this age group you do this type of thing, but then you recognize that that's just an assumption and it actually doesn't really help when we're trying to personalize the agents.

Vaibhav Saxena:

True. Which also reminds me that during the whole startup journey, as I started to learn about the aspect of cold calling, I would watch so many different videos by veterans in this industry. They would do a little bit of research about the prospect whom they are trying to cold call. And if they know someone is from New York, they're going to probably crack a joke or two about New York because you want that first 10 seconds to just hit them up. So the part that you mentioned, which is where did they grow up, where were they born: if you have those ideas and if the voice agent is able to adapt itself through experimentation, through a lot of data, then it's, I already have the data about this person I'm calling and I'm going to probably, just to build some reputation and have better engagement, maybe try to crack a joke. But that might not go well with someone who is probably 60, 65 plus, right? So I think a lot of those parts will be learned as we try to do more and more of this experimentation and understand what works for each segment, and there are so many different segments, and keep on pushing our engagement better, keep on improving our qualification rates as well.

Demetrios:

Well, it feels like the experimentation part is on the technical side of things, but a little bit difficult to track because you're running different experiments. How do you know what works and how do you know which signals you're getting and with what audience? And then how do you codify that insight? Because you see, oh wow, it looks like there's an insight when there are people from XYZ place and we make a joke right off the bat about that place, then people tend to enjoy it between these age groups. How do you then go and codify that and make sure that that now becomes standard practice? And on the other side of things, how do you just continuously experiment and never settle?

Vaibhav Saxena:

So again with all the experiments, the rule of thumb is keeping something constant and I might just keep my script the same. The kind of questions which I'm asking, the way I'm asking, I'm just going to keep that same. I will just only experiment with voices. Once I have figured out, let's just say six voices which are really good and converting my customers at a much more higher rate, then it is time for me to test out my script, which is the way I'm asking questions. Open ended or closed ended. It is an iterative process because if you are changing too many variables at the same time, you're not really arriving at a certain conclusion.

Vaibhav Saxena:

And then for each segment of customers, now how you segment can be the way you want it. Could be based on the age, could be based on the location. They are in urban area or suburban area. Each group or each segment can have their own share of experiments running. And hence the volume of calls becomes significantly important over there. And the faster you might be able to arrive at much more stronger insights which will lead you into a direction where, I'm seeing very high number of qualifications. And if you again take a step back and see how a lot of the sales teams around the world are trained, this is what happens when the sales coaches are reviewing the calls, they're saying, these are a set of things that you should be saying, they're trying to come up with, call it a superscript.

Vaibhav Saxena:

And the way you are communicating them, and this is by listening to the recordings and understanding, okay, what's working, what's not working, and building that intuition. But with voice agents, which are completely digital, you can run a ton of experiments and get those insights and then in some ways codify it, yet keep it flexible.
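
The iterative process described in this answer, where one variable is changed at a time and the winner is locked in before moving on to the next, can be sketched as follows. This is a hedged illustration rather than Infer's implementation: `measure` is a hypothetical stand-in for running a full batch of calls and returning the observed conversion rate, and the voices, scripts, and rates are made up.

```python
def tune_one_variable_at_a_time(baseline, options, measure):
    """Tune variables in order, holding all others constant each round.

    baseline: dict of variable -> starting value (the control configuration).
    options:  dict of variable -> list of candidate values, in tuning order.
    measure:  callable(config) -> conversion rate for that configuration.
    """
    best = dict(baseline)
    best_rate = measure(best)
    for variable, candidates in options.items():
        for candidate in candidates:
            # Only `variable` differs from the current best configuration.
            trial = {**best, variable: candidate}
            rate = measure(trial)
            if rate > best_rate:
                best, best_rate = trial, rate
        # `best` now has this variable locked in for the following rounds.
    return best, best_rate


# Hypothetical conversion rates standing in for real call experiments.
VOICE_RATE = {"voice_1": 0.10, "voice_2": 0.14, "voice_3": 0.12}
SCRIPT_LIFT = {"open_ended": 0.00, "closed_ended": 0.02}


def fake_measure(config):
    return VOICE_RATE[config["voice"]] + SCRIPT_LIFT[config["script"]]


best, rate = tune_one_variable_at_a_time(
    baseline={"voice": "voice_1", "script": "open_ended"},
    options={"voice": list(VOICE_RATE), "script": list(SCRIPT_LIFT)},
    measure=fake_measure,
)
print(best, round(rate, 2))
```

A real version would compare each trial against the control with a significance test rather than a bare `>` comparison, and could be run separately for each customer segment.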

Demetrios:

Yeah, that's brilliant stuff, man. Well, this is super cool to think about. Is there anything that we didn't get to talk about that you want to mention or talk about?

Vaibhav Saxena:

I think we briefly did touch on it, but this was around the personalization bit. Experimentation kind of gives rise to personalization because for each customer segment we are understanding and nailing down what voice, how they like to be asked questions, in which way, right? That just makes a certain group's liking increase for certain sorts of voices and the way they are being talked to. That's the only bit. Yeah. Awesome.

Demetrios:

Yeah. This for me is such a cool idea that I haven't thought about. And it is quite complex in how you can successfully do it. I like how you're saying you want to keep basically everything static and then just change one variable and see if it has a difference in the outcome that you're trying to go for. And once you tune that and you find the best variable for that one variable, then you lock that in and try and start tuning different variables and see what happens. And the level of personalization. I see a world where when I interact with different voice agents, they understand what I like and I don't necessarily have to tell them. They just know that, hey, if they want to get something done and they want a higher likelihood of me completing whatever it may be, the questionnaire or the telephone call, they've got to do it in X, Y, Z fashion.

Demetrios:

I wonder if it's not like we will have cookies and the cookies will be a little bit more personalized for us in that regard too.

Vaibhav Saxena:

Mm.

Demetrios:

That could be something wild to think about, but it is maybe a bit far fetched that there's gonna be that data sharing that if I call my bank, they're gonna know how I like my voice agent. And then if I call the Vodafone, they're gonna know how I like my voice agent. You know, it might be a little bit ridiculous to think that that data sharing would happen in like a cookie form.

Vaibhav Saxena:

Someone had told me this, that when you talk and when the other person responds, everyone hopes, I hope they just get me. And that's it. That's when you are so badly hooked into the conversation. Like when someone is at the end of the conversation and they're like, what question should I have asked you? And you tell them, you should have asked me this question. And that just hooks you. I think I would like to live in that world where I'm calling a customer support AI agent and it just gets me, the way I like to talk: very short, crisp answers.

Vaibhav Saxena:

Don't ask me very open ended questions. That's it. A voice which is very empathetic. Maybe a lot of other people would also like it, but I also prefer like short, quick questions. So I hope that we have that world and that world is like all interconnected.

Vaibhav Saxena:

When I go to order my food as well at a drive-through, it's the same way that those questions get asked. So I hope that world is probably around the corner for us.