

AI Minds #069 | Anshuman Singh, Co-Founder at Think41

In this episode, Anshuman Singh shares how Think41 is building human-like voice agents and rethinking UX for India's voice-first future.

Anshuman Singh, Co-Founder at Think41. Think41 is a full-stack Generative AI consulting firm founded by the creators of HashedIn (acquired by Deloitte US), specializing in building secure, compliant, and human-like AI agents. The firm partners with enterprises and high-growth startups to co-engineer AI products spanning conversational AI, voice systems, and agentic platforms.

Listen to the episode on Spotify, Apple Podcasts, Podcast Addict, or Castbox. You can also watch this episode on YouTube.

In this episode of the AI Minds Podcast, Anshuman Singh, co-founder of Think41, shares how his team is reimagining voice agents for India's diverse, voice-first future.

Anshuman reflects on his journey, from building a startup acquired by Deloitte to launching Think41, which just celebrated its one-year anniversary.

He discusses the company's early breakthrough with a recruitment voice bot that screened thousands of applicants and kickstarted its deep dive into generative voice AI.

The conversation explores the real-world complexities of deploying voice agents—latency, cultural nuance, user frustration—and designing for seamless, human-like interaction.

Anshuman and Demetrios unpack how India’s multilingual landscape is shaping the future of UX, where voice is not just an interface, but the interface.

Listeners will walk away with insight into building responsive, intelligent voice systems and the subtle etiquette required for AI to truly feel conversational.

Show Notes:

00:00 Embracing New Waves and Mistakes

04:07 "AI Bot Revolutionizes Recruitment"

08:30 Enhancing User Interaction Feedback

12:45 Multimodal Voice AI Assistant Launch

16:26 Gesture Interfaces: Context Matters

17:12 Bot Interaction Etiquette in Offices

22:10 Latency vs Throughput in Scaling

24:33 Startup's Exciting Early Years

Transcript:

Demetrios:

Welcome to the AI Minds podcast. This is a podcast where we explore the companies of tomorrow being built AI-first. I'm your host, Demetrios, and this episode, as always, is brought to you by Deepgram, the number one speech-to-text and text-to-speech API on the internet today, trusted by the world's top conversational AI leaders, enterprises, and startups, some of which you've probably heard of, like Spotify, Twilio, and NASA. I have the luxury of being joined by the co-founder of Think41, Anshu. How are you doing today, man?

Anshuman Singh:

I'm doing great. It's a very interesting day you caught us on. We just completed the one-year anniversary of starting the company, Think41. So great to be here, thanks.

Demetrios:

Huge congrats on that.

Anshuman Singh:

Thanks a lot.

Demetrios:

I know that company building is not the easiest thing in the world, and this is also not your first time doing it. Can you tell me about your last company?

Anshuman Singh:

We started in 2010, and that was the cloud wave. I still remember four engineers working out of an apartment, having a lot of fun; that's what I remember from the time we started last. That company was in cloud engineering, and we served a lot of enterprises as well as quite a few well-known unicorn names in India. The company got acquired by Deloitte in 2020, so it was exactly a ten-year journey; we call it the decade of our life. And funnily enough, just about when it got acquired, the next big wave started, which was all the gen AI and ChatGPT stuff, and we were raring to go. So we started last June, and that's how we just completed about one year.

Demetrios:

You couldn't sit on the sidelines for too long. Did you bring the same cast and crew with you?

Anshuman Singh:

Exactly, same cast and crew.

Demetrios:

So basically repeating the same playbook, just...

Anshuman Singh:

With the new wave, and hopefully making newer mistakes this time. In a lot of ways these were big discussions, right? Having gone through that journey, with a lot of pain and a lot of fun, it was a big decision to start again. But having a good set of friends to take that journey with is what pulled us in this direction. And ChatGPT, this gen AI wave. I remember our time in cloud engineering: we were basically engineers trying to convince everybody that this is an awesome thing and you should adopt it. This time it's the other way around: all the business guys are coming to the engineers and saying, why don't you build me a conversational bot or an agent which can do a lot of the work?

Demetrios:

They're being pushed by their managers or their board or whoever their stakeholders are to adopt this, and it's more of a top-down approach than bottom-up. I can see that a hundred percent. Now, I wanted to center the conversation around the voice agent experience that you've had, and really the expertise in taking voice agents to production. Because I think there is a gigantic delta between the incredible demos we see when we open our social apps versus something actually out there being used at scale. I've heard a lot of people talk about how there are fundamental tools missing for bringing AI to production today. Maybe you can start with some of the big pains that you've seen while taking voice agents to production.

Anshuman Singh:

I'll start with how we got introduced to voice, and it was a very natural introduction, because the first thing you do after starting the company and registering it is start to hire. Imagine four founders looking to grow the team very quickly. India is a large country and we were in the AI space, so the interest and applications we got were in the thousands. Being an AI company, instead of going through those thousands of applications, we just built a bot. These were early days; one year in AI is early, but one year back we built a bot which was able to do the first-round screening. We have now launched that as an independent product of its own called Recruit41. That was our first experience: we went through some 20,000 profiles and ended up hiring our whole team from them. Our team is right now 60 people strong.

Anshuman Singh:

So we ended up hiring all of them. And that's what hooked us: in a lot of ways, voice is one of the new capabilities. With gen AI there is a lot of talk about productivity increases and whatnot, but voice brings in a capability which did not exist earlier, and that's what got us started. Scaling, though, was another thing altogether. The way I would say it is that the demos, the POCs, and the natural sound of the voice are what pull you in as a consumer, as a customer, or even as an interviewer. But what keeps you there is the richness of the conversation, the humanness of it, the empathy, the whole context around not having to repeat yourself.

Anshuman Singh:

It's understanding the cultural context and whatnot. One thing I would highlight, and this has been, I would say, an unsolved problem till now, is for voice agents to be able to speak Indian names really well. India is a large country, with a lot of regional languages, dialects, and whatnot. The shortcut we took in our interview design was to not say the name at all. We would avoid saying your name, because there's a lot of risk: if you say somebody's name wrong, it does not feel right. That was one of the things we saw as we moved over to a lot of the enterprise use cases. The first thing which hits you there is latency.

Anshuman Singh:

We came across some interesting scenarios. As a human, like we are talking right now, I can pause; that is natural. I can say, uh-huh, I hear you; that is natural. And then I can bring in all the context from the previous conversations and whatnot, right? We were seeing a lot of latency in the initial days, especially whenever we did some sort of agentic lookup or a RAG lookup. So we introduced just this "aha", and the user experience around it: you say something, you finish your thought, there is the pause, and the agent says "aha". It gives you a lot of relief that somebody heard me, somebody acknowledged me. And then even if it is a second later that it responds back, that gap just feels very natural.

Demetrios:

That is such a huge insight. As humans on the other side, when you can't see something happening, you are not sure if what you just said was received.

Anshuman Singh:

Yeah.

Demetrios:

And so that, as simple as it is, gives you such comfort, for lack of a better word, that what you just said was registered, and now whatever is on the other side of the line is processing it and thinking. I've actually heard other folks say they've taken it a step further: when it is going and doing a lookup, it will mention, okay, let me look that up for you.

Anshuman Singh:

For the last 20 years or so we have all been building web applications, and in a web application a very important part is showing that loader icon, showing that there is some progress being made. In voice, it is not the user interface, it is the sound, and there needs to be some similar mechanic where you're telling the user: I'm hearing you, I'm working on it, and then you get back with truly insightful next steps or recommendations. That's where the second piece comes in: how do you ladder the context so that you don't necessarily have to get to the best response in the first second? How do you get to a decent enough response, then improve upon it a little bit later, and a little bit later again? So we sometimes have these ladder hierarchies where there is a quick immediate response, which could be that "aha", an acknowledgement, or maybe a one-line thing, followed up with the much more detailed thing people actually want to hear, which requires database calls, RAG, and all this agentic processing to pull up the next content you want to talk about.
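To make the ladder concrete, here is a minimal sketch of that pattern in Python. The `speak` and `rag_lookup` functions are hypothetical stand-ins (the conversation names no specific TTS or retrieval stack); the point is only the structure: emit a cheap acknowledgement immediately, start the expensive lookup in the background, and deliver the detailed answer when it lands.

```python
import asyncio

# Hypothetical stand-ins; a real agent would call a TTS API and a RAG pipeline.
async def speak(text: str) -> None:
    print(f"[agent says] {text}")

async def rag_lookup(query: str) -> str:
    await asyncio.sleep(1.5)  # simulate database calls + retrieval latency
    return f"Here is the detailed answer for: {query}"

async def respond(user_utterance: str) -> None:
    # Rung 1: instant acknowledgement, so the user knows they were heard.
    await speak("Aha, got it.")
    # Rung 2: start the slow lookup in the background.
    detailed = asyncio.create_task(rag_lookup(user_utterance))
    # Optional middle rung: a one-line holding response while retrieval runs.
    await speak("Let me pull that up for you.")
    # Rung 3: the full, detailed answer once the lookup completes.
    await speak(await detailed)

asyncio.run(respond("What's the status of my order?"))
```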

Anshuman Singh:

I think that is where a lot of the engineering goes. But so does conversation design: in a lot of ways the UX is no longer the user interface, it is the audio interface, and that needs a lot of looking into and careful design. One other thing we found was the question of how long an agent should talk. We were doing a use case for one of the interior design companies, and we designed the prompt based on typical user conversations. What we found was that if you see a face and you're talking to a person, like you are doing with me, you are very generous with your time; you let the person talk for a longer period. There is an emotional connect, there are facial expressions and whatnot. How do you express that in voice as well? For us, the experience was that our first design was not great: the agent would go into a description, then ask a question.

Anshuman Singh:

The person answers with an okay, or yeah, I want that, or I like that. So it is the agent doing most of the talking in the interaction. Good communicators know that you want to listen more than you talk, and the same applies to agents and how you design them: how do you keep the agent's answers short? How do you keep its questions more open-ended, so the user feels more in control and can talk about more things? So that was another experience.
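One way to encode that guidance is directly in the agent's system prompt. The wording below is purely illustrative, an assumption about how such constraints might be phrased, not the prompt Think41 actually uses:

```python
# Illustrative system-prompt fragment for a "listen more than you talk" agent.
# This phrasing is an assumption for the sketch, not Think41's real prompt.
SYSTEM_PROMPT = """
You are a voice assistant for an interior design consultation.
Conversation style rules:
- Keep each reply under two sentences; never monologue.
- Ask one open-ended question at a time; prefer "what" and "how"
  questions over yes/no questions.
- Let the customer do most of the talking; only summarize when asked,
  or when confirming a decision.
"""
```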

Demetrios:

Actually, there's this fine line that I have recognized when I am dealing with LLMs through the computer, not voice agents, and I want an answer but maybe I haven't been clear enough, so the LLM has to come back to me with questions. I don't want to sit there and be interviewed. So you have to walk this tightrope of getting enough information without exacerbating the situation, and without taking for granted that the person will keep giving you answers. I've seen some folks handle it, again on text-based apps, by giving snippets and then letting the user click "see more" or click on buttons. In voice you don't really have that. Maybe you can say, is it something around this or something around that? But because voice is a one-channel, sequential system, it's very hard to capture that. Have you seen ways of going about it so that you can give somebody a taste and, if they want more, they can click in deeper?

Anshuman Singh:

Perhaps not exactly that use case, but something in a similar direction. Even while we are having this conversation, I'm very focused right now and only talking about this, but imagine a work setting: you are in a meeting, also checking your calendar, also checking the messages you're getting. In a lot of ways voice allows us to be multimodal: there's a part of our brain focused on talking, hearing, and listening, and another part which can click things, move things around, and, like you said, click on a "see more". So one of the interfaces we are building, currently under development and due to launch in the next month or so, comes out of our consulting work. We are a full-stack generative AI consulting company, and a lot of our time goes into talking to potential customers, trying to understand their use cases. Coding these days is very easy, but understanding the context and what needs to be built is much harder. So we are building an assistant where a client, in their own free time, can just go in and talk to the bot, and while the conversation is happening, live discussion notes are appearing. Think of the left side of the screen as the conversation going on, and on the right you have these cards which show up.

Anshuman Singh:

Say you want to build Google authentication: a card appears and the person just clicks agree, or maybe. And this is the other part of voice: you cannot rely too heavily on the words it picked up. So you want some sort of acknowledgment, especially for high-intent scenarios where you want to be doubly sure you got the right things. The person would say: I agree, this is what I meant. Or you can even show assumptions. If somebody says, I want to build a very secure application, you can show three assumptions, and the person can say yes, that is what I meant, or no, that is not what I meant.

Anshuman Singh:

So the bandwidth of information you collect when you combine voice with these interactions basically gets multiplied. This is trying to merge the well-known, accepted way of interacting, the web user interface layered on top, with another thing we all understand very well, which is having a conversation like we are having today, except that it is a bot that is talking. It still feels the same as being on a work call with one of my colleagues who's taking notes while I say yes, no, or things like that.
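A minimal sketch of how such a voice-plus-cards interface might be wired, assuming a WebSocket channel between the voice backend and the browser. The message shapes (`TranscriptEvent`, `CardEvent`) and field names are invented for illustration; they are not Think41's actual protocol.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical event types for a voice + live-cards UI. The server pushes
# transcript lines and confirmation cards; the client answers cards with
# explicit clicks, so the agent never relies on speech recognition alone
# for high-intent decisions.

@dataclass
class TranscriptEvent:
    speaker: str                  # "user" or "agent"
    text: str
    type: str = "transcript"

@dataclass
class CardEvent:
    card_id: str
    title: str                    # e.g. "Google authentication"
    assumptions: list = field(default_factory=list)
    actions: tuple = ("agree", "maybe", "reject")
    type: str = "card"

def to_wire(event) -> str:
    """Serialize an event for the WebSocket."""
    return json.dumps(asdict(event))

# Example: the user asks for a "very secure application", so the agent
# surfaces its assumptions as a clickable card instead of just talking.
card = CardEvent(
    card_id="sec-001",
    title="Secure application",
    assumptions=["SSO via Google", "Audit logging", "Encryption at rest"],
)
print(to_wire(card))
```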

Demetrios:

I mean that's the dream, that's the Jarvis of our time, right? Where we can be talking about something and then it gets pulled up onto the interface and we can tweak it there in real time.

Anshuman Singh:

My hope is that most interfaces are like that in the future, because you're able to just talk to your computer; we've all seen Minority Report and whatnot. That is like the ultimate dream.

Demetrios:

The funny thing is, I think I saw a study, and somebody will have to fact-check me on this, so take it with a grain of salt, about how people don't actually like using their hands as much as we originally thought. In all the sci-fi, we have people interacting with the interface with their hands and talking to it, throwing things off screen or enlarging something with two hands. But when given the choice, folks would rather just use a mouse and click. And I find that funny, because it is another one of those moments where it looks good on paper, but in practice it's not quite the same.

Anshuman Singh:

And I think the mode and the situation also matter a lot. Those gestures work great on a Nintendo Wii, if you remember, or Microsoft's Xbox Kinect, and it still continues with Sony's PlayStation and whatnot. If you're in gaming mode, that is the natural interface. But imagine being in office mode, moving your hands around while somebody is speaking: what the hell is going on over there? So a lot of the social context, what exactly you're doing and where you are doing it, has an impact. In fact, that brings to mind that particular application we are designing. One very important and interesting problem to solve, and we see this in our office a lot, is that you see people talking to bots.

Anshuman Singh:

Think about multiple people sitting in a room, each sort of talking to a bot. How do we design those interfaces? Do you prompt people to use, let's say, earphones? What is the new social etiquette? If you're talking to a bot, do you take a meeting room and go talk to the bot there, or do you stay at your place while the person beside you is doing the same? In a lot of cases the applications we are building are for office users, and that is a user experience layer that still needs to be figured out. Are we designing for a noisy environment? That is another big challenge to solve. Or are we designing for, and prompting the user toward, treating it like meeting a colleague, which means you book a meeting room, go there, and talk to it?

Demetrios:

I can see in my own interactions with voice agents that I do not give them as much attention, because I know they are going to hang around; they're not going to hang up. I kind of take for granted that they're there, so I'll drift off and maybe do something else, or I'll talk to somebody else. We could say that I'm a little rude to the voice agent. Now, if we are expecting that when humans interact with voice agents they have to give their full attention, that seems like we're setting ourselves up for failure. Have you seen that type of expectation play out in practice?

Anshuman Singh:

I would say we are at a very early stage of the whole adoption curve on this; these etiquettes we still need to figure out. For us, I would just say we were wondering how far away the time is when we have professional, let's say, policies that you should treat a voice agent, one that is walking you through, say, office IT compliance, like your colleague: be very professional, and rude language to a bot will lead to HR coming to your desk and talking to you about it. Once these bots start doing official work, we'll have to start giving them some more respect.

Anshuman Singh:

Although I don't know if it will, or should, be at the level of respect that you give a human. So I think those are still problems to be solved and social contracts to be figured out.

Demetrios:

I could see a company being very strict about wasting time on a call with a bot. So if I'm doing something and I let the agent hang for a little bit while I'm talking in the background, I can imagine those are tokens getting transcribed. That's money being spent.

Anshuman Singh:

Yes.

Demetrios:

And at the current point in time, maybe that costs a lot of money, or a little bit of money multiplied at the gigantic scale of an enterprise. Or maybe by the time this becomes ubiquitous, the tokens for every five minutes of fat on a call with an agent are negligible. These are fun things to ponder: how will that look? What will the policies be for interacting with agents?

Anshuman Singh:

One of the things we have been trying, and I wouldn't say very successfully, is speech ID: identifying who is the person talking. The second is having some sort of, call it speech ID, as well as a background-noise detection filter on the client side. There are some decent models being launched lately where you can filter the noise on the client side. I would imagine that once those models get good enough, if there is crosstalk going on, hopefully there are no tokens being transferred to the server at all; it kind of mutes the call on behalf of the user, saving intelligently.
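As a rough illustration of that client-side gate, here is a sketch in Python. The `embed_voice` function below is a toy stand-in for a real speaker-embedding model (none is named in the conversation), and the similarity threshold is an arbitrary assumption; the idea is simply to drop frames that don't match the enrolled primary speaker before anything is sent to the server.

```python
import numpy as np

def embed_voice(frame: np.ndarray) -> np.ndarray:
    # Toy embedding: normalized magnitude spectrum. A real system would use
    # a trained speaker-embedding model here instead.
    spec = np.abs(np.fft.rfft(frame))
    return spec / (np.linalg.norm(spec) + 1e-9)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class ClientSideGate:
    """Forward audio only when it sounds like the enrolled user."""

    def __init__(self, enrollment_frame: np.ndarray, threshold: float = 0.75):
        # Enroll the primary speaker once, e.g. from a short intro phrase.
        self.reference = embed_voice(enrollment_frame)
        self.threshold = threshold  # arbitrary; would be tuned per model

    def should_send(self, frame: np.ndarray) -> bool:
        # Crosstalk and background speech score low against the enrolled
        # voice, so those frames are muted locally and never cost tokens.
        return cosine(embed_voice(frame), self.reference) >= self.threshold
```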

Anshuman Singh:

I believe this is the fun thing of being in tech: tech creates opportunities, opportunities create these unknown and new situations, and then hopefully tech comes in, solves some of them, and creates more opportunities. Even given the speed at which AI and voice and everything have been evolving, I think that whole cycle still has quite a few more rounds to go before we come to a more stable solution.

Demetrios:

Does the cleaning of the audio add to the latency?

Anshuman Singh:

One factor is latency and another is throughput. If you keep the amount of hardware constant, then at higher scale it adds to latency. But in our experience, what it really means is that you scale out the hardware, so it is not that much of a latency increase even when you do it on the server side. When you do it on the client side, the scale factor goes away, because whether the user is on their phone or their laptop, it is their edge computing power getting used, so your servers can potentially support a lot more conversations. I'll give you the example of the 20,000 applications we had to interview over the last six months or so. What we did was move VAD, the voice activity detection, over to the client side, which took us from somewhere around 600 to about 1,000 peak simultaneous conversations on a very cheap AWS box. For us, what mattered at the time was increasing our scale. There is this other interesting factor: somebody who is interviewing tolerates a different level of interactivity and latency than an angry customer on a support call. So for our use case it made sense to push it to the client side rather than increase server capacity. But for a customer support use case, I would still advise increasing the server capacity; make sure the customer who's already angry about something does not have one more thing to be angry about.
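For reference, a minimal sketch of client-side VAD gating, here using the open-source `webrtcvad` package; the transcript doesn't say which VAD Think41 used, so this is just one plausible choice. Only frames classified as speech are streamed upstream, which is what lets the same server handle far more simultaneous conversations.

```python
import webrtcvad

# webrtcvad expects 16-bit mono PCM at 8/16/32/48 kHz in 10/20/30 ms frames.
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # samples per frame * 2 bytes

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is a middle ground

def speech_frames(pcm: bytes):
    """Yield only the frames that contain speech. Silence and dead air
    never leave the client, so the server sees far less traffic."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame

# Usage sketch: read microphone audio, forward speech frames upstream.
# for frame in speech_frames(mic_buffer):
#     websocket.send(frame)
```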

Demetrios:

Well, this is interesting to note, because I hadn't thought about that before. Different use cases requiring different latencies makes complete sense; we have that in traditional software development all the time. I just don't know why it hadn't gone through my head for voice agents.

Anshuman Singh:

There are so many new things; that's the fun, again, of being at the forefront of this. There are new models coming up and we try them out; there are new use cases coming up. Those are the things that make me look back and say it was the right decision we took a year back to start this new thing without any baggage. The first few years of a startup are the most exciting time.

Anshuman Singh:

Just being at the forefront of gen AI and voice and all these new technologies, that just makes it even more fun.

Demetrios:

I'm very excited about what you're doing, and thanks for coming on here. Happy birthday to Think41; hopefully it's another decade or more.

Anshuman Singh:

Would love to. And would love to keep this conversation going. Hopefully we talk again on our next anniversary and have a lot more to share on all the things we have done.

Demetrios:

I would like that.

Hosted by

Demetrios Brinkmann

Host, AI Minds

Demetrios founded the largest community dealing with productionizing AI and ML models.
In April 2020, he fell into leading the MLOps community (more than 75k ML practitioners come together to learn and share experiences), which aims to bring clarity around the operational side of Machine Learning and AI. Since diving into the ML/AI world, he has become fascinated by Voice AI agents and is exploring the technical challenges that come with creating them.

Anshuman Singh

Guest
