The New Age of Voice Commerce - Mike Zagorsek, COO, SoundHound - Project Voice X
This is the transcript for “The New Age of Voice Commerce,” presented by Mike Zagorsek, COO at SoundHound, on day one of Project Voice X.
The transcript below has been modified by the Deepgram team for readability as a blog post, but the original Deepgram ASR-generated transcript was 94% accurate. Features like diarization, custom vocabulary (keyword boosting), redaction, punctuation, profanity filtering and numeral formatting are all available through Deepgram’s API. If you want to see if Deepgram is right for your use case, contact us.
[Mike Zagorsek:] Thank you, everyone. It’s great to be here. I’m Mike Zagorsek. I’m the Chief Operating Officer at SoundHound. Thank you to Bradley for hosting this and bringing us all together. It’s always wonderful to get together as a group. You know, the voice AI community, what we do is really powerful. It’s times like these that you really appreciate and recognize what a movement looks like. It always starts somewhere, and folks like Bradley are the ones who help us get started. We have a very dynamic topic. It’s very exciting. But all drama aside, it’s something that we’re really interested and excited to talk about, which is really around voice commerce monetization. We talk about voice assistants, but where is it all going? Where do we see an interesting culmination point? That’s what we’re here to talk about. A quick glimpse into our vision at SoundHound as a company. It’s a four-part vision. The first thing we wanna do is build a conversational voice AI platform that exceeds human capabilities. There were examples, actually, with the Alexa example, of how computers didn’t use to beat humans in chess. Now they do, because we know if we invest in certain platforms in AI, we can accomplish great things. When we talk to voice assistants today, we tend to simplify what we say, because we don’t believe they can exceed human capabilities. Our vision is to surpass that so that they can do more than we can. But a key to this is delivering value and delight to customers. So that’s gonna be one of our themes here. If you’re not delivering value and you’re not delighting your customers, you can’t really move forward. We wanna bring that together and bring an ecosystem of products that are all connected and talking to each other and really enabling innovation and monetization opportunities.
One of the things that we focus on is empowering other organizations to accomplish great things using as much of our platform as they can. So the two things to focus on as we talk about voice commerce are delivering value and creating monetization opportunities, ’cause that’s a benefit to the customer, it’s a benefit to the business, and that’ll be our theme moving forward. A quick glimpse into our technology. So we’re SoundHound Inc. We have our platform called Houndify, described as a one-stop independent voice AI developer platform, all the technology you need to add conversational intelligence. The key thing is you maintain control over your customers, data, and your brand. There’s a huge market out there for third-party voice assistants. We think there’s an equally big, if not bigger, market out there for brands to extend their product experiences through voice using a custom voice interface and using a variety of technologies including ours. Three key technology breakthroughs, just to talk about briefly. The first is speech to meaning. So for those of you who are familiar with us, this is something that we like to talk about. Typical voice experiences are a two-step process. There’s a transcription from speech to text, and then the natural language understanding deciphers the meaning. When we developed our technology over a ten-year process, we wanted to manage it the way the brain does. So when I speak, you’re not translating what I’m saying into text and then trying to understand the meaning. You’re doing it in real time. We do that as well. It helps with accuracy and speed. It’s more accurate because we’re already processing using NLU as the voice transcription comes in, and it’s fast because it’s one step instead of two. The other is deep meaning understanding.
So this is parsing multiple variables and complex and compound queries. When we speak, again, we don’t clip what we say into individual commands. We tend to say multi-command statements. And then lastly, Collective AI. This is a vision to create domains of knowledge talking to each other, controlled by developers, in order to enhance the world’s knowledge using a knowledge graph. A quick demonstration for those of you who haven’t heard it. I’m using our voice assistant Hound, which is available. And imagine, for example, if you are looking for restaurants and you just had Chinese last night. If you went to a hotel concierge, for example, and you asked the hotel concierge, show me restaurants except Chinese, they would answer it in that way. But, typically, in most assistants, if you say, show me restaurants except Chinese, you’ll actually get Chinese restaurants back. So the ability to handle exceptions and details like that is really key. I’ll give you a quick demonstration. Hopefully, it comes through the microphone using Hound. Show me restaurants except Chinese.
[SPEAKER 2:] Here are several restaurants excluding Chinese restaurants.
[Mike Zagorsek:] Ok. So we handle that pretty well. A quick demonstration of the multi-variable component. If you’re doing a restaurant search, you may wanna have a little bit more information. So you can say, show me restaurants in San Francisco, except Chinese and Japanese, that have at least three stars on Yelp, have a patio, and are open past nine PM on Wednesdays.
[SPEAKER 2:] Here are several restaurants with more than three stars in San Francisco that are open after nine PM on Wednesdays that have outdoor seating, excluding Chinese restaurants, Japanese restaurants, or sushi bars.
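Editorial illustration: the compound query in this demo can be pictured as a single structured filter with an explicit exclusion list. The sketch below is a minimal Python example under stated assumptions and is not SoundHound's implementation or the Houndify API. The `Restaurant` and `RestaurantQuery` classes, the `find_matches` helper, and all sample data are hypothetical, and closing times are simplified to a single hour per weekday.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Restaurant:
    """Hypothetical restaurant record used only for this illustration."""
    name: str
    cuisine: str
    city: str
    stars: float
    has_patio: bool
    closes_at: Dict[str, int]  # weekday name -> closing hour on a 24h clock


@dataclass
class RestaurantQuery:
    """One structured query holding every constraint from a compound request."""
    city: Optional[str] = None
    excluded_cuisines: Set[str] = field(default_factory=set)
    min_stars: float = 0.0
    needs_patio: bool = False
    open_past: Optional[Tuple[str, int]] = None  # (weekday, hour on a 24h clock)

    def matches(self, r: Restaurant) -> bool:
        if self.city is not None and r.city != self.city:
            return False
        # "except Chinese" is an exclusion, not a keyword to search for.
        if r.cuisine in self.excluded_cuisines:
            return False
        if r.stars < self.min_stars:
            return False
        if self.needs_patio and not r.has_patio:
            return False
        if self.open_past is not None:
            day, hour = self.open_past
            # Simplification: a missing day counts as closed.
            if r.closes_at.get(day, 0) <= hour:
                return False
        return True


def find_matches(query: RestaurantQuery, places: List[Restaurant]) -> List[Restaurant]:
    """Return the restaurants that satisfy every constraint in the query."""
    return [r for r in places if query.matches(r)]


# Made-up sample data.
RESTAURANTS = [
    Restaurant("Patio Grill", "American", "San Francisco", 4.0, True, {"Wednesday": 22}),
    Restaurant("Golden Wok", "Chinese", "San Francisco", 4.5, True, {"Wednesday": 23}),
    Restaurant("Sushi Corner", "Japanese", "San Francisco", 4.5, True, {"Wednesday": 23}),
    Restaurant("Early Bird Cafe", "American", "San Francisco", 4.0, True, {"Wednesday": 20}),
    Restaurant("Oakland Taqueria", "Mexican", "Oakland", 4.5, True, {"Wednesday": 23}),
]

# "Restaurants in San Francisco, except Chinese and Japanese, with at least
# three stars, a patio, and open past nine PM on Wednesdays."
QUERY = RestaurantQuery(
    city="San Francisco",
    excluded_cuisines={"Chinese", "Japanese"},
    min_stars=3.0,
    needs_patio=True,
    open_past=("Wednesday", 21),
)

if __name__ == "__main__":
    for restaurant in find_matches(QUERY, RESTAURANTS):
        print(restaurant.name)
```

The point of the sketch is only that "except Chinese" has to end up as a negative constraint in the parsed query. A system that treats "Chinese" as one more search keyword would return exactly the restaurants the user asked to avoid.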
[Mike Zagorsek:] So you can see it. I talk a little bit about customer experience. It shows me the results here visually. Talk about the visual and the voice component. It also speaks it back to reassure me. We can do follow-up queries and variations. We can say, does the third one have free Wi-Fi?
[SPEAKER 2:] I may be wrong, but according to my data, Octavia, the American restaurant located at seventeen o one Octavia Street in San Francisco does not provide free Wi-Fi.
[Mike Zagorsek:] That’s ok. I’ll still eat there anyway. It’s a good restaurant. So that’s just a quick demonstration showing some of our speech to meaning and deep meaning understanding in action and how it ties together. So what we focus on is empowering currently large global brands to extend their product experience into voice. So we’re in automotive. We’re in IoT, retail, restaurants, and multiple languages globally. So our primary focus has been to take our technology and scale it into these different brands. So you can experience different voice components, whether it’s the voice mode in Pandora, or, it’s not here, but we power the voice scan in Snapchat, or even now the new VIZIO voice mode capability in their TVs. And the reason this is important is, and this is why we’re here, the world is evolving into a voice-enabled reality. I mean, this is why we’re all excited about this business. It’s what binds us together. If you look at the history of it, computers, they offered, you know, keyboard and mouse and screen, and then we came to the mobile environment. So, you know, touch screen was the revolution, and, of course, now there’s this voice AI, which is really empowering IoT products. It’s hands-free. It’s ambient. It’s everywhere. It’s a new interface. It’s a new reality. And, of course, when computers became a reality, every company needed a website. I mean, it was just how it worked. Shortly after, and the time, obviously, is decreasing now with innovation.
Every company needed a mobile presence, whether it was an app or whether it was even a mobile-enabled website. And we believe, and I think we’re all here because we collectively believe, that every company needs a voice strategy and a path towards its own central voice AI. So this is the idea that every organization will have its own voice AI, powered by a variety of technologies, that is the extension of its own brand. However, we should point out that these interfaces are integrated and multimodal. Mobile didn’t replace computers. Voice is not gonna replace anything else. They’ll actually all work together, and we just get a better mix, better optimization between keyboards, mice, touchscreen visuals, voice only. The key is that it’s all empowered together. So why voice AI? I think it’s an important question. Because we can fall in love with the technology, and we can all come together and see the possibilities and imagine a future. And it’s really keeping it simple. It’s just about creating value. Right? We have to deliver something that’s valuable. And I think some of the earlier conversations were about, if you’re not creating technology that makes your life easier, better, then it’s just technology for technology’s sake. I like to say technology is made by people, for people. The technology is just the part in between. And for consumers, the list can be long, but it’s interaction from anywhere, safety of hands-free control, instant access. But it has to be value for business. You can’t just make it a one-way street.
So you’re extending product capability, increasing competitiveness, customer retention. There’s all these stats that show if voice interfaces are valuable to customers, they become more loyal to businesses. And there’s really different layers of value here. There’s three, and I’ll talk about the first two. One is what we call just the core voice experience. This is really about voice enabling your product and service. You have to do that well. If there’s only one thing that you do well, make sure that whatever it is that you’re offering is voice enabled. Because if you mess that up, people aren’t gonna give you a lot of permission to do anything else. So this is a TV example. You can say something like, go back thirty seconds, or add this show to my favorites. It’s basic, but it has to work and it has to be clear. The next level is what we call expanded. So this is connecting to the world’s information. So you could say, what will the temperature be at noon? Who won the baseball game last night? And this is really bringing the outside in. And this is really where people start to think about voice assistants, because they can answer a lot of questions from the outside or IoT. And these two things alone have already generated a lot of business. I mean, companies are getting on board. You have sixty seven percent of companies who have adopted some kind of voice assistant technology, and those that have have extended it to their mobile app. So once they’re in, they double down on it. Even adding to websites, it’s still early, but they’re seeing some meaningful percentages there. And then, of course, even just embedding into various products. So companies are getting on board. We know that the excitement is there. The market opportunity is massive. There’s different ways you can slice it.
This is a hundred and sixty billion dollar total addressable market scenario. Some stats here: ninety percent of new vehicles globally are projected to have voice assistants. Many of them already do. Many of them are our customers.
Seventy five billion connected devices worldwide by twenty twenty five. That’s significant. Seventy five billion connected devices. Most of those won’t have a screen or an interface. So if you think about it, you have an IoT device that is limited; it has to connect to a phone or connect to another device. Simply with a microphone and speaker, which are very inexpensive, you can empower those IoT devices, and there’s seventy five billion of them, in ways that you couldn’t before with other interfaces. Eight billion voice assistant devices. You know, some go for a doomsday kind of prediction here, but, you know, that’s expected to surpass the world population. I think we’re far away from anything that’s negative there. And then the vast majority of large companies are already working on voice AI strategies. So back to this. We go to these layers of value and say, ok. Great. There’s a market here. There’s a business. We’re already seeing it. But if you look back at that hundred and sixty billion dollar total addressable market, the TAM, well, where is that gonna come from? How is it going to happen? If you’re in a position to do these two things well, you can then expand into a monetized environment.
So if you’re watching your TV and you’re already using the voice experience and it works well and you’re connected to the outside world, why wouldn’t you ask, I’d like to order some pizza for delivery, or maybe, I need to order a soundbar for my bedroom? So this is the progression towards monetization. This is the path that customers are taking because you established the core value there. By the way, I’m talking about voice AI here, but this can work in any other form.
I mean, you can have a chatbot that’s entirely text based. And if you’re delivering a core service and it might be through text, your ability to do it well opens the door to these opportunities to start monetizing. And the way it breaks out is another simplification. It goes from commands to queries to transactions. Right? So I’m telling my product to do something. I might ask it a question. If you combine commands and queries, that allows you to transact. Obviously, there’s more that needs to happen underneath it, but that’s the gateway and the door moving forward. And we conducted a study. We partnered with Opus Research to ask these questions. And we weren’t surprised, but it was validating to see how many businesses are really getting on board with monetization there. We asked them, is it more important, less important, or not important at all? The vast majority across different industries said, this is really important to me, because the idea of monetizing, it’s a new form of revenue, and the theme starts to be, how do I turn something that’s a cost into a revenue stream in a way that empowers customers? Customers themselves are getting on board already. If you’re a customer of a voice-enabled device, forty three percent have used it to shop in some format, and just the shopping revenue alone that’s voice related is expected to be somewhere around forty billion dollars by next year. And it’s not only businesses and customers that agree.
The experts agree too. One quote is from Bradley Metrock. You might know him. Successful voice assistants create a platform from which companies can upsell. Conversational AI decreases transactional friction. More space opens up to increase revenue. That’s a really important point, transactional friction. I can do less. I just ask for things, and if it works, I’m gonna do it over and over and over again. Another is from a great contributor to the community, Dr. Bajorek. And this really segues into the next point. Voice technology is about integration rather than replacement. That speaks to the additive component. Meet users where they’re at. Try one use case with them and learn from them how they use the technology. The ROI for voice is about looking at user data and crafting a high-quality user experience based on how users are interacting with your product. So a lot of the presentations you’ve already seen have been talking about listening, observing, and better understanding, but the key thing here is adding value. Just a few quick examples, and I won’t go into this too much. Automotive: you can think of food ordering, booking service appointments, parking, filling the gas tank, finding a charging station. We already know that more than two x adults have used voice in the car, even more so than using the voice experience on a smart speaker. TVs and speakers: delivery, booking a car, the list goes on. You know, ten percent more people in twenty twenty are shopping from IoT devices than they did even just a few years ago. Smart home and appliances: product refills, maintenance history requests, just the simple idea that you can have an appliance, and you can transact through it in a way that keeps that customer relationship and starts to generate revenue. And, of course, travel and hospitality. We’re all here traveling.
Wouldn’t it be nice to be able to do more with your voice specifically? Massive growth is expected there. But the question is, how do we get there? And this could be a full-day seminar in and of itself. So I’ll focus on one specific point. It’s always about the customer experience. If you deliver something that works, that is valuable to customers, they will continue to use it. And I kinda stole this from my marketing days, because it is actually a marketing challenge. Because what does good marketing do? It’s really about delivering the right message to the right customer at the right time. So it’s really understanding when they need it, proactively or reactively. But what’s the approach? You have to be where your customers are, and they have to be able to find your high-quality content when they want it and whenever and however they prefer to consume it. So if you push it on them… I mean, you think about what advertising is. People don’t like advertising because it doesn’t solve a problem. It’s just pushing a message, and it’s hopefully trying to get you to buy something. Maybe you want it, maybe you don’t. But a really powerful suggestion at the right time could be an ad. It doesn’t feel like an ad. It’s solving a problem. So it’s solving a problem for the business, because they’re getting in front of customers that they want. It’s solving a problem for the customer, ’cause they’re getting information they need. It stops feeling like marketing. It stops feeling like advertising. It starts feeling like an informed conversation, and that’s really where things are headed.
Another quote from our good friend across the pond. Is it working? He talks about how, whether you’re in the process of implementing your strategy or just getting started, if you think about the pain points customers have throughout the customer journey, the kind of conversations you can have that’ll help, and the number of touch points that could be voice enabled, for many brands, even a minor percentage increase in sales will be more than enough to justify the investment. The business case will write itself. So you’ve got all these interactions. Millions and billions of interactions happening through voice. You’re learning about each other. Just simply making the right suggestion at the right time really will move the needle significantly, without even getting into the numbers. And really, what we’re talking about here is creating what we call monetizable moments. It’s not saying, hey, we have a platform, and we wanna jam it into people’s lives. It’s, what are those moments where you can add value to the customer, where a transaction can take place? That’s where we really believe things are headed, in a conversational manner, and monetization will actually take place. So, you know, we live in a world surrounded by the power of voice. People are asking questions, getting their needs met, and simple suggestions make their lives easier and more convenient. We have a quick two-minute video to share, something we put out a few months ago. I have not tested the audio. So let’s hope it plays, and if not, I’ll fiddle with it. Two minutes. And this hopefully will bring together some of the things I’ve just shared with you.
[SPEAKER 3:] We’re living in a world surrounded by the power of voice. With Houndify, your voice assistant now delivers real revenue and unlimited possibilities for your users.
[SPEAKER 4:] Start navigate —
[Mike Zagorsek:] Alright. Let me start over.
[SPEAKER 3:] We’re living in a world surrounded by the power of voice. With Houndify, your voice assistant now delivers real revenue and unlimited possibilities for your users.
[SPEAKER 4:] Start navigation for the fastest route home.
[SPEAKER 5:] Sure, David. Starting route guide. You’ll arrive home at six thirty. Would you like to order dinner for delivery?
[SPEAKER 4:] Sure. What about pizza?
[SPEAKER 5:] Best Pizza has a promo. Any large specialty pizza for twelve ninety nine. Would you like to place an order?
[SPEAKER 4:] What’s the most popular item?
[SPEAKER 5:] Their top selling item is the combination with red sauce.
[SPEAKER 4:] I’ll have one of those, large.
[SPEAKER 5:] Ok. Anything else?
[SPEAKER 4:] No. Thanks.
[SPEAKER 5:] Great. I’ve submitted your order to Best Pizza, and it should arrive by six forty PM.
[SPEAKER 3:] Proactive purchasing options unlock the revenue potential of IoT products so your users get what they need in real time.
[SPEAKER 6:] What’s the next step?
[SPEAKER 7:] Mix in two cups of milk and three tablespoons of butter. Looks like you’re running low on butter. Should I add it to your grocery list?
[SPEAKER 6:] Yes, please.
[SPEAKER 7:] Great. Your list currently has ten items. Would you like to place a delivery order with food shoppers?
[SPEAKER 6:] Sure.
[SPEAKER 7:] Please review your shopping list and confirm your order.
[SPEAKER 6:] Order confirmed.
[SPEAKER 7:] Great. I’ve placed your order with food shoppers, and your groceries should arrive at five PM.
[SPEAKER 3:] With Houndify voice commerce, your interactive assistant can turn simple conversations into revenue generating opportunities.
[SPEAKER 8:] Will I need an umbrella in Manhattan next week?
[SPEAKER 9:] There’s no rain in the forecast next week in Manhattan. Do you need help with flight or hotel accommodations for New York?
[SPEAKER 8:] Sure.
[SPEAKER 10:] Please show me round trip flight options from SFO to JFK departing next Monday morning and returning Thursday evening.
[SPEAKER 9:] Premier Air has the lowest price and two options for your flight itinerary. Would you like me to book one or see other airlines?
[SPEAKER 10:] Please book me one seat for option number two.
[SPEAKER 9:] I booked your seat with Premier Air and sent a confirmation to your email.
[SPEAKER 3:] Welcome to Houndify monetization, where voice transactions bring revenue to your business. Unleash your earning potential and open up a world of purchasing possibilities with a custom voice assistant that boosts your bottom line. Powered by Houndify.
[Mike Zagorsek:] So two key themes in there. There were suggestions that were made very easily and seamlessly. You could refer to those as voice ads. They don’t feel like voice ads. Transactions take place. There you have voice commerce. Of course, that was a very, very simplified video trying to convey the idea. There’s many layers underneath it. But the seamless moments are the ones where you’re adding value, companies are getting value, and, ultimately, everybody wins. So, you know, we believe that monetization is the future of AI. I mean, it allows companies to optimize their technology investments. There’s revenue share to be made between the product creators and the service providers. It delivers on the promise of an engaging voice experience. It’s adding more value if done correctly and putting the customer first. You know, unintrusive conversational suggestions, they become purchasing opportunities, again, if done correctly with the right data. New revenue streams increase earning potential and, ultimately, we think that voice AI should generate revenue, not cost. So companies should be asking themselves, what is my revenue earnings potential by moving forward in this technology, versus saying, how much is it ultimately going to cost me? So with that, thank you very much. Appreciate it.
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.