This is the transcript for “Voice in Healthcare,” presented by Henry O’Connell, CEO at Canary Speech, on day one of Project Voice X.
The transcript below has been modified by the Deepgram team for readability as a blog post, but the original Deepgram ASR-generated transcript was 94% accurate. Features like diarization, custom vocabulary (keyword boosting), redaction, punctuation, profanity filtering and numeral formatting are all available through Deepgram’s API. If you want to see if Deepgram is right for your use case, contact us.
[Henry O’Connell:] First off, I’m grateful to be here. I hope I have a chance to… can you… you can hear me, right? OK. I hope I have a chance to speak with many of you. Canary Speech was started a little over five years ago. Historically, the cofounder, Jeff Adams, and I met thirty-eight years ago. I was at the National Institutes of Health doing research in neurological disease, and Jeff was at an institution we’re not supposed to name, building models to decode spy messages coming across during the Cold War. Jeff’s career was much more interesting, so I’m gonna tell you about it. Jeff went on after that government job to work with Ray Kurzweil. At the time, natural language processing did not exist.
Ray was interested in building it. And Jeff built the first commercial NLP, and then built the basis of, and commercialized, Dragon NaturallySpeaking. He worked with Nuance for a number of years, building some of their core products, and then went on to build the core products that we use every day on our cell phones for speech to text. At the time, about nine or ten years ago, Amazon was interested in building two products. One never saw the light of day, and the other one is probably the most successful speech product ever launched. It’s the Amazon Alexa product. They simply bought the company Jeff was in. Jeff Adams, seventeen speech and language scientists, and Jeff O’Neil, the patent lawyer, went on to form the core that built far-field speech for the Alexa product. Approximately a thousand patents were prosecuted, and the team grew to about a hundred and fifty individuals in speech and language. Shortly after that, about a year after that, Jeff and I got together to create Canary Speech specifically to build the next generation of technology that could be commercialized in the health care space.
So our goal was simple. We really wanted to advance speech and language in such a way that it could be commercialized in a practical sense in the health care market, providing actionable information to clinical teams in the process of doing a diagnosis. In the time since Jeff and I founded it, we’ve been issued eight patents. We have six pending patents. We have patents in Japan, in Europe, and here. And believe me, there are thousands of speech and language patents. On our core patents in all three jurisdictions, we received a hundred percent of our claims. We believed the previous approach would never be commercialized in this space, not in the clinical space, providing real-time actionable information that was accurate enough to contribute to diagnosis. We’re operating today in over a dozen hospitals, including in Mandarin and Japanese, and we’re working in England, in Ireland, and in the United States in English. We have models built for depression, anxiety, stress, tiredness, and a range of cognitive functions, from mild cognitive impairment to Alzheimer’s. Our studies in Alzheimer’s and cognitive function are on all major continents today. We’re functioning on cognitive in China, Japan, and Europe. Hold the mic higher? You mean nothing I’ve said has been heard?
[SPEAKER 2:] No.
[Henry O’Connell:] Could you have waved at me earlier? Thank you. I apologize. I’ve never ever been told I talk too softly. This year, we added several key people to our team. We added a Chief Technology Officer; for the last ten years, he was in the senior technology position at Nuance. We brought on Gavin, who’s in our Dublin office now. Gavin for the last ten years was with Amazon. Caitlin came on about two years ago and has a dozen years of experience in health care. Our scientific team is run by Namhee and by Samuel. And then we just recently brought in David; today, actually, was his first day.
We’re currently processing millions of datasets a month. Our rollout in Japan included ten million individuals. We’re analyzing those individuals for both stress and cognitive function in an annual health care call, backing up and replacing the MMSE, the GAD-7, the [unsure:], and several other tests that took twenty-five minutes to do. And we’re running those tests in less than two minutes from the same sample and measuring that entire range of tests.
What we wanted to do with the technology was to streamline the information that contributes to patient diagnosis. Diagnoses are not done by tests. Diagnoses are done by clinical teams after they have considered the information they’re provided. We’re part of that information stream. We wanted to reduce readmissions in hospitals. Well over half of the applications we have are post-discharge applications, both for hospitals and in telemedicine. We wanted to provide immediate actionable information that could be used to assess an individual’s health condition and provide clinical guidance for their welfare. From the point at which we capture audio to the time at which it returns to the device is less than two seconds. We’re processing more than twelve million data points a minute. We do that in near real time through our system and return that information to the device.
So if an individual is in a telemedicine call, they’re getting guidance from us while they’re in the call. We do not identify at the word level. Everything we’re doing is at the subword level. We take a model that we’ve built for depression in English. It has a performance level of better than eighty percent in Japanese. We then tweak it, qualify it, retrain it, and deploy it in Japanese. We did the same thing from there to Mandarin Chinese. This is just a diagram of kind of the flow. So today, we’re commercialized in English, Japanese, and Mandarin. We’ll commercialize later this year in Spanish. Our models are language agnostic, so our commercialization in various languages really has to do with deploying our tools in the language of the user. We generally are B2B. The B2C, or B-to-client or B-to-partner, engagements that we have really are for validating new models. So when we’re working with a large hospital partner, we’re validating and peer reviewing the models that we have, and also building those into the process and flow that makes them useful for the clinical teams. If you go on Apple or Android, you’ll find the Canary Speech research app. It’s not deployed to the public. We only deploy it within a partnership.
There will, over the next couple of months, be a couple more apps. We can, in real time, measure stress, anxiety, depression, kind of… we don’t call it mood because we’re actually measuring stress and anxiety. But we also have a new model for energy. So imagine: I went to our scientists and asked them if they knew the cartoon series Winnie the Pooh, and they told me no. They were born and raised in Korea. And I said, let me send you some videos. So I sent them videos, and we got back together the next day. And I said, the Winnie the Pooh series has been so popular because it represents such a wide range of human emotions. You could think of it, you know, between Eeyore and Tigger and Piglet, all the different emotions that it demonstrates. I said, when you’re talking to me, you’re aware that I’m excited about what I’m doing. I don’t say to you I’m excited about what we’re doing. There’s a musical element of speech that’s lost when you do ASR, one that conveys a whole range of human emotions and conditions. On the primary dataset, you apply these tools, and ASR is wonderful, and then you get a bunch of words at the top. Those words represent the spoken language, but not necessarily the energy or the emotion that was in it when the individual spoke it.
What we’re doing is mining the primary data layer for all of those elements and many other things. But imagine you… many of you have families. My family is five children, my wife and I, and too many animals to speak about, and I really, truly mean that. We live on a large acreage. We have horses and goats and chickens and things that I don’t wanna mention beyond that. But my five kids are old enough now. They’re married, and they’re out of the home. When they were in the house, and they would come home from school after practice, whatever the sport was, their gait across the room would indicate to me much about how the day went. When they turned and looked at me, I could gather from their facial expressions whether it was a good day or a bad day. All of us do this, and all of us have had this done for us. They’re out of the home now. They’re all married. They’re having their own families, young children, my wife’s and my grandchildren.
When I call my daughter, I’ll say to her, “What’s up?” Within moments, and I could promise you it’s irritatingly accurate, I know whether her day is a good day or a bad day or she’s anxious or she’s sad or she’s nervous or she’s angry. And it has nothing to do with the words she has spoken. And we know that. Doctors know that. Their whole lives center around doing that. The elements that tell us that are also part of the most complex motor function the human body produces, called speech. I believe that speech is more complex than all other data produced by the human body, except the genome itself. The twelve million data points that we analyze every minute are deep in information. And we’re only scratching the surface of that.
Let me give you some practical examples, and I’ll finish up. I don’t wanna talk about any of this. So Hackensack Meridian is one of the hospitals we’re working with. We currently have five ongoing validations with them. I just wanna talk about one of them because it’s not one that I would have expected to be on the top of the list: congestive heart failure. So Hackensack has between thirty-five and forty thousand congestive heart failure patients annually. They have about a twenty-one percent return. About sixty percent of those don’t make it. They lost nineteen point two million dollars last year in insurance payables because of the twenty-one percent return.
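Editorial aside, not part of the talk: the figures quoted above can be turned into rough readmission counts. The sketch below uses only the numbers from the talk (thirty-five to forty thousand patients a year, a twenty-one percent return). The fifteen and ten rows correspond to the targets mentioned later in the talk, read here as percentages, which is an assumption. The function name is hypothetical.

```python
def readmissions(patients: int, return_rate_pct: int) -> int:
    """Number of returning patients for an annual patient count and a return rate in percent."""
    return patients * return_rate_pct // 100


# Low and high ends of the annual patient range quoted in the talk.
for patients in (35_000, 40_000):
    # 21 is the quoted return rate; 15 and 10 are the later targets (assumed to be percentages).
    for rate in (21, 15, 10):
        print(f"{patients} patients at {rate}%: {readmissions(patients, rate)} returns")
```

At the quoted rate, this works out to between 7,350 and 8,400 returning patients a year.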
When I talked with Elliot Frank, who is in charge of the organization, what he said to me was, “Henry, there’s nothing good about this. It’s not good for the hospital. It’s not good for our patients. It’s not good for our staff. There’s nothing good about it.” What they do is, post-discharge, they’re calling these patients twice a week for the first month, and then once a week thereafter. They ask them twelve questions. We took the twelve questions and we integrated them into our app. Our app is not just HIPAA compliant, which is important, but it is also externally audited for vulnerability, penetration, CIS, and SOC 2 Type 1. And this week, it’s being submitted to the FDA, after six months of discussions with them, for 510(k) clearance. What we did was put the twelve questions into the app. The telephone call can be made from a tablet by a clinical partner, a nurse, or a similar individual to, let’s say, Mary at home. Mary answers the phone, and she’s now a component of a clinical call, a HIPAA-compliant environment. In addition to the twelve questions, during the conversation, we provide information on stress, anxiety, depression, tiredness, pulmonary sounds, and changes in vocal patterns, all of that in real time in the background. Changes in vocal patterns are an indicator of coronary heart disease three days in the future. They’re frequently observed. I promise you that clinical team is listening for that.
But a nurse might be making fifty calls a month to fifty different people. It’s hard for her, or anyone, to determine if there have been vocal pattern changes from call one to call two to call three. Our models are perfect at it. They’re not even good. They’re perfect at it. And we can give her real-time warnings that vocal patterns in Mary’s language have changed. In addition to that, we’re listening to pulmonary sounds. Changes in pulmonary sounds can be indicative of a pulmonary tract infection like pneumonia. So in real time, we’re not only telling her that the sounds are different. We’re telling her how different they are from the first call. Because maybe Mary has other problems. Maybe she smoked for fifty years. I don’t know. But we’re looking at deltas from first to second to third call, and we’re doing that in real time with them. The goal is to reduce return admissions initially to fifteen and then to ten and then to drop it from there.
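Editorial aside, not part of the talk: the call-to-call comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Canary Speech’s model. The feature names (`pitch_hz`, `breath_noise`), the values, and the 10% threshold are all hypothetical. The sketch only shows the idea of measuring each call against the patient’s own first call rather than against a population norm.

```python
from typing import Dict, List

Features = Dict[str, float]


def deltas_from_baseline(calls: List[Features]) -> List[Features]:
    """Relative change of each feature versus the first call (the baseline).

    Assumes every baseline value is nonzero.
    """
    baseline = calls[0]
    return [
        {name: (call[name] - baseline[name]) / baseline[name] for name in baseline}
        for call in calls[1:]
    ]


def flag_changes(delta: Features, threshold: float = 0.10) -> List[str]:
    """Sorted names of features whose relative change exceeds the threshold."""
    return sorted(name for name, change in delta.items() if abs(change) > threshold)


# Hypothetical per-call summaries for one patient.
calls = [
    {"pitch_hz": 200.0, "breath_noise": 0.20},  # call 1: the baseline
    {"pitch_hz": 204.0, "breath_noise": 0.21},  # call 2: small drift
    {"pitch_hz": 170.0, "breath_noise": 0.30},  # call 3: larger shift
]

for number, delta in enumerate(deltas_from_baseline(calls), start=2):
    print(f"call {number}: flagged {flag_changes(delta)}")
```

Comparing against the first call means a patient whose voice was already unusual at discharge is flagged only when it changes, which matches the point about Mary possibly having other problems.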
So in very many real ways, voice is being applied to augment the dataset that our clinical teams have to serve the patient population, to improve the quality of life and quality of care they receive. I wanna finish up here. We’re deploying, as you can imagine, in telemedicine as well for a whole range of different types of applications. Our belief has been that in order to make this a practical solution for the health care environment, it had to be specific. It had to be accurate. It had to be fast, real time. And it had to provide actionable information that could make a difference in the treatment of patients. The best way to do that was to partner with literally dozens of health care facilities around the world. We’re fortunate that when we talk with groups like this, voice is understood to be a valuable dataset that they can implement in their day-to-day treatment of patients. I wanna thank you for your time. I’m certainly here through Wednesday. If anybody would like to talk, I… and I’m terrible at it. If you approach me, I’ll probably keep you longer than you wanna stay. Just wink or something, and I’ll shut up and let you go. So thank you very much, and please take care. Thank you very much.