From Clicks to Code: Using Deep Learning and Bioacoustics to Study Orca Whales
Orcas—often called killer whales—are incredibly intelligent. Thanks to sophisticated communication and collaboration skills, orcas can launch formidable, often creative attacks on their prey. Historically, killer whales have even forged mutually beneficial relationships with humans: first with Indigenous whalers, and later with the few Australian whalers willing to learn from Aboriginal practice. Breaching the water and smacking their tails nearby, some orca pods (killer whale “tribes”) would alert whalers that they had corralled humpback whales into a nearby cove. Understanding this cue, human whalers would then harpoon the humpbacks (saving the orcas significant effort), reward the orcas with the dead whale’s tongue, and then harvest the valuable whale blubber.
But despite their impressive intelligence and even their ability to work with humans, orcas are not immune to humans’ negative influences on the oceans. Increased noise and chemical pollution and decreased fish stocks (particularly Chinook salmon) have endangered and depleted some orca populations. Marine biologists are tirelessly working to conserve these threatened orca pods, but face a key challenge: orcas only spend around five percent of their time near the surface, where scientists can observe them. How then can we possibly expect to learn much about killer whales, let alone help save them?
Bioacoustics and Machine Learning to the Rescue
Thankfully, the field of bioacoustics (the study of sounds that living organisms emit), coupled with machine learning (ML), is increasingly illuminating the portion of orcas’ lives spent below the surface. As hydrophones (underwater microphones) have grown cheaper, smaller, and more durable, orca researchers face a wholly welcome problem—more orca audio recordings than they can manually sift through.
Any given raw ocean sound recording is bound to be chock-full of environmental noise that orca researchers aren’t particularly interested in. When you share a brew at a cafe, you (mostly) tune out the screaming espresso machines, chattering voices, grinding beans, and clanking dishes to listen to the person sitting across from you. Similarly, when you have multiple years’ worth of hydrophone recordings, you need to filter out tons of extraneous ocean noise (waves, ships, other marine mammals, etc.) to zero in on the sparse orca call segments, and you need to do this at scale. This is where ML proves invaluable. Marine biologists are tackling the massive amounts of ocean recordings pouring in with Deep Neural Networks (DNNs) similar to those used in automated speech-to-text models.
Listening into the Abyss with Machines
How do you parse out orca calls without literally listening to ocean white noise for hours on end? ORCA-SPOT, developed in 2019, is a convolutional neural network (CNN) model that can automatically find a needle in a haystack, delineating relatively rare blips of orca vocalizations from reels of hydrophone recordings. ORCA-SPOT takes a mere 8 days to parse out orca clicks, whistles, and pulsed calls from slightly over 2 years’ worth of hydrophone recordings, boasting a 93.2% accuracy rate. What’s more, ORCA-SPOT can be easily modified for use with other species. How does ORCA-SPOT accomplish this?
First, because large portions of ocean recordings are nearly silent, researchers slightly tweaked the ocean audio recordings. Normalizing decibels (boosting soft sounds and softening loud sounds) and interspersing additional noise files into long quiet patches in the recordings helped ORCA-SPOT generalize to a wider variety of underwater sound environments. In a starkly silent library, a slight shuffle of feet distracts you; amid a coffee shop’s cacophony of background noise, the same shuffle doesn’t faze you. Why? Your mental model of each environment differs. In a library, you’re not used to noise; in a cafe, you are, so minor sounds are less likely to vie for your attention. Keeping ORCA-SPOT from tripping over the tiniest of non-orca sounds required augmenting the hydrophone recordings to become more like a cafe than a library.
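The two tweaks described above—gain normalization and interspersing noise into quiet stretches—can be sketched in a few lines of numpy. This is an illustrative toy, not ORCA-SPOT’s actual preprocessing code, and the function names are made up for this example:

```python
import numpy as np

def normalize_gain(signal, target_peak=0.9):
    """Scale a waveform so its loudest sample hits target_peak,
    boosting quiet clips and taming loud ones."""
    peak = np.max(np.abs(signal))
    if peak == 0:
        return signal  # a truly silent clip: nothing to scale
    return signal * (target_peak / peak)

def mix_in_noise(signal, noise, noise_level=0.3):
    """Blend a noise clip into a (near-)silent recording so a model
    trains on varied backgrounds instead of pure silence."""
    noise = np.resize(noise, signal.shape)  # tile/crop noise to match length
    return signal + noise_level * noise

# Toy example: a quiet sine "recording" plus random ocean-like noise
rng = np.random.default_rng(0)
quiet = 0.05 * np.sin(np.linspace(0, 2 * np.pi * 5, 16000))
noise = rng.normal(0, 0.1, 16000)

augmented = mix_in_noise(normalize_gain(quiet), noise)
```

After normalization, the once-faint sine wave peaks near 0.9 rather than 0.05, and the added noise keeps the model from treating tiny sounds as alarms.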
Next, ORCA-SPOT converted ocean audio recordings—annotated by experts as orca sounds or not—into spectrograms, two-dimensional representations of sound with time on the x-axis and frequency on the y-axis. Spectrograms helped on two fronts:
Visually analyzing spectrograms is often easier than relying solely on your ears, which assisted the biologists in verifying ORCA-SPOT’s outputs.
Spectrograms lend themselves well to computer vision algorithms, which opened up a wider array of potential algorithms for ORCA-SPOT’s computer scientists.
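As a rough illustration of the conversion step, here is a minimal sketch using scipy’s off-the-shelf spectrogram routine on a synthetic upward-sweeping tone (ORCA-SPOT’s exact windowing parameters will differ):

```python
import numpy as np
from scipy.signal import spectrogram

# Build a toy 1-second "whistle": a tone sweeping upward in frequency,
# standing in for a hydrophone clip (real inputs would be loaded audio).
fs = 16000                      # sample rate in Hz
t = np.linspace(0, 1, fs, endpoint=False)
sweep = np.sin(2 * np.pi * (500 + 2000 * t) * t)

# spectrogram() returns frequencies (y-axis), times (x-axis), and the
# 2-D power matrix Sxx, which a CNN can treat just like an image.
freqs, times, Sxx = spectrogram(sweep, fs=fs, nperseg=256)
print(Sxx.shape)  # (frequency bins, time frames)
```

The resulting `Sxx` matrix is exactly the kind of time-frequency “image” on which an upward whistle appears as a rising ridge.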
ORCA-SPOT then fed these spectrograms (representing the original audio clips) into a CNN architecture. CNNs, often used in image processing, are neural networks adept at feature detection. A CNN hidden layer acts like a sliding window, producing an “activation map” as it scours the entire spectrogram for a specific feature. Earlier CNN hidden layers’ activation maps tend to detect simple features like edges or curves; deeper layers find increasingly complex features, like the shape of a killer whale whistle on a spectrogram. ORCA-SPOT’s CNN specifically resembles a Residual Network (ResNet), a 2016-vintage computer vision architecture whose skip connections enable significantly deeper networks than its predecessors without their training difficulties. ORCA-SPOT settled on a medium-depth ResNet (18 layers) because it offered a nice balance between accuracy and fast training and inference times. ORCA-SPOT’s output layer then used a fifty-percent confidence threshold to classify spectrograms as orca sounds or non-orca sounds (i.e., noise).
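To make the “sliding window” idea concrete, here is a toy, numpy-only activation map: a hand-made 3x3 filter swept across a miniature spectrogram to find a rising ridge. A real CNN learns thousands of such filters from data; this hand-coded version only illustrates the mechanic, not ORCA-SPOT’s ResNet:

```python
import numpy as np

def activation_map(spec, kernel):
    """Slide a small filter across a 2-D spectrogram and record how
    strongly each patch matches it -- the CNN 'sliding window'."""
    kh, kw = kernel.shape
    h, w = spec.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = spec[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # patch/filter similarity
    return out

# Toy spectrogram: a diagonal ridge, like an upward whistle sweep
spec = np.zeros((6, 6))
for k in range(6):
    spec[k, k] = 1.0

# A hand-made 3x3 "diagonal ridge" detector (a trained CNN learns these)
kernel = np.eye(3)

amap = activation_map(spec, kernel)
print(amap.max())  # prints 3.0 -- strongest response where ridge and filter align
```

Every position where the diagonal ridge lines up with the filter lights up in the activation map; everywhere else stays near zero. Stacking many such maps, layer after layer, is what lets deeper layers respond to whole whistle shapes.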
Image source: ORCA-SPOT’s architecture
Finally, to evaluate ORCA-SPOT, marine biologists manually verified the spectrograms and original audio segments that the model labeled as orca sounds, finding that only 2.34 of 34.47 hours of extracted orca audio clips were misclassified—an error rate of about 6.8 percent.
How Does ORCA-SPOT Actually Help Killer Whales?
Beyond helping researchers better understand orcas’ communication and behavior, animal sound segmentation models similar to ORCA-SPOT are preventing ship-orca collisions. Google AI, for example, developed a real-time orca detection model that alerts Canada’s Department of Fisheries and Oceans (DFO) of orca activity. DFO personnel then reroute ships around the orca pod. We might eventually develop algorithms that localize marine mammal signals, allowing ships equipped with such algorithms and acoustic sensors to detect and avoid orcas. Governments could also utilize such orca-localization technology to monitor commercial ships’ regulatory compliance. A downside, however, is that poachers might also harness whale-localizing ML models.
AI models like ORCA-SPOT are also helping orcas by enlisting citizen scientists. Though ML models like ORCA-SPOT save researchers serious grunt work (listening to hours of ambient ocean noise to find orca calls), if we want to improve these models, we’ll need additional accurately labeled data. To assist researchers with this process, Orcasound created an app where we can all listen in on live hydrophones. You can click a button if you hear something interesting, and experts will later check the sounds you marked, verifying the species. By crowdsourcing ocean noise labeling, researchers hope to lighten their load and create new avenues for public wildlife conservation. Orcasound will even email you when orcas are active near a hydrophone, so you can enjoy live orca chatter from the comfort of your couch.
Roger Payne’s 1970 album, "Songs of the Humpback Whale," spurred a previously apathetic public toward supporting a worldwide moratorium on commercial whaling simply by letting whales speak for themselves. Some folks today have high hopes of replicating Payne’s vinyl success with silicon: by decoding orcas’ and other animals’ “languages” via ML, we might spark a similar wave of empathy for animals, renewing enthusiasm for further conservation efforts. For example, Aza Raskin, a dark matter physicist and president of the Earth Species Project, plans to eventually hack together methods of interspecies communication. Such capabilities might re-forge a symbiotic relationship between humans and killer whales and help us better relate to many more species.
Admittedly, this seems a bit out there, but interspecies communication enthusiasts have some ground to stand on. We already have Natural Language Processing (NLP) models that might be re-geared for decoding orcas’ (and other animals’) vocalizations: specifically, unsupervised machine translation models designed for “low-resource” languages (those, like Cherokee, with sparse translated data to train on) and “zero-resource” languages (those with no translation data). These models often transform languages into their respective “latent spaces.” Think of a latent space as a giant vector space, mapping the distances between every word within a language. You can, for example, rotate the Finnish language latent space until it mostly aligns with the Pashto latent space. Then, to translate a word from Finnish to Pashto, you pair that Finnish word with its closest neighbor in the Pashto latent space. If we can adapt unsupervised machine translation models to orca vocalizations, we might roughly map their utterances to our own.
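The “rotate one latent space until it aligns with another” step has a classical closed-form solution: the orthogonal Procrustes problem, solved with an SVD. Below is a toy numpy sketch on made-up 2-D “embeddings” for two “languages” (real word embeddings have hundreds of dimensions and far noisier correspondences, so this only illustrates the geometry):

```python
import numpy as np

def align_spaces(X, Y):
    """Find the rotation W (an orthogonal matrix) that best maps
    embedding space X onto embedding space Y: the orthogonal
    Procrustes solution, W = U @ Vt where X.T @ Y = U S Vt."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy "languages": 5 shared concepts embedded in 2-D. Language B is
# language A rotated by 40 degrees (real data would also carry noise).
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2))
theta = np.deg2rad(40)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = A @ R

W = align_spaces(A, B)

# "Translate": map word 2 from A-space into B-space, then find its
# nearest neighbor among B's words.
query = A[2] @ W
nearest = int(np.argmin(np.linalg.norm(B - query, axis=1)))
print(nearest)  # prints 2: the query maps back to its counterpart
```

Because B here is an exact rotation of A, the recovered W reproduces that rotation and nearest-neighbor lookup recovers the matching word; in practice, unsupervised translation systems must also contend with imperfect alignments and hubness effects in the neighbor search.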
Also hoping to translate orca communications, the ORCA-SPOT research team, Deep Learning Applied to Animal Linguistics (DeepAL), recorded 89 hours of video footage of orcas with coinciding hydrophone recordings. But instead of using machine translation, DeepAL hopes to employ multimodal ML models to eventually correlate behavioral data (video) with vocalizations (audio) to derive a semantic and syntactic understanding of killer whale communication (i.e., an orca language model). Even if we fall woefully short of interpreting orca-speak, we’re bound to learn something more about killer whales along the way and find new ways for ML to amplify bioacoustics research.
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.