
How Adversarial Examples Build Resilient Machine Learning Models

By Brad Nikkel
Published Mar 30, 2023
Updated Jun 13, 2024

For all its wonders, artificial intelligence (AI) can be thrown off kilter by tweaking just a few pixels, sound waves, or words. Such tweaks have led a state-of-the-art computer vision (CV) model to confuse a 3-D printed turtle with a gun, an otherwise adept speech-to-text model to transcribe a clip of Bach’s Cello Suite No. 1 as the words “speech can be embedded in music,” and a generally accurate question-answering (QA) model to answer most “where” questions with “New York.”

Image source: A machine’s (mis)classifications of adversarial examples: red outline = classified as rifle, green outline = classified as turtle, black outline = classified as other

Inputs altered in these slight ways to make AI models produce incorrect outputs are called “adversarial examples.” And by “slight,” we mean alterations so subtle that they are often imperceptible to the human eye or ear. But because current machine learning (ML) differs from human learning, changes that are nearly invisible to us can loom as large as the Grand Canyon to ML models. When someone deliberately crafts adversarial examples to fool an ML model, we call it an “adversarial attack.”
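To make that “Grand Canyon” gap concrete, here is a minimal sketch of one of the simplest recipes for crafting adversarial examples, the fast gradient sign method (FGSM) introduced by Goodfellow et al. To keep it self-contained, it attacks a toy logistic-regression classifier rather than a DNN, and the weights, the input, and the step size `eps` are all made-up values chosen for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability that a logistic-regression model assigns to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """Fast gradient sign method against logistic regression.

    With binary cross-entropy loss, the gradient of the loss with respect to
    the input is (p - y) * w, so FGSM moves every feature by eps in the
    direction of that gradient's sign (the direction that raises the loss).
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    # (g > 0) - (g < 0) is the sign of g: +1, -1, or 0.
    # Results are clipped to a valid "pixel" range of [0, 1].
    return [min(1.0, max(0.0, xi + eps * ((g > 0) - (g < 0))))
            for xi, g in zip(x, grad)]

# Hypothetical toy model and input: 1,000 "pixels" with small alternating weights.
d = 1000
w = [0.1 if i % 2 == 0 else -0.1 for i in range(d)]
b = 0.0
x = [0.51 if i % 2 == 0 else 0.49 for i in range(d)]

x_adv = fgsm(w, b, x, y=1, eps=0.02)
print(round(predict_proba(w, b, x), 3))      # 0.731 -> class 1
print(round(predict_proba(w, b, x_adv), 3))  # 0.269 -> class 0
```

Every feature moved by just 0.02 on a 0-to-1 scale, yet the model's logit dropped by `eps` times the sum of the absolute weights (0.02 × 100 = 2), enough to flip the label. Because that shift grows with the number of input features, many tiny per-pixel changes can add up to a large change for a high-dimensional input like an image.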

Since adversarial examples for deep neural networks (DNNs) were highlighted and popularized in 2013, a maturing subfield of AI research has investigated how to make AI, especially DNNs, less vulnerable to adversarial attacks. Much of this research focuses on discovering new exploits and defending against them.

This slant toward offense stems partly from the fact that attacking AI models is easier (and likely more fun) than defending them. There are nearly infinitely many ways to produce adversarial examples, so, as in other areas of computer security, the whole affair is largely a cat-and-mouse game: an exploit is detected, people scramble to address it, a new vulnerability is discovered, researchers defend against that, and so on.

Is it the end of the world, though, if an AI model mistakes a turtle for a gun, a cello’s melody for words, or replies “New York” to every “where” question we pose it? AI models are, after all, continuous works in progress, so can’t we just go on piecemeal repairing their defects whenever we discover new adversarial examples?

Do Adversarial Examples Really Matter?

Many adversarial examples are more amusing than threatening; it’s humorous, for example, when a CV model mistakes a sofa for a sturgeon. And if someone were to adversarially attack something like Siri or Amazon’s recommendation engine, it typically wouldn’t inflict serious carnage. But it doesn’t take an overly vivid imagination to conjure scenarios where even attacks on applications as innocuous as digital personal assistants or recommendation systems could become problematic. For example, the QA model discussed above (the one tricked into treating New York as the center of the universe) was also duped into answering most “why” questions with “to kill American people,” demonstrating that adversarial attacks can readily turn otherwise benign ML applications malignant.

And it’s easy to envision how adversarial attacks could wreak havoc on safety-critical ML applications like self-driving vehicles. For example, Eykholt et al. fooled a road-sign classification model into misreading stop signs as speed limit signs, and cybersecurity researchers at McAfee tricked Teslas into accelerating to 85 miles per hour in a 35-mile-per-hour zone. Both groups deceived the machines simply by strategically placing a few stickers on road signs.

Thankfully, adversarial attacks “in the wild” (i.e., outside research labs) remain relatively rare (so far). But people increasingly rely on AI in everyday applications, including safety-critical systems in transportation, healthcare, banking, the legal system, and other fields. Semi-autonomous vehicles, cancer screening, automated fraud detection, and prison sentencing recommendation algorithms are a small sample of safety-critical ML applications vulnerable to adversarial attacks, and not exactly the kind we ought to relegate to a whack-a-mole process of finding adversarial examples and patching them. And as AI saturates more areas of our lives, it’s a safe bet that both adversarial attacks and the stakes of not adequately addressing them will climb.

Is It All Doom and Gloom?

Given what adversarial attacks could do to safety-critical systems, the prevailing attitude is, understandably, to view them as threats to be stomped out. Accordingly, most adversarial ML research focuses on detecting, understanding, and defending against adversarial examples. While these are important research areas, some researchers are setting aside this default, defensive mindset and flipping the script, looking for ways to harness adversarial attacks for good.

Image source: Invisibility Cloak in action, successfully thwarting an object detection algorithm

One such use case is resisting intrusive government (and corporate) surveillance. For example, the Invisibility Cloak, a sweater developed at the University of Maryland, evades object detection algorithms that place bounding boxes around humans in surveillance footage. Similarly, AdvHat, a carefully crafted sticker placed on a stocking cap, prevents otherwise competent facial recognition systems from recognizing the wearer’s face. Likewise, Sharif et al. developed 3-D printed eyeglass frames that fool facial recognition biometric systems, even tricking them into misclassifying the wearer as a specific person. Though these are steps in the right direction, many countersurveillance adversarial examples share a glaring flaw: they are often designed to throw off only one specific ML model.

Image source: Not exactly inconspicuous, yet effective, adversarial eyeglass frames

The Invisibility Cloak and AdvHat, for example, were specifically designed to evade YOLOv2 (an object detector) and ArcFace (a face recognition model), respectively. If you walk around any given urban environment, however, you’ll likely pass through the gaze of several different object and facial recognition models. Worse, many physically worn adversarial examples are “white box” attacks, meaning their developers had significant knowledge of the ML model they were trying to evade, a luxury most people don’t enjoy. And as if political dissidents didn’t already have enough to fret about, the Invisibility Cloak, AdvHat, and the 3-D printed eyeglass frames all sport loud, multicolored splotches reminiscent of a Vasily Kandinsky painting, a look certainly peculiar enough to make you stick out in a crowd.
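Without white-box knowledge, an attacker is stuck with “black box” methods: submit inputs, read the outputs, nothing more. The sketch below is a hypothetical toy rather than any specific published attack. It hides a logistic-regression model behind a `query` function (an assumed stand-in for a deployed model) and flips the model's prediction using queries alone, trying a small ±`eps` nudge to one feature at a time and keeping only the nudges that lower the model's confidence.

```python
import math
import random

def make_black_box(w, b):
    """Hide a (hypothetical) logistic-regression model behind a query function.

    The attacker can call query(x) to get the class-1 probability, but never
    sees w, b, or any gradients.
    """
    def query(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))
    return query

def random_search_attack(query, x, eps, seed=0):
    """Query-only attack: visit features in random order, try +eps then -eps,
    and keep a change only if it lowers the model's class-1 probability."""
    rng = random.Random(seed)
    x_adv = list(x)
    p = query(x_adv)
    n_queries = 1
    order = list(range(len(x)))
    rng.shuffle(order)
    for i in order:
        if p < 0.5:  # the predicted label has flipped; stop spending queries
            break
        for step in (eps, -eps):
            candidate = list(x_adv)
            candidate[i] = min(1.0, max(0.0, x[i] + step))
            p_new = query(candidate)
            n_queries += 1
            if p_new < p:
                x_adv, p = candidate, p_new
                break
    return x_adv, p, n_queries

# The attacker holds only `query`; the weights are hidden inside it.
d = 1000
query = make_black_box([0.1 if i % 2 == 0 else -0.1 for i in range(d)], b=0.0)
x = [0.51 if i % 2 == 0 else 0.49 for i in range(d)]
x_adv, p_adv, n_queries = random_search_attack(query, x, eps=0.02)
print(f"p(class 1) = {p_adv:.3f} after {n_queries} queries")
```

The attacker never touches the weights, but it pays in queries: one or two per feature it modifies here, and published black-box attacks on real image models often need thousands, which is one reason they take more effort to mount than white-box attacks.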

Shake a (Machine Surveillance) Tail with Data Poisoning

Fawkes, a free, publicly available tool developed by the University of Chicago’s Security, Algorithms, Networks, and Data (SAND) Lab, takes an entirely different approach called “data poisoning”: an adversarial attack method designed to confuse a wide swath of ML models rather than one specific model.

SAND developed Fawkes specifically in response to revelations about the invasive American surveillance company Clearview AI, which notoriously scraped people’s social media images (more than 10 billion as of 2021) without their explicit knowledge or permission and used them to train a massive facial recognition system. Operating in a legal grey area and enjoying the tacit protection that often accompanies government-related contracts, Clearview AI continues raising cash from anonymous investors, despite significant backlash as the public grew wise to its shenanigans.

While some places (Illinois, Canada, and Europe, among others) have enacted legal measures to curb such careless use of public photos for training facial recognition models, many people worldwide lack sensible privacy legislation. SAND built and released Fawkes to give those people a means to resist intrusive surveillance.

How Does Fawkes Work Its Magic?

This sounds great, but how does Fawkes actually trick DNNs into misclassifying faces? Fawkes makes small, strategic changes to the pixels in your photos (too subtle for you to notice) that cause facial recognition DNNs to become (mistakenly) confident that certain nondescript features are what make you you. How does Fawkes decide which pixels to modify? It uses a dissimilar face as a guide. Suppose Fawkes needed to cloak Harriet Tubman’s photos from DNN models; it might search Abraham Lincoln’s photos to identify features of Abe’s that are (to DNNs) very different from Harriet’s. Fawkes then determines which of Harriet’s pixels it can tweak so that Harriet looks as much like Abe as possible (to DNNs). To keep facial recognition models from catching on to the obfuscation, Fawkes alters different features in each of Harriet’s photos.
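At its core, this is an optimization problem: find a small pixel change that pulls a photo's features toward those of a dissimilar target face. Here is a toy sketch of that idea under loudly stated assumptions, not Fawkes's actual implementation: the feature extractor is a hypothetical linear map standing in for a deep network, the 8-pixel “photos” of Harriet and Abe are invented, and a simple per-pixel limit `rho` stands in for the perceptual budget Fawkes uses to keep its changes hard to see. It runs projected gradient descent: step downhill on the squared feature distance, then clip the change back into the budget.

```python
def features(W, x):
    """Hypothetical stand-in for a face recognition feature extractor: the
    linear map W @ x. (Fawkes itself relies on a deep network here.)"""
    return [sum(w_kj * x_j for w_kj, x_j in zip(row, x)) for row in W]

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def cloak(W, x, target, rho, steps=100, lr=0.05):
    """Projected gradient descent on ||features(x + delta) - features(target)||^2.

    Each pixel change is clipped to [-rho, rho] (an assumed L-infinity budget)
    and every cloaked pixel stays in [0, 1].
    """
    t_feat = features(W, target)
    delta = [0.0] * len(x)
    for _ in range(steps):
        x_c = [xj + dj for xj, dj in zip(x, delta)]
        diff = [f - t for f, t in zip(features(W, x_c), t_feat)]
        # Gradient of the squared feature distance w.r.t. the pixels: 2 * W^T diff.
        grad = [2 * sum(W[k][j] * diff[k] for k in range(len(W))) for j in range(len(x))]
        for j, g in enumerate(grad):
            dj = max(-rho, min(rho, delta[j] - lr * g))      # step, then project onto the budget
            delta[j] = max(0.0, min(1.0, x[j] + dj)) - x[j]  # keep the pixel valid
    return [xj + dj for xj, dj in zip(x, delta)]

# Hypothetical 8-pixel "photos" and a 2-D feature space.
W = [[1, -1, 1, -1, 1, -1, 1, -1],
     [1, 1, -1, -1, 1, 1, -1, -1]]
harriet = [0.5] * 8
abe = [0.9, 0.1, 0.5, 0.5, 0.9, 0.1, 0.5, 0.5]  # the dissimilar target face

cloaked = cloak(W, harriet, abe, rho=0.05)
print(round(sq_dist(features(W, harriet), features(W, abe)), 2))  # 2.56 (before)
print(round(sq_dist(features(W, cloaked), features(W, abe)), 2))  # 1.44 (after)
```

In this toy, a change of at most 0.05 per pixel moves Harriet's features only part of the way toward Abe's. Real face recognition networks are far more sensitive to small pixel changes, which is precisely the weakness Fawkes exploits.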

Image source: Fawkes’s cloaked and uncloaked images

Once Fawkes has cloaked your photos, you upload them as usual. If (or, more likely, when) an unscrupulous actor scrapes them and feeds them into a facial recognition model as training data, your images stand a chance of befuddling that model, and you, in turn, stand a chance of keeping your privacy. Fawkes is primarily geared toward muddying models that have yet to be trained; SAND reports over 95% protection in that setting. But SAND also points out that companies routinely update their facial recognition models, so even a model that currently recognizes your face might eventually misclassify it if it is later fed enough of your cloaked images.
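The payoff comes at training time. The sketch below illustrates that mechanism with a deliberately tiny, hypothetical setup: a nearest-centroid “face classifier” over made-up 2-D feature vectors, where the cloaked feature values are simply assumed rather than computed. It is not how Clearview AI or any real system works. A model trained on clean photos recognizes a new, uncloaked photo of Harriet; a model trained on her cloaked photos has learned the wrong region of feature space for her, so the same photo lands closer to someone else.

```python
import math

def train(dataset):
    """'Train' a toy nearest-centroid face classifier: one mean feature vector per identity."""
    model = {}
    for name, points in dataset.items():
        n = len(points)
        model[name] = tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))
    return model

def predict(model, feat):
    """Return the identity whose centroid is closest to the feature vector."""
    return min(model, key=lambda name: math.dist(model[name], feat))

# Hypothetical 2-D face features of scraped photos.
others = {
    "abe": [(5.0, 5.0), (5.1, 4.9), (4.9, 5.1)],
    "carl": [(-2.0, 0.5), (-2.1, 0.4), (-1.9, 0.6)],
}
harriet_clean = [(0.1, 0.0), (-0.1, 0.1), (0.0, -0.1)]
# Hypothetical cloaked versions: the cloak has dragged their features toward Abe's.
harriet_cloaked = [(4.0, 4.0), (3.9, 4.1), (4.1, 3.9)]

model_clean = train({"harriet": harriet_clean, **others})
model_poisoned = train({"harriet": harriet_cloaked, **others})

new_photo = (0.05, 0.05)  # features of an uncloaked photo of Harriet, e.g. from a street camera
print(predict(model_clean, new_photo))     # harriet
print(predict(model_poisoned, new_photo))  # carl
```

Note that the poisoned model still has a “harriet” label; it has simply attached that label to the wrong part of feature space, which is the essence of poisoning training data rather than attacking a finished model.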

Endless Possibilities

Researchers may eventually give deep neural networks defenses against adversarial attacks that render Fawkes useless. For now, though, that day seems far off. This means that in societies subject to intrusive ML-based surveillance (i.e., much of the world), the average person has a fighting chance against the leviathan. As Fawkes, the Invisibility Cloak, AdvHat, and many other adversarial countermeasures demonstrate, with some experimentation, math, and daring, citizens can, at the very least, toss a few wrenches into their surveillance overlords’ ML models.

Imagine, for example, an application that automatically likes and shares a handful of specific Facebook posts, tweets, TikTok videos, and so forth, so that your online profile looks unchanged to other humans but completely different to ML models. Such an application might digitally morph you from a political dissident into a model citizen, or turn a predictable (and exploitable) consumer profile into an enigmatic black box.

Developing such applications would be an arduous, nearly quixotic task, for at least two reasons. First, it would require designing and automating numerous “black box” adversarial attacks (meaning the attackers have neither direct access to nor knowledge of their target ML models). Second, researchers developing such applications might find themselves shut out by corporate and government research patrons. But our world needs more courageous researchers willing to use adversarial attacks to tilt at windmills, dedicating themselves, as SAND did with Fawkes, to creating digital tools that give people a chance (even if a microscopic one) at masking themselves from the machine surveillance systems encroaching ever further into our lives.
