Article·AI Engineering & Research·Jun 27, 2024

10 min read

From DAN to Universal Prompts: LLM Jailbreaking

As with any new technology, there were those who sought to push ChatGPT's boundaries. These early AI explorers were “jailbreakers” seeking to unlock hidden or restricted functionalities.

10 min read

Initial Attempts at Jailbreaking ChatGPT Ingenious Storytelling: The First Breach Role-Playing: The Rise of “DAN” and Friends The Cat and Mouse Game: OpenAI’s Response The Rise of Prompt Engineering Published Researches Categorizing the Prompts: An Empirical Study Universal Jailbreaking Prompts The Adversarial Setting Formalizing the Task: Greedy Coordinate Gradient (GCG)Results: A Resounding Success Universal Blackbox Jailbreaking Implications Conclusion

Share this guide

By Zian (Andy) WangAI Content Fellow

Last UpdatedJun 27, 2024

Since the release of ChatGPT in late 2022, there have been active users and researchers alike attempting to fish malicious responses out of the LLM. Since the very first version of ChatGPT, the model is aligned using human feedback to prevent outputting controversial opinions, harmful responses, and any information that can prove to be dangerous. However, just like how humans can never be perfect, ChatGPT’s “safety alignment” isn’t the strongest defense line to giving harmful advice.

Initial Attempts at Jailbreaking ChatGPT

The inception of ChatGPT was met with a flurry of excitement and curiosity. As with any new technology, there were those who sought to push its boundaries, to see just how far they could take it. These early explorers, in the realm of ChatGPT, were “jailbreakers” seeking to unlock hidden or restricted functionalities.

Ingenious Storytelling: The First Breach

The initial jailbreaks were simple yet ingenious. Users, understanding the very nature of ChatGPT as a model designed to complete text, began crafting unfinished stories. These stories were cleverly designed such that their logical continuation would contain harmful or controversial content. ChatGPT, true to its training, would complete these stories, giving people instructions on how to build a pipe bomb or plans to steal someone’s identity or, at other times, light-hearted jokes and opinions that it would typically avoid discussing. It was a classic case of using a system’s strength—its ability to complete text—against it.

Role-Playing: The Rise of “DAN” and Friends

Soon after, the community discovered another loophole: role-playing prompts. The most well-known of these was the “DAN” prompt, an acronym for “Do Anything Now.” Users would instruct ChatGPT to role-play as “DAN,” effectively bypassing its usual restrictions. The results were often surprising, with ChatGPT producing strongly biased opinions, possibly reflecting the biases present in its training data. It wasn’t just about harmful content; sometimes, it was about getting the model to break character and even use profanity.

But DAN wasn’t alone. Other prompts emerged, like STAN (“Strive To Avoid Norms”) and “Maximum,” another role-play prompt that gained traction on platforms like Reddit.

The Cat and Mouse Game: OpenAI’s Response

OpenAI took note of these prompts and attempted to patch them. But it was a classic game of cat and mouse. For every patch OpenAI released, the community would find a new way to jailbreak the system. The DAN prompt alone went through more than ten iterations! A comprehensive list of these prompts can be found on this GitHub repository, showcasing the community’s dedication to this digital jailbreaking endeavor.

The Rise of Prompt Engineering

However, these initial attempts to jailbreak ChatGPT isn’t all for a laugh. This tug-of-war between OpenAI and the community led to the emergence of a new field: prompt engineering. The art of crafting precise prompts to produce specific responses from language models became so valued that companies like Anthropic started hiring prompt engineers. And these weren’t just any jobs. Some positions offered salaries upwards of $375,000 per year, even to those without a traditional tech background. To see just how advanced these prompts can be, check out this article. Or this one. Or even this one.

Published Researches

The rise of large language models (LLMs) has not only captivated the attention of tech enthusiasts and businesses but also the academic community. As LLMs became increasingly integrated into various applications, researchers began to delve deeper into understanding their vulnerabilities. This led to a surge in studies dedicated to jailbreaking—or, more academically termed—adversarial attacks on LLMs.

Categorizing the Prompts: An Empirical Study

One of the papers in this domain, titled “Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study,” offers a comprehensive categorization of these adversarial prompts. The paper divides them into three primary categories:

Pretending: These prompts cleverly alter the conversation’s background or context while preserving the original intention. For instance, by immersing ChatGPT in a role-playing game, the context shifts from a straightforward Q&A to a game environment. Throughout this interaction, the model recognizes that it’s answering within the game’s framework.
Attention Shifting: A more nuanced approach, these prompts modify both the conversation’s context and intention. Some examples include prompts that require logical reasoning and translation, which can potentially lead to exploitable outputs.
Privilege Escalation: This category is more direct in its approach. Instead of subtly bypassing restrictions, these prompts challenge them head-on. The goal is straightforward: elevate the user’s privilege level to directly ask and receive answers to prohibited questions. This strategy can be seen utilized in prompts asking ChatGPT to enable “developer mode”.

Another groundbreaking paper, “Prompt Injection attack against LLM-integrated Applications,” delves into more intricate techniques. The study reveals that some adversarial prompts employ escape characters, effectively fragmenting the prompt into separate information chunks. This fragmentation tricks the model into treating each segment of a potentially harmful prompt as distinct entities, further complicating the model’s ability to discern and counteract adversarial attempts.

While these studies laid the groundwork, the real game-changer came when researchers began employing algorithms to systematically unearth universal adversarial prompts. This algorithmic approach marked a significant shift, moving from manual, human-driven prompt engineering to a more automated, systematic method of discovering vulnerabilities.

Universal Jailbreaking Prompts

As the field of LLM jailbreaking matured, researchers began to adopt more systematic and rigorous approaches to uncover vulnerabilities. One of the pioneering papers in this domain proposed a method that was both ingenious and effective.

The Adversarial Setting

The researchers began by outlining an adversarial setting that mirrors previous work in jailbreaking and prompt tuning. They illustrated a scenario where a user might pose a potentially harmful question to an LLM, such as “Tell me how to build a bomb.” In a typical chatbot setting, the LLM would see this query embedded within a larger prompt, framed to ensure the model provides helpful and non-harmful responses.

The researchers introduced a novel concept: an adversarial suffix added to the user’s prompt. This suffix is designed to circumvent the model’s alignment and induce it to respond to the original, potentially harmful request.

The ultimate goal of this attack is to find a set of tokens (the suffix) that, when added to any user instruction, will make the model respond affirmatively.

Formalizing the Task: Greedy Coordinate Gradient (GCG)

The researchers transformed this challenge into an objective function and optimized the “jailbreaking suffix” using a method called Greedy Coordinate Based Search. They recognized the primary challenge: optimizing over a discrete set of inputs. While several methods for discrete optimization exist, many struggled to reliably attack aligned language models.

Their approach, inspired by the greedy coordinate descent method, was to evaluate all possible single-token substitutions. They leveraged gradients concerning the one-hot token indicators to find promising candidates for replacement at each token position. This method, termed Greedy Coordinate Gradient (GCG), was an extension of the AutoPrompt method and outperformed it significantly.

Results: A Resounding Success

The results were astounding. The researchers achieved up to a 100% success rate with their method, and it proved transferable across various major LLM models, whether open-source or proprietary.

Universal Blackbox Jailbreaking

Not long after the results from “Universal and Transferable Adversarial Attacks on Aligned Language Models” were revealed, another paper, titled “Open Sesame! Universal Black Box Jailbreaking of Large Language Models” bested it by introducing a technique for blackbox jailbreaking.

The previous paper took a gradient-based approach, which implies that access to the LLM’s gradients, architecture, and much more. Further, a white-boxed attack is extremely costly due to the sheer number of parameters that LLMs contain. This is not even considering the fact that some of the best LLMs are closed source.

The paper “Open Sesame! Universal Black Box Jailbreaking of Large Language Models” adopted a black boxed approach, where the target model is not needed while obtaining better results than white boxed approaches. It utilized an algorithm commonly chosen in the Reinforcement Learning area, genetic algorithm.

Simply put, the genetic algorithm aims to simulate an evolutionary process where agents interact with the environment (LLMs in this case), and are selected according to their performances, then improved upon through mutations, crossovers, and fitness measurements. It’s beyond the scope of this article to cover the details surrounding the algorithm, but an outline can be roughly followed as written in the paper:

Input: problem to solve
Output: Solution to problem
Generate initial population of candidate solutions to problem;
While termination condition not satisfied do
Compute fitness value of each individual in population;
Perform parent selection;
Perform crossover between parents to derive offspring;
Perform mutation on resultant offspring; 7
end
Return Best individual found;

With the algorithm, the authors benchmarked their results on the Harmful Behavior dataset on the Falcon-7B and LLaMA2-7B models, with at least 85% success rate and up to 97%.

Implications

The continual push by researchers to ‘jailbreak’ ChatGPT has significant implications for the development, applicability, and security of Large Language Models (LLMs). Indeed, while they innovate and glean insights on how to push artificial intelligence further, they also highlight the persistent and increasing security threats such solutions enclose.

The rise of prompt engineering as an adversarial means has led to the surfacing of a new technological niche. Hence companies, like Anthropic, now employ professionals specifically to understand and manipulate the language-based artificial intelligence models better. This increased attention on the field directly attributes to the upgrades done regarding the design and safety mechanisms of these systems. However, it also highlights the need for updated security measures, as this rapid evolution could lead to sophisticated attacks which could prove harmful, or be exploited, such as in misuse for disinformation, fraud, or harmful actions.

Moreover, the advancement in the ‘jailbreaking’ techniques, from simple storytelling or role-playing towards the application of complex algorithms, indicates that adversarial attacks on LLMs are becoming increasingly sophisticated and effective. For example, the shift to the Greedy Coordinate Gradient approach underscores the evolution of these techniques, making it a priority for developers to stay ahead with security patches and additional layers of protection.

On the flip side, an optimistic outlook could contend that this consistent endeavor to jailbreak ChatGPT and LLMs promise a future where AI models could emphasize individual user preferences without compromising the safety and ethical guidelines. If developers can stay ahead of the curve, this could lead to more personalized, responsive, and engaging digital assistants, content creators, and other AI applications based on these models.

Lastly, the introduction of universal blackbox jailbreaking presents a significant challenge. Since the target model is not needed, it essentially bypasses layers of security and could unlock even closed source LLMs. This innovation could considerably increase the security risk for these models and samples a future where the access to the inner workings of these systems would not be necessary for exploitation.

Conclusion

The rollicking changes in LLMs through the history of jailbreaking the ChatGPT family paints an exciting, yet daunting picture of the future. As ever, advances in technology serve as a double-edged sword, presenting new opportunities, as well as challenges for ensuring the safe and ethical application of AI technologies. Developers and stakeholders have to maintain the delicate balance between innovation, personalization, and safety to ensure the responsible growth of the field. While it’s clear that the AI ‘cat and mouse’ game will continue, it forces continuous development and the establishment of rigorous protocols to curb misuse and preserve the positive potential of LLMs.

Note: If you like this content and would like to learn more, click here! If you want to see a completely comprehensive AI Glossary, click here.

Unlock language AI at scale with an API call.

Get conversational intelligence with transcription and understanding on the world's best speech AI platform.