The Art of AI: Crafting Masterpieces with Prompt Engineering and LLMs
Jose Nicholas Francisco
I heard the word “Prompt Engineering” for the first time about a month after ChatGPT came out. Initially, I thought it was a joke. After all, how could you “engineer” a prompt? In my mind, engineering involves striking a nail with a hammer to ensure that a bridge remains stationary in the face of a hurricane. In its most sedentary form, engineering should entail Python frameworks, and Redis, and firewalls, and Kubernetes.
How could you possibly be an engineer when your only tool is natural language?
Well, it turns out that natural language is a hell of a tool. And I was naive to think any less of it. Shame on me, honestly. After all, I’m a writer. From Thomas Paine’s Common Sense to the Treaty of Paris (any one of them), words have the ability to incite revolutions and end wars.
And today, we’ll be using them to squeeze the most juice out of an AI as we can.
The pen is mightier than the sword… but what about hammers and nails?
To understand prompt engineering, we first need to view words as a tool. That’s simple enough to digest at a high level, but when it comes to applying this mentality in the field, the details become rather nuanced.
Let’s be honest. We only talk to AI for one reason: To make it generate some desired output. Whether it’s art or code, the goal of communicating with any model (at this point in time, at least) is to make it produce. And prompts are the only way we can control this output without cracking the AI open and modifying its weights.
However, the art and text that AI produces can only be as good as the prompt that the user input. Or, more specifically, the AI’s output is only as good as its prompter. That’s you.
Much like a carpenter, a prompt engineer’s skill manifests itself in the quality of their work. A mediocre prompt engineer creates mediocre art/writing/code. Meanwhile, a great one creates great ones. Take a look at the image below, for example. The image on the left was created with Stable Diffusion, with the mere prompt of “A red bird flying through the sky.”
Sure, the resulting image is indeed a red bird flying through the sky. But the actual image itself is rather lackluster. The bird looks realistic while the sky looks cartoony. The bird’s tail seems to be positioned as if it was a pair of wings. And the bird’s actual wings are pressed against its body as if it were grounded and stationary.
With a better word-choice, we can achieve the image on the right. A bird with a single tail and two clearly defined wings flapping through the air. The sky and the bird itself actually match each other, this time—the wispy brush strokes of the clouds complementing the similar brush strokes of the bird’s feathers. And, of course, our well-engineered bird rocks a beautiful set of head feathers.
So, what magic words did we use to turn a mediocre image into a more cohesive one? Let’s delve into it:
The Parts of a Prompt
Alright, let’s make sure we’re all on the same page. First thing’s first, the formal definition of a prompts: Prompts are instructions given to an LLM to enforce rules, automate processes, and ensure specific qualities (and quantities) of generated output.
And prompts should have some (non-empty) combination of the following elements: Instructions, Questions, Input Data, and Examples.
And what’s cool about these elements is that they don’t take on any special definition in the prompt-engineering space. That is, a Question in the context of prompt engineering is the same thing as a Question in daily life. It’s just a question. Nothing more complex than that. And the same goes for instructions, input data, and examples.
So with that in mind, let’s take a look at the different flavors of prompt engineering.
The Flavors of Prompt Engineering
There are literally dozens of resources that point out the different flavors of prompt engineering. From quantitative researchers to OpenAI engineers to academics and their respective institutions, here are the ones that they all touch on.
This is probably the most basic that a prompt can get. It’s literally just giving instructions to the LLM. For example, if you want an LLM to engage in Sentiment Analysis, you’d simply supply the text you want it to analyze alongside an instruction saying “Label the Sentiment of this sentence as positive or negative.”
Or maybe something like, “Given the following bullet-point information about me, write a 650-word college essay. That being said, a language model is not a search engine and it will fail to give you actionable outputs if your inputs are 2-3 words long. Thus, to make your Instruction Prompting as effective as possible, use verbs in your prompt: Translate, Interpret, Compose/Write/Create, Explain.
In fact, we even used AI to attain a list of effective Instruction Prompting Verbs.
Here are the Top Ten Instruction Prompting Verbs that our team elicited from GPT-4: Ask, Explain, Summarize, Generate, Translate, Predict, Suggest, Compare, Define, Describe.
By using commanding, imperative verbs, LLMs will be more likely to produce desired outputs, especially if the LLM in question was finetuned for instructions itself, like InstructGPT and NaturalInstruction. In fact, I used “Instruction Prompting” to create our YouTube channel’s most-viewed Short. It’s my face reciting some AI-generated words. Note that this video was a zero-shot example—as opposed to few-shot. And as we can see, the results were pretty effective!
Persona Pattern Prompting
This is the type of prompting where you tell the AI to mimic a particular persona. For example, we can say: "You are a perceptive editor with experience at The New Yorker and Harper's Magazine. I'll submit 3 paragraphs in my next message, and I need help rephrasing the second paragraph. Take the first and final paragraphs into consideration, since they provide important context. Got it? Let's go."
Here, we operate under the assumption that the AI has indeed read The New Yorker and Harper's Magazine. And given the dataset that many of these LLMs are trained on, this assumption seems pretty fair. (Take a look at “The Pile” dataset below for a glimpse at the depths of data that a modern-day LLM has to read, at minimum.)
Note that the results of Persona Pattern Prompting rely heavily on the way you describe the persona. For example, if we’re editing a section of some article we’re writing, we’d get a different output if we say “You’re an editor from The New York Times” than if we said “You’re an editor from MAD Magazine.”
Likewise, we’d get different outputs if we said “You’re the Editor-in-Chief at Fox News” versus “You’re the Editor-in-Chief at CNN.”
And here’s the cool part about Persona-pattern prompting: It’s such a popular technique that researchers working on the latest LLMs are taking it into account when designing new pretraining methods. In fact, Meta, when designing Llama-2, invented a new technique called “Ghost Attention” that assumes users will use their chatbots in this way.
They even made the Llama-2 chatbot undergo some serious persona testing before releasing it into the public. In their 78-page research paper, Meta spends a good amount of time showcasing how incredible Llama-2 is at impersonating Oscar Wilde. Check out the results of that testing in the image below, or in Figure 10 of the Llama-2 paper, here.
Personally, I use persona prompting to create titles for my content. That is, I feed a very specially designed prompt to an AI in order to create the titles for my articles and YouTube videos. In fact, even the click-baity thumbnail text comes from LLMs. See some examples below.
Note that our YouTube channel is relatively young, so we’re doing all we can to make our content as discoverable as possible. What’s cool is that the videos with AI-generated titles, thumbnails, and clickbait actually perform better than the ones on our channel that don’t.
In Chain-of-thought prompting, we force the model to explain its reasoning when answering a question. In this classic example from Lilian Weng (image below), we show an AI how to walk through the steps of a simple math problem. Basically, we force the AI to show its work rather than giving a straight answer.
The first two examples about hill-climbing and soccer are human-written examples that the LLM will use to guide its output generation for the final question about ribbon-cutting.
Personally, I use CoT prompting more often than any other form of prompt engineering.
Why? Because it’s how I debug my code.
You may have seen this video we made in the past. In it, I test which LLM is the best at coding—Github Copilot, StarCoder, ChatGPT, or Gorilla. And when I tested the models to debug some faulty code, I used Chain-of-Thought engineering to force it to explain where the error is and how to fix it.
For example, let’s look at how ChatGPT fixes faulty Python code.
In this video, I challenge various LLMs to write code that removes all the vowels from a user-input string. It turns out, ChatGPT did a pretty good job. However, Copilot and StarCoder far surpass ChatGPT in code-generation. Where our beloved chatbot shines, however, is in debugging faulty code.
To test it, I submitted a buggy solution to this problem to ChatGPT. That “solution,” by the way, looks like this:
Found the bug? It’s kinda subtle. If you want to solve for what the bug is yourself, feel free to parse the code before reading ChatGPT’s breakdown.
Long story short, the breakdown was perfect. Not only did ChatGPT tell me what the bug was, but through (zero-shot) Chain-of-Thought prompting, it was also able to fix it.
And by the way, if you want to see which AI is indeed the best at coding, check out the video itself 😉.
Okay, now that we know about CoT decoding, we can build off of it to arrive at “Self-Consistency Decoding.”
This method performs several CoT rollouts, then selects the most commonly reached conclusion out of all the rollouts. If the rollouts disagree by a lot, a human can be queried for the a correct chain of thought. That is, you just do CoT prompting repeatedly and pick the best result.
The way you pick the best result—that is, the criteria you’re using, the algorithms you’re employing, and the examples you’re picking from—can vary from task to task.
But here’s the punchline: “Self-Consistency Decoding significantly improves accuracy in a range of arithmetic and commonsense reasoning tasks, across four large language models with varying scales.”
That being said, one limitation of self-consistency is that it incurs more computation cost. In practice people can try a small number of paths (e.g., 5 or 10) as a starting point to realize most of the gains while not incurring too much cost, as in most cases the performance saturates quickly, as seen in this figure from the paper (Figure 2).
Prompting for Image Generation
Finally, let’s talk about Prompting for Image Generation.
As of right now, if you want to use text-to-image generators like StableDiffusion or Dall-E, you’re stuck with a single prompt and the inability to give few-shot examples. However, here are some resources I’ve found to help you with generating images.
Let’s use those resources right now to see how we arrived at the two red-bird images from earlier.
We’ll start with a basic prompt and iterate on it. As mentioned before, the initial, no-effort, barebones prompt was “A red bird flying through the sky.” And the image produced was our strange cartoon phoenix.
We can do better than that. First, let’s talk about word-choice.
In this Github Repo, we find words that we should use when prompting an AI art-generator (see image below). These are words that historically have performed well when other AI artists needed to create fantastical images for anything from logos, to D&D games, and even to gaining Reddit karma.
So with that in mind, let’s add some of those words to our prompt as follows:
All I did was take one term from each of the categories mentioned in the github repo’s lists and concatenate them. The resulting bird looks like this:
Okay, not bad! The sky looks more realistic now, and the tail is no longer split in two. However, the wings are a bit awkward. It seems we have one wing pressed against the right side of the bird’s body and another wing mid-flap that also seems to stem from the right side of the bird’s body.
We’ve improved, but there’s still room for improvement. Let’s keep going.
Of course, we can just continue iterating on our prompt, adding more and more words. However, there are some additional techniques we can take advantage of. Namely, weights and negative prompts.
Weights allow you to dictate to the AI image generator exactly how heavily you want each item in our list of words to be valued. You can see a classic example of this in the github repo linked above. If we want to create an animal that’s 70% Shiba Inu and 30% Polar Bear, we can simply include in our prompt “A hybrid between a Shiba inu:0.7 and a polar bear.”
Since weights must add up to 1.0, the polar bear’s 0.3 coefficient is implied.
Now, let’s discuss negative prompts. Basically, if a prompt is a list of terms you’d want the AI to take into consideration when generating an image, a negative prompt tells the AI exactly what you want the image NOT to be.
If you want your image to be beautiful, then include “ugly” in your negative prompt. If you want it to be realistic, then include “unrealistic” in your negative prompt. As a result, our prompt now becomes this:
And the resulting image looks like this:
A few notes on this image: The 0.95 weights added to the prompt was the result of a manual hyperparameter search that I conducted. I wanted our image to avoid delving into uncanny valley territory while still giving off the same energy as a realistic painting.
Note that there is indeed one final “technique” you can use: Stable Doodle. This app allows you to supply not just words but also a sketch to the AI. As a result, you can combine all the previous prompt-engineering techniques with the sketch to yield an image more specific than plain, out-of-the-box StableDiffusion can provide.
The example below illustrates the power of Stable Doodle and castle-drawing, for example.
And when you combine Stable Doodle with the techniques listed above for our red bird example, you arrive at this image:
Admittedly, I like the non-Stable-Doodle image a little better, but hey, we can always iterate!
It turns out that words are a valid engineering tool. A hammer is only as effective as the carpenter holding it. The same goes for a chef with his pan, a SWE with her keyboard, and an AI enthusiast with their LLM.
Hopefully you’ve learned something new from this article. And if you have another prompt engineering technique that wasn’t listed, don’t hesitate to share! Our social media is always open to DMs. And speaking of which…