Getting Large Language Models (LLMs) to do what we want can be tough, especially for mathy, symbolic, or commonsense reasoning tasks. Chain-of-Thought (CoT) prompting—a prompt engineering technique that encourages LLMs to decompose large problems into smaller chunks—helped LLMs improve at these types of complex tasks so much that it spawned a slew of spinoffs seeking to improve on the original. But CoT and its siblings suffer from a glaring flaw—a lot hinges on that first thought. If it’s off-kilter, so is the rest of the chain.

Sometimes we (humans) need to explore several divergent threads from the start, encounter a few dead-end thoughts, rule those out, and retrace our steps back to some thought-fork we previously took before forging ahead down untrodden paths until, eventually, we find a solution. This type of thinking resembles a tree more than a chain. Tree-of-Thoughts (ToT)—a riff on CoT—is a prompt engineering approach that evokes this branching-out style of thinking in LLMs so that they can transcend left-to-right generation and tackle problems in a more human-like, trial-and-error fashion.

Image Source: Yao et al. ToT compared with standard Input-Output (IO) and CoT prompting; Squares = LLM “thoughts” (i.e., intermediate responses)


Before we get into ToT prompting, let's clear up a minor point. Because the "X-of-Thought" naming convention offers a rather limited namespace, two papers describing similar ToT prompting paradigms hit arxiv.org within a few days of each other. In the first paper, "Large Language Model Guided Tree-of-Thought," Jieyi Long, CTO and co-founder of Theta Labs, describes a rules-based control mechanism that systematically guides LLMs to solve Sudoku puzzles by treating intermediate puzzle states as "thought" nodes in a search tree. In the second paper, "Tree of Thoughts: Deliberate Problem Solving with Large Language Models," Princeton University (go Tigers!) and Google DeepMind researchers Yao et al. also fuse tree search algorithms with LLMs.

Both approaches share the same general objective: to boost LLMs' problem-solving capacities by letting them explore multiple reasoning paths via trees. And both augment LLMs with similar (albeit distinct) modules to accomplish this goal. We'll focus on Yao et al.'s work, however, because they tested their variant on three different tasks (a mathematics task, a creative writing task, and a crossword puzzle task), whereas Long only tested his approach on one task (Sudoku), and because Yao et al.'s paper is more widely cited. With that out of the way, let's delve into ToT.

Modeling Thinking as a Tree

ToT is supposed to mimic something like human brainstorming—creating and considering diverse ideas. But, given that LLMs are inherently autoregressive (predicting any next token based on all that token’s preceding tokens), how could they possibly rise above left-to-right thinking?

To nudge LLMs away from linear thinking, Yao et al. drew inspiration from research on human cognition suggesting that we think in two "dual-process" modes: a fast, intuitive, largely unconscious mode ("System 1") and a slow, effortful, conscious mode ("System 2"). LLMs' default sequential generation (standard IO prompting) resembles System 1, but LLMs currently lack the more deliberate System 2. Yao et al. used Newell and Simon's work from the 1950s—which viewed human problem solving as a tree search over a set of possibilities—as a blueprint for simulating System 2 thinking in LLMs.

What would a more arboreal (read: tree-like) LLM even look like, though? Let's sketch it out. Suppose we ask an LLM to solve a complex problem. On Yao et al.'s tree, a "thought" is an LLM response (typically a few sentences at most) serving as an intermediate step toward solving the larger problem. Each thought is a node. Starting from the top of the tree, let's say an LLM generates three candidate thoughts (nodes). From each of these upper three nodes (parents), our LLM could branch out further, creating lower nodes (children). A branch extending from a parent node to a child node represents building on the parent thought to produce the child thought, and so on.
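
To make this picture concrete, here is a minimal sketch of a thought node in Python. This is our own illustration, not code from the paper; the value field anticipates the state evaluator we'll meet shortly.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Thought:
    """One node in a tree of thoughts: a short, intermediate LLM response."""
    text: str                                   # the partial solution so far
    parent: Optional["Thought"] = None
    children: list["Thought"] = field(default_factory=list)
    value: float = 0.0                          # filled in later by a state evaluator

    def expand(self, candidate_texts: list[str]) -> list["Thought"]:
        """Branch out: attach each candidate next thought as a child node."""
        self.children = [Thought(text=t, parent=self) for t in candidate_texts]
        return self.children
```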

Using this rough cognition-as-a-tree model, Yao et al. argue that standard LLM problem-solving approaches fail on two fronts:

  1. When LLMs are at some thought (node in a tree), they fail to expand that thought into several potential directions (the way we might consider a few different routes to bypass a traffic jam from a given point).

  2. LLMs don’t take advantage of global shortcuts like lookahead and backtracking. We don’t need to explore every possible solution in a crossword puzzle, for example, because we can eliminate words too long to fit in too few spaces or consider how a word we have in mind might affect other words we already wrote down (and vice versa). Ideally, LLMs ought to do the same.

Yao et al. hypothesized that tree structures might remedy these problems by giving LLMs a system that, given a current thought (node), generates numerous possible next thoughts (instead of just blindly pouncing on the most likely next thought) and then weighs these candidate next thoughts’ potential against a global goal, retracing or skipping ahead steps as needed. This all seems reasonable enough to try, but how can an LLM be reworked into a tree structure?

Crafting an LLM Tree

Yao et al. used four modules to give LLMs tree-like properties (sketched in code after this list):

  1. Thought decomposer: breaks large problems into smaller steps (also present in CoT)

  2. Thought generator: given some current thought, creates a set of different candidate next steps for solving some problem—the more divergent these are, the better

  3. State evaluator: assesses the merits of candidate thoughts

  4. Search algorithm: systematically explores a tree of thoughts (e.g., via Breadth-First Search, Depth-First Search, A*, etc.)
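
Here is a rough, hypothetical sketch of the first three modules as Python functions. The prompts, function names, and the llm placeholder are all ours, not Yao et al.'s; the fourth module, search, gets its own sketches below.

```python
def llm(prompt: str, n: int = 1) -> list[str]:
    """Placeholder for any text-in/text-out completion API; returns n replies."""
    raise NotImplementedError("wire this up to your LLM provider of choice")

# 1. Thought decomposer: in practice this lives in the prompt itself, as
#    few-shot examples showing the problem broken into small steps.
STEP_PROMPT = "Here is a partial solution:\n{state}\nPropose one possible next step."

# 2. Thought generator: produce k candidate next thoughts from a state.
def generate_thoughts(state: str, k: int) -> list[str]:
    return llm(STEP_PROMPT.format(state=state), n=k)

# 3. State evaluator: ask the LLM to score its own partial solution.
def evaluate_state(state: str) -> float:
    reply = llm("Rate from 1 to 10 how promising this partial solution is. "
                "Reply with a number only.\n" + state)[0]
    return float(reply.strip())
```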

If this is your first time hearing about tree search algorithms, don't worry; they're fairly intuitive once you see them, and, to understand ToT, we only need a general understanding of two of them—Breadth-First Search (BFS) and Depth-First Search (DFS). Below is an illustration of a BFS tree exploration:

GIF source: Blake Matheny, CC BY-SA 3.0, via Wikimedia Commons


For our purposes, remember that each circle (node) is an intermediary thought that might help an LLM on its journey toward solving a complex problem. Grey represents the next set of thoughts to be considered; black represents the set of explored thoughts. Notice how the thought exploration goes as wide as possible at each level before moving deeper (hence the name "Breadth-First Search"). It goes from a → b → c → d → e → f → g → h. Contrast this with the below illustration of Depth-First Search (DFS):

GIF Source: Mre, CC BY-SA 3.0, via Wikimedia Commons


DFS explores as deep as possible from any given current node before backtracking to the last unexplored fork in the tree and continuing its search. Above, it goes from 1 → 2 → 3 → 4 before depleting its local options and backtracking to 1. It then explores 5 → 6 → 7 → 8 → 9 → 10 (backtracking two more times along the way).
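
If you prefer code to GIFs, here are both traversals in miniature. This is an ordinary tree walk with no LLM involved yet; the only real difference between the two is whether the next node comes from the front (a queue) or the back (a stack) of the to-visit list:

```python
from collections import deque

tree = {                 # each node maps to its children
    "a": ["b", "c", "d"],
    "b": [], "c": ["e", "f"], "d": ["g"],
    "e": [], "f": [], "g": ["h"], "h": [],
}

def bfs(root: str) -> list[str]:
    """Visit level by level: go as wide as possible before going deeper."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()              # FIFO: oldest discovered node first
        order.append(node)
        queue.extend(tree[node])
    return order

def dfs(root: str) -> list[str]:
    """Dive down one branch as far as possible, then backtrack."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()                  # LIFO: newest discovered node first
        order.append(node)
        stack.extend(reversed(tree[node]))  # reversed so the leftmost child pops first
    return order

print(bfs("a"))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print(dfs("a"))  # ['a', 'b', 'c', 'e', 'f', 'd', 'g', 'h']
```

Now that we have a general idea of how an LLM and a tree structure might be fused together, let's go through Yao et al.'s four tree modules in more detail.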

***As a fun side-note: If you want to see an official explanation of BFS and DFS from Stanford University, check out this video from our very own Jose Francisco. He taught algorithms before he became the resident ML Developer Advocate at Deepgram 😊***

Dissecting Thoughts

Because it’s tough (and possibly intractable) to craft a thought-decomposer that’s generalizable to all tasks, Yao et al. tailored individual thought-decomposers to the three tests they gave LLMs. Deconstructing whatever strategy is required to crush crossword puzzles, for example, is not the same as slicing up creative writing into steps, which is not the same as dissecting math problems. What this typically entails is humans providing an LLM with a few decomposed examples of whatever problem type the LLM is attempting to solve.

Idea Generation

Given a current thought, Yao et al. had their LLM generate its next thoughts via two primary strategies: sampling and proposing. Let’s consider how each of these idea generators would apply to planning a trip.

Sampling Ideas

To plan a long-term sojourn via the sampling method, you’d randomly sample from a list of destinations. Suppose you picked Madagascar, Mongolia, and Montenegro. Next, for each location, you’d randomly sample from a list of possible activities. Suppose you picked mountains, museums, and music. For each of these, you’d then sample from a list of possible avenues (specific trails, museums, or concerts). Sampling’s more diverse thought generation is suited for open tasks, like creative writing, where many feasible next thoughts might branch off any given thought, or for planning open-ended travel, like our above example. But sampling wouldn’t work well for designing a time-constrained weekend trip. For that, we’d want proposal-based idea generation.

Proposing Ideas

To plan a short trip via the proposal method, you'd pick a single destination (to start out) sufficiently close to you. Let's say you're in Mexico City, so you pick Merida. Then, you'd propose an activity relevant to Merida. Let's say, touring nearby ruins. You might then propose a specific site, a specific restaurant in the vicinity to eat at afterwards, and a specific hotel nearby. One idea strongly influences the next. After all that, you could propose another location and start the process over again. Proposal-based idea generation lends itself well to constrained thought spaces, like weekend getaways or crossword puzzles, since these types of activities contain limited viable next thoughts from any given thought (e.g., whatever words you've already written down in a crossword puzzle constrain your remaining options, and where you're located limits where you can venture for a weekend trip).
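
In code, the difference between the two generators is small but important: sampling fires the same prompt several times independently, while proposing asks for several distinct candidates in a single call so the model can avoid duplicates. A hedged sketch, reusing the llm placeholder from earlier (the prompt wording is ours):

```python
def sample_thoughts(state: str, k: int) -> list[str]:
    """Sampling: k independent draws from the same prompt. Works well when
    many plausible next thoughts exist (e.g., creative writing plans)."""
    return llm(state + "\nWrite one short next step toward the goal.", n=k)

def propose_thoughts(state: str, k: int) -> list[str]:
    """Proposing: one call that must return k *different* candidates. Works
    well when only a few next thoughts are viable (e.g., crossword entries)."""
    reply = llm(state + f"\nList {k} different possible next steps, one per line.")[0]
    return reply.strip().splitlines()[:k]
```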

Taking Stock (of Thoughts)

Next, LLMs need some way to weigh candidate thoughts' utility toward their overall goal. To do this, Yao et al.'s "state evaluator" (i.e., thought evaluator) estimates how well a thought might carry an LLM toward a solution by, well, asking the LLM to evaluate its own thoughts, acting as a heuristic for the search algorithm that is prompted rather than programmed or learned.

It might sound odd to ask an LLM to evaluate the same thoughts that it just generated, but we too generate thoughts and evaluate those thoughts (with our own thoughts)—all with the same brain—and ample research demonstrates that self-reflection and self-critiquing assist LLMs in their problem-solving capacity. So maybe it’s not so strange.

Yao et al.'s state evaluator employs the following two flavors of self-evaluation, either individually or together (a sketch of the first flavor follows this list):

  1. Value each thought candidate independently, meaning rating each thought (e.g., 1–10 or "sure/likely/impossible"). This doesn't need to be perfect; approximate values are enough to simulate lookahead or to prune away obviously bad paths.

  2. Vote across candidate thoughts, meaning pick the best answer. This is like casting numerous thoughts into a multiple-choice quiz and asking an LLM to guess the best answer. This method is better when there’s no clear answer or when it’s tough to assign a scalar value (e.g., when judging how coherent a written passage sounds, it’s easier to pick the clearest passage from a set of passages than it is to map a number to each passage).
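
Here's a minimal sketch of the first flavor, reusing the llm placeholder from earlier. The sure/likely/impossible labels come from the example above; the prompt wording and score mapping are ours, and the averaging is the repeated-sampling trick discussed just below:

```python
SCORES = {"sure": 1.0, "likely": 0.5, "impossible": 0.0}

def value_state(state: str, n_samples: int = 3) -> float:
    """Value one candidate thought independently: ask the LLM to label it,
    sample a few labels, and average to smooth out noisy judgments."""
    labels = llm("Can this partial solution still reach the goal? "
                 "Answer sure, likely, or impossible.\n" + state, n=n_samples)
    return sum(SCORES.get(label.strip().lower(), 0.0) for label in labels) / n_samples
```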

And if we're willing to sacrifice cost and compute for even more accurate results, we can ask LLMs to self-evaluate each state several times over and use the average. OK, now that we've seen how the state evaluator lets LLMs judge their own "thinking" at any step, the last ingredient needed is some strategy for bouncing from one thought to the next (and possibly back again).

Traversing Trees: Going out on Limbs and Shimmying Straight up Trunks

Though many established algorithms exist for exploring different types of tree data structures, Yao et al. tested their ToT concept with the two fundamental tree traversal algorithms we discussed earlier, Breadth-First Search (BFS) and Depth-First Search (DFS), leaving more advanced tree search algorithms like A* and Monte Carlo Tree Search for later studies.

The key part is that whatever tree search algorithm an LLM uses dictates how the LLM explores thoughts. In Yao et al.'s case, from any current node, BFS explores as widely as possible and DFS explores as deeply as possible.
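
Concretely, a BFS-flavored ToT loop in the spirit of the paper's pseudocode, reusing the hypothetical generate_thoughts and evaluate_state helpers sketched earlier, keeps only the b most promising partial solutions alive at each step:

```python
def tot_bfs(problem: str, steps: int = 3, k: int = 5, b: int = 5) -> str:
    """Breadth-first ToT: expand every surviving state into k candidate
    thoughts, score them all, and keep only the b most promising states."""
    frontier = [problem]
    for _ in range(steps):
        candidates = [state + "\n" + thought
                      for state in frontier
                      for thought in generate_thoughts(state, k)]
        # Prune: anything outside the top b is never explored further.
        frontier = sorted(candidates, key=evaluate_state, reverse=True)[:b]
    return frontier[0]
```

Now that we understand the modules necessary to make a ToT, let's see how Yao et al. tested their ToT paradigm.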

Three Problem-Solving Tests for LLMs

Yao et al. devised three challenges for testing ToT: the "Game of 24" test, a creative writing test, and a crossword puzzle. These three tasks were designed to collectively measure LLMs' ability to use "systematic planning or search" for deductive, mathematical, commonsense, and lexical reasoning. When using standard Input-Output (IO) or CoT prompting (Yao et al.'s two baselines), GPT-4, the only LLM Yao et al. tested, struggled at all three tests, suggesting that these tasks are difficult even for state-of-the-art (SOTA) LLMs. Spoiler alert: ToT prompting performed better than the baselines at all three tasks. We'll get into just how much better it did as we learn about each test below.

Image Source: Yao et al. (Overview of the three tasks that Yao et al. tested ToT with)


Game of 24

This first task tests LLMs' arithmetic reasoning in a way that demands some trial and error. The Game of 24 goes like this: given four numbers, find a set of arithmetic operations over those numbers that equals 24. Here are a couple of examples:

  • Initial numbers: {2 3 5 10}:

  • (2 + 10) x (5 - 3) = (12) x (2) = 24

  • Initial numbers: {1 2 3 4}:

  • (1 x 4) x (2 x 3) = (4) x (6) = 24

Notice how there’s no obvious strategy that would work for every set of four numbers; solving these requires some experimentation.

Baselines

For one baseline, Yao et al. asked GPT-4 to solve the Game of 24 with standard IO prompting (explaining the Game of 24, providing five example problems and their expected outputs with no intermediary steps, and then asking GPT-4 to solve each of the 100 test problems). For the other baseline, they employed CoT prompting (again asking GPT-4 to solve the Game of 24, but this time breaking each of the five example problems into three intermediate steps leading to a correct solution, similar to the above examples). Yao et al. also tested CoT with Self-Consistency (running several parallel chains of thought and picking the answer that showed up most often as the solution).
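
Self-consistency is simple to picture: sample many complete chains and majority-vote on the final answers. A minimal sketch under our usual llm placeholder (prompt wording ours):

```python
from collections import Counter

def cot_self_consistency(problem: str, n_chains: int = 10) -> str:
    """Sample several independent chains of thought, then return whichever
    final answer shows up most often across the chains."""
    chains = llm(problem + "\nThink step by step, then give your final "
                 "answer alone on the last line.", n=n_chains)
    answers = [chain.strip().splitlines()[-1] for chain in chains]
    return Counter(answers).most_common(1)[0][0]
```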

ToT Setup

For their Game of 24 ToT, Yao et al. used BFS with a "propose" thought generator and a "value" thought evaluator. The "propose" prompt asks the LLM to combine two of the remaining numbers with an arithmetic operation, replace those two numbers with the result, and update a "leftover numbers" set; repeating this three times whittles the initial four numbers down to a single value, which should be 24. Here's what that looks like for one of our above examples:

  • Initial numbers: {2 3 5 10}:

  • Idea: (2 + 10) = 12 ; leftover numbers: {3, 5, 12}

  • Idea: (5 - 3) = 2 ; leftover numbers: {2, 12}

  • Idea: (2 x 12) = 24; leftover numbers: { }

Rather than hardcoding anything, Yao et al. simply gave GPT-4 dissected examples like the one above, hoping the LLM would learn to decompose on its own. At each step, the LLM kept either the single best or the top five candidate thoughts (Yao et al. tested both breadth settings). Using the above example, (2 + 10), (2 x 10), (10 - 2), etc. could all be candidate thoughts at the first step.

Then, the evaluation step asks the LLM whether a candidate thought appears "sure," "maybe," or "impossible" to reach 24. Ideally, LLMs should be able to "look ahead" a few steps to see when a proposed path leads somewhere too big or too small, the way we can readily see that 100, 100, 100, 100 won't amount to 24, no matter how hard we try. Though Yao et al. didn't mention it, you'd expect more early thoughts to fall under the "maybe" category than "sure" or "impossible," with the "sures" and "impossibles" increasing as tree exploration deepens. Here's what the thought generation and evaluation stages look like combined:

Image Source: Yao et al. (ToT Setup for the “Game of 24” Task; a = thought generation via left-over-number logic; b = thought self-evaluation (“sure/maybe/impossible”))
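
In code, those two prompts might look something like the sketch below. The wording is our approximation (Yao et al. published their exact prompts alongside the paper), and it reuses our llm placeholder:

```python
PROPOSE = """Remaining numbers: {numbers}
Combine two of them with one operation (+ - * /) and report the new
leftover numbers, e.g., "2 + 10 = 12 (leftover: 3 5 12)".
One possible next step:"""

VALUE = """Proposed step: {step}
Can the leftover numbers still reach 24? Answer sure, maybe, or impossible:"""

def game24_step(numbers: str, k: int = 5) -> list[tuple[str, str]]:
    """One ToT step for the Game of 24: propose k candidate moves, then have
    the LLM judge each move's chances of eventually reaching 24."""
    proposals = llm(PROPOSE.format(numbers=numbers), n=k)
    verdicts = [llm(VALUE.format(step=p))[0] for p in proposals]
    return list(zip(proposals, verdicts))
```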


Results

In the Game of 24 test, IO, CoT, and CoT with Self-Consistency prompting achieved 7.3%, 4%, and 9% solve rates, respectively. Not great, to say the least. ToT, however, with a breadth of one (keeping only the single most promising thought at each step), achieved a 45% solve rate; widening to a breadth of five boosted the solve rate all the way up to 74%.

Image Source: Yao et al. (“Game of 24” results)


Also impressive: ToT achieved superior performance to IO and CoT prompting while visiting fewer nodes (i.e., thinking fewer thoughts).

Image Source: Yao et al. (“Game of 24” success rate compared to nodes visited (i.e., thoughts thought))


Why is CoT so bad (in comparison to ToT) at the Game of 24? Yao et al. discovered through error analysis that slightly more than 60% of CoT candidate “thoughts” failed after their first step (meaning they set out on the wrong path with no chance of finding their way because chains don’t allow “turning around” to try a different fork in the path). This implies that something about a tree structure’s ability to explore multiple thought paths and back up when necessary grants LLMs superior reasoning abilities (in comparison to chain structures).

Image Source: Yao et al. (“Game of 24” samples failed at each step. Note how CoT often gets it wrong from the start.)


Creative Writing

Yao et al. also devised a creative writing test to check how ToT affects LLMs' creativity (not the first attempt to measure it, and likely not the last). The test goes like this: given four randomly generated sentences, ask an LLM to craft a single coherent passage composed of four paragraphs, where the last sentence of each paragraph must be one of the four input sentences, in the order they were given. Because the sentences came from randomwordgenerator.com and lacked any accompanying ground-truth text, Yao et al. minimized the possibility that GPT-4 had trained on these sentences and/or their contexts.

Baselines

Yao et al. again used IO and CoT for the creative writing task baselines. For the IO baseline, they explained the creative writing task to GPT-4, gave GPT-4 the four random sentences, and then asked GPT-4 to generate a coherent set of paragraphs. For the CoT baseline, they asked GPT-4 to first generate a plan (something like an outline) and then generate a coherent set of paragraphs from that plan.

ToT Setup

GPT-4 first generated five different plans for threading the four input sentences into a coherent passage, then voted on which of the five plans was best (this "vote across" evaluation is like casting options into a multiple-choice quiz). Following the winning plan, GPT-4 then generated five candidate passages (each a set of four paragraphs ending in the four input sentences) and voted again, returning the passage it deemed most coherent.
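
That two-stage, generate-then-vote pipeline is short enough to sketch end to end (still leaning on our hypothetical llm helper; the prompts are ours):

```python
from collections import Counter

def vote(candidates: list[str], question: str, n_votes: int = 5) -> str:
    """Show the LLM all candidates at once and have it vote several times;
    return the candidate that collects the most votes."""
    ballot = "\n\n".join(f"Choice {i}:\n{c}" for i, c in enumerate(candidates))
    replies = llm(question + "\n" + ballot +
                  "\nAnswer with the number of the best choice only.", n=n_votes)
    tally = Counter(int(r.strip()) for r in replies)  # assumes clean numeric replies
    return candidates[tally.most_common(1)[0][0]]

def creative_writing_tot(sentences: list[str]) -> str:
    task = ("Write a coherent passage of 4 short paragraphs whose paragraphs "
            "end, in order, with these sentences:\n" + "\n".join(sentences))
    plans = llm("Make a brief writing plan for this task.\n" + task, n=5)
    plan = vote(plans, "Which plan will yield the most coherent passage?")
    passages = llm(task + "\nFollow this plan:\n" + plan, n=5)
    return vote(passages, "Which passage is the most coherent?")
```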

Image Source: Yao et al. (ToT Setup for the “Creative Writing Task.” a = four input sentences that the LLM must generate a coherent passage from; b = five LLM-generated plans; c = five LLM-generated votes, the highest voted plan used to write the passage)


Results

To gauge the coherence of crafted paragraphs, Yao et al. employed two methods:

  1. Machine Judges: asking GPT-4 to score the passages five times each, averaging these together

  2. Human Judges: having human judges, blind to the study, compare 100 randomly ordered pairs of CoT- and ToT-generated passages and judge which of each pair was better

Image Source: Yao et al. “Creative Writing” task coherency scores as judged by machines (left) and by humans (right).


Machine-generated scores suggest that ToT isn't as advantageous for creative writing as it is for the Game of 24. Machines scored ToT at 7.56, CoT at 6.93, and IO at 6.19. Iterative refinement (asking the LLM to evaluate and improve its own output several times) brought IO performance close to ToT's, which calls into question whether it's the ToT setup or the extra compute (accompanying both iterative refinement and ToT) that helps GPT-4 at the creative writing task.

Similarly, human judges didn't find ToT exceedingly better than CoT: they rated ToT-generated passages more coherent than CoT-generated passages for 41 of 100 pairs and CoT more coherent than ToT for 21 of 100 pairs, finding the remaining 38 pairs similarly coherent.

Crossword Puzzles

Since the “Game of 24” and “Creative Writing” tasks only require GPT-4 to traverse three or fewer thought steps before spitting out its final answers, Yao et al. wanted a way to test deeper tree explorations; five-by-five mini crosswords serve this purpose.

For their final test, Yao et al. split mini crossword puzzles scraped from GooBix.com into a test set (of 20 games) and a train set (of 5 similar games used for prompting), each game providing five vertical and five horizontal clues as inputs and expecting a 5 x 5 grid filled with letters (making up words that satisfy the 10 clues) as output. Yao et al. measured success as the proportion of correctly answered (a scoring sketch follows this list):

  1. letters (max of 25 per game)

  2. words (max of 10 per game)

  3. games won
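
Scoring is mechanical. Here's a sketch, assuming each game's answer key is ten 5-letter words (five across, then five down) over the 5 x 5 grid:

```python
def score_game(pred: list[str], gold: list[str]) -> tuple[float, float, bool]:
    """Return (letter accuracy, word accuracy, game solved?) for one game.
    pred/gold: ten 5-letter words, five across followed by five down."""
    # The five across words alone cover all 25 grid cells exactly once.
    letters = sum(p == g
                  for pred_word, gold_word in zip(pred[:5], gold[:5])
                  for p, g in zip(pred_word, gold_word))
    words = sum(pw == gw for pw, gw in zip(pred, gold))
    return letters / 25, words / 10, words == 10
```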

Baselines

As the crossword puzzle baselines, IO prompting enjoyed five pairs of example crossword puzzles (just clues and the words satisfying those clues), while CoT prompts had the same plus intermediate words ordered h1 (for horizontal_1) through h5 and v1 (for vertical_1) through v5, nudging GPT-4 to solve crossword puzzles one clue at a time. Both baselines averaged the results of 10 samples.

ToT Setup

Using DFS, GPT-4 was asked to generate five candidates for what the next word should be and where it ought to go. The LLM was then asked to rate how confident it is in each candidate word (e.g., "sure," "low," "high") and to order the candidates based on that evaluation.

From any state, Yao et al. translated the LLMs’ previous thoughts into letter constraints. At any point, if the LLM decides that, given the letter constraints imposed by past choices, no possible word can satisfy some clue, the LLM can prune that subtree and backtrack to the parent node to explore a sibling branch (an unexplored thought).
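
A hedged sketch of that DFS loop, reusing the propose_thoughts and llm helpers from earlier (the board bookkeeping and prompt wording are elided or ours):

```python
def crossword_dfs(state: str, depth: int = 0) -> "str | None":
    """Depth-first ToT: commit to the most confident candidate first; if the
    LLM judges a clue unfillable under current letters, prune and backtrack."""
    if depth == 10:                                 # all 10 clues filled in
        return state
    for candidate in propose_thoughts(state, k=5):  # word + position candidates
        next_state = state + "\n" + candidate
        verdict = llm("Given the letter constraints so far, can every "
                      "remaining clue still be filled? Answer yes or no.\n"
                      + next_state)[0]
        if verdict.strip().lower().startswith("no"):
            continue                                # prune this subtree
        solution = crossword_dfs(next_state, depth + 1)
        if solution is not None:
            return solution
    return None                                     # dead end: backtrack
```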

Image Source: Yao et al. (ToT setup for crosswords. a = proposing and ordering candidate solutions; b = evaluating candidate solutions based on previous choices. If a subtree is deemed infeasible, it’s skipped)


Results

IO and CoT solved around 16% of the words, compared to ToT's 60% word solve rate, consistent with what we'd expect since IO and CoT can't explore different options, backtrack, or "change their mind." Intuitively, it seems that the backtracking in tree search ought to be useful for solving crosswords (most of us prefer a pencil and eraser to a pen when solving them), but ToT only solved 4 of 20 crosswords outright. What's with ToT's less-than-stellar game solve rate?

One contributing factor is that mini crosswords often contain rare words that GPT-4 might not have encountered during its training (out-of-vocabulary words). In such cases, ToT's state evaluator may have erroneously judged a rare word as impossible and hacked off the branch containing that word.

Investigating this possibility, Yao et al. ablated away the pruning mechanism, but this resulted in poorer performance, leading them to conclude that a better decision-making process for pruning was in order. Yao et al. also ablated the ability to backtrack by making a DFS variant that always processes the clue it deems most promising, overwriting past choices when necessary. This approach fared worse, suggesting backtracking helps ToT (at crossword puzzles at least).

Image Source: Yao et al. (Crossword Puzzle Results. -prune and -backtrack = ablation studies)


ToT’s Shortcomings and Contributions

Though helpful for some types of problems, ToT prompting is not without its downsides. First, Yao et al. admit that ToT’s intentional, deliberate approach costs more than IO or CoT prompting since ToT often requires many LLM queries to solve a single problem (remember, ToT uses LLM calls to generate candidate thoughts and to evaluate each generated thought—potentially across many branches of a tree—burning through tokens fast).

Machine learning researcher and engineer Yannic Kilcher points out another important shortcoming of ToT: ToT (at least as implemented by Yao et al.) requires us to hold an LLM's hand quite a bit, spelling out examples that show how to decompose specific types of problems (much as CoT requires) and tailoring idea generation, evaluation, and tree search toward specific problems. Ideally, we'd like prompt engineering techniques with fewer training wheels. Despite these shortcomings, ToT still contributes a lot.

An obvious advantage that ToT enjoys over CoT, for example, is gut-checking different thought options before committing to them by simulating "planning, lookahead, or backtracking," akin to how we humans rule out certain options to limit our search space rather than exploring every possible one. Additionally, ToT offers an interpretable approach to LLM problem-solving, since we can literally read each LLM "thinking" step instead of combing through non-intuitive embeddings and barely walking away with a sense of why an LLM made the decision it did.

Its biggest contribution, though, is that ToT's approach—amalgamating LLMs with established classical data structures and algorithms—isn't limited to tree search and will likely inspire people to experiment far and wide, fusing LLMs with more classical algorithms in novel ways, which is precisely what our next {X}-of-Thought article will be about—combining graph theory with LLMs, so stay tuned!
