Getting Large Language Models (LLMs) to do what we want can be tough, especially for mathy, symbolic, or commonsense reasoning tasks. Chain-of-Thought (CoT) prompting—a prompt engineering technique that encourages LLMs to decompose large problems into smaller chunks—helped LLMs improve at these types of complex tasks so much that it spawned a slew of spinoffs seeking to improve on the original. But CoT and its siblings suffer from a glaring flaw—a lot hinges on that first thought. If it’s off-kilter, so is the rest of the chain.

Sometimes we (humans) need to explore several divergent threads from the start, encounter a few dead-end thoughts, rule those out, and retrace our steps back to some thought-fork we previously took before forging ahead down untrodden paths until, eventually, we find a solution. This type of thinking resembles a tree more than a chain. Tree-of-Thoughts (ToT)—a riff on CoT—is a prompt engineering approach that evokes this branching-out style of thinking in LLMs so that they can transcend left-to-right generation and tackle problems in a more human-like, trial-and-error fashion.

Image Source: Yao et al. ToT compared with standard Input-Output (IO) and CoT prompting; Squares = LLM “thoughts” (i.e., intermediate responses)


Before we get into ToT prompting, let's clear up a minor point. Because the "X-of-Thought" naming convention offers a rather limited namespace, two papers describing similar ToT prompting paradigms hit arxiv.org within a few days of each other. In the first paper, "Large Language Model Guided Tree-of-Thought," Jieyi Long, CTO and co-founder of Theta Labs, describes a rules-based control mechanism that systematically guides LLMs to solve Sudoku puzzles by treating intermediate puzzle states as "thought" nodes in a search tree. In the second paper, "Tree of Thoughts: Deliberate Problem Solving with Large Language Models," Princeton University (go Tigers!) and Google DeepMind researchers Yao et al. also fuse tree search algorithms with LLMs.

Both approaches share the same general objective: to boost LLMs' problem-solving capacities by letting them explore multiple reasoning paths via trees. And both augment LLMs with similar (albeit distinct) modules to accomplish this goal. We'll focus on Yao et al.'s work, however, because they tested their variant on three different tasks (a mathematics task, a creative writing task, and a crossword puzzle task), whereas Long only tested his approach on one task (Sudoku), and because Yao et al.'s paper is more widely cited. With that out of the way, let's delve into ToT.

Modeling Thinking as a Tree

ToT is supposed to mimic something like human brainstorming—creating and considering diverse ideas. But, given that LLMs are inherently autoregressive (predicting any next token based on all that token’s preceding tokens), how could they possibly rise above left-to-right thinking?

To nudge LLMs away from linear thinking, Yao et al. drew inspiration from research on human cognition suggesting that we think in two "dual-process" modes: a fast, intuitive, largely unconscious mode ("System 1") and a slow, effortful, conscious mode ("System 2"). LLMs' default sequential generation (standard IO prompting) resembles System 1, but LLMs currently lack the more deliberate System 2. Yao et al. used Newell and Simon's work from the 1950s—which viewed human problem solving as a tree search over a set of possibilities—as a blueprint for simulating System 2 thinking in LLMs.

What would a more arboreal (read: tree-like) LLM even look like, though? Let's sketch it out. Suppose we ask an LLM to solve a complex problem. On Yao et al.'s tree, a "thought" is an LLM response (typically a few sentences at most) serving as an intermediate step toward solving the larger problem. Each thought is a node. Starting from the top of the tree, let's say an LLM generates three candidate thoughts (nodes). From each of these upper three nodes (parents), our LLM could branch out further, creating lower nodes (children). A branch extending from a parent node to a child node represents building on the parent thought to produce the child thought, and so on.
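
To make this picture concrete, here is a minimal sketch of a thought node in Python. This is our own illustration, not code from the paper; the value field anticipates the state evaluator we'll meet shortly.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Thought:
    """One node in a tree of thoughts: a short, intermediate LLM response."""
    text: str                                   # the partial solution so far
    parent: Optional["Thought"] = None
    children: list["Thought"] = field(default_factory=list)
    value: float = 0.0                          # filled in later by a state evaluator

    def expand(self, candidate_texts: list[str]) -> list["Thought"]:
        """Branch out: attach each candidate next thought as a child node."""
        self.children = [Thought(text=t, parent=self) for t in candidate_texts]
        return self.children
```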

Using this rough cognition-as-a-tree model, Yao et al. argue that standard LLM problem-solving approaches fail on two fronts:

  1. When LLMs are at some thought (node in a tree), they fail to expand that thought into several potential directions (the way we might consider a few different routes to bypass a traffic jam from a given point).

  2. LLMs don’t take advantage of global shortcuts like lookahead and backtracking. We don’t need to explore every possible solution in a crossword puzzle, for example, because we can eliminate words too long to fit in too few spaces or consider how a word we have in mind might affect other words we already wrote down (and vice versa). Ideally, LLMs ought to do the same.

Yao et al. hypothesized that tree structures might remedy these problems by giving LLMs a system that, given a current thought (node), generates numerous possible next thoughts (instead of just blindly pouncing on the most likely next thought) and then weighs these candidate next thoughts’ potential against a global goal, retracing or skipping ahead steps as needed. This all seems reasonable enough to try, but how can an LLM be reworked into a tree structure?

Crafting an LLM Tree

Yao et al. used four modules to give LLMs tree-like properties (sketched in code after this list):

  1. Thought decomposer: breaks large problems into smaller steps (also present in CoT)

  2. Thought generator: given some current thought, creates a set of different candidate next steps for solving some problem—the more divergent these are, the better

  3. State evaluator: assesses the merits of candidate thoughts

  4. Search algorithm: systematically explores a tree of thoughts (e.g., via Breadth-First Search, Depth-First Search, A*, etc.)
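
Here is a rough, hypothetical sketch of the first three modules as Python functions. The prompts, function names, and the llm placeholder are all ours, not Yao et al.'s; the fourth module, search, gets its own sketches below.

```python
def llm(prompt: str, n: int = 1) -> list[str]:
    """Placeholder for any text-in/text-out completion API; returns n replies."""
    raise NotImplementedError("wire this up to your LLM provider of choice")

# 1. Thought decomposer: in practice this lives in the prompt itself, as
#    few-shot examples showing the problem broken into small steps.
STEP_PROMPT = "Here is a partial solution:\n{state}\nPropose one possible next step."

# 2. Thought generator: produce k candidate next thoughts from a state.
def generate_thoughts(state: str, k: int) -> list[str]:
    return llm(STEP_PROMPT.format(state=state), n=k)

# 3. State evaluator: ask the LLM to score its own partial solution.
def evaluate_state(state: str) -> float:
    reply = llm("Rate from 1 to 10 how promising this partial solution is. "
                "Reply with a number only.\n" + state)[0]
    return float(reply.strip())
```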

If this is your first time hearing about tree search algorithms, don't worry; they're fairly intuitive once you see them, and, to understand ToT, we only need a general understanding of two of them—Breadth-First Search (BFS) and Depth-First Search (DFS). Below is an illustration of a BFS tree exploration:

GIF source: Blake Matheny, CC BY-SA 3.0, via Wikimedia Commons


For our purposes, remember that each circle (node) is an intermediary thought that might help an LLM on its journey toward solving a complex problem. Grey represents the next set of thoughts to be considered; black represents the set of explored thoughts. Notice how the thought exploration goes as wide as possible at each level before moving deeper (hence the name "Breadth-First Search"). It goes from a → b → c → d → e → f → g → h. Contrast this with the below illustration of Depth-First Search (DFS):

GIF Source: Mre, CC BY-SA 3.0, via Wikimedia Commons


DFS explores as deep as possible from any given current node before backtracking to the last unexplored fork in the tree and continuing its search. Above, it goes from 1 → 2 → 3 → 4 before depleting its local options and backtracking to 1. It then explores 5 → 6 → 7 → 8 → 9 → 10 (backtracking two more times along the way).
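
If you prefer code to GIFs, here are both traversals in miniature. This is an ordinary tree walk with no LLM involved yet; the only real difference between the two is whether the next node comes from the front (a queue) or the back (a stack) of the to-visit list:

```python
from collections import deque

tree = {                 # each node maps to its children
    "a": ["b", "c", "d"],
    "b": [], "c": ["e", "f"], "d": ["g"],
    "e": [], "f": [], "g": ["h"], "h": [],
}

def bfs(root: str) -> list[str]:
    """Visit level by level: go as wide as possible before going deeper."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()              # FIFO: oldest discovered node first
        order.append(node)
        queue.extend(tree[node])
    return order

def dfs(root: str) -> list[str]:
    """Dive down one branch as far as possible, then backtrack."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()                  # LIFO: newest discovered node first
        order.append(node)
        stack.extend(reversed(tree[node]))  # reversed so the leftmost child pops first
    return order

print(bfs("a"))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print(dfs("a"))  # ['a', 'b', 'c', 'e', 'f', 'd', 'g', 'h']
```

Now that we have a general idea of how an LLM and a tree structure might be fused together, let's go through Yao et al.'s four tree modules in more detail.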

***As a fun side-note: If you want to see an official explanation of BFS and DFS from Stanford University, check out this video from our very own Jose Francisco. He taught algorithms before he became the resident ML Developer Advocate at Deepgram 😊***

Dissecting Thoughts

Because it’s tough (and possibly intractable) to craft a thought-decomposer that’s generalizable to all tasks, Yao et al. tailored individual thought-decomposers to the three tests they gave LLMs. Deconstructing whatever strategy is required to crush crossword puzzles, for example, is not the same as slicing up creative writing into steps, which is not the same as dissecting math problems. What this typically entails is humans providing an LLM with a few decomposed examples of whatever problem type the LLM is attempting to solve.

Idea Generation

Given a current thought, Yao et al. had their LLM generate its next thoughts via two primary strategies: sampling and proposing. Let’s consider how each of these idea generators would apply to planning a trip.

Sampling Ideas

To plan a long-term sojourn via the sampling method, you’d randomly sample from a list of destinations. Suppose you picked Madagascar, Mongolia, and Montenegro. Next, for each location, you’d randomly sample from a list of possible activities. Suppose you picked mountains, museums, and music. For each of these, you’d then sample from a list of possible avenues (specific trails, museums, or concerts). Sampling’s more diverse thought generation is suited for open tasks, like creative writing, where many feasible next thoughts might branch off any given thought, or for planning open-ended travel, like our above example. But sampling wouldn’t work well for designing a time-constrained weekend trip. For that, we’d want proposal-based idea generation.

Proposing Ideas

To plan a short trip via the proposal method, you'd pick a single destination (to start out) sufficiently close to you. Let's say you're in Mexico City, so you pick Merida. Then, you'd propose an activity relevant to Merida. Let's say, touring nearby ruins. You might then propose a specific site, a specific restaurant in the vicinity to eat at afterwards, and a specific hotel nearby. One idea strongly influences the next. After all that, you could propose another location and start the process over again. Proposal-based idea generation lends itself well to constrained thought spaces, like weekend getaways or crossword puzzles, since these types of activities contain limited viable next thoughts from any given thought (e.g., whatever words you've already written down in a crossword puzzle constrain your remaining options, and where you're located limits where you can venture for a weekend trip).
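
In code, the difference between the two generators is small but important: sampling fires the same prompt several times independently, while proposing asks for several distinct candidates in a single call so the model can avoid duplicates. A hedged sketch, reusing the llm placeholder from earlier (the prompt wording is ours):

```python
def sample_thoughts(state: str, k: int) -> list[str]:
    """Sampling: k independent draws from the same prompt. Works well when
    many plausible next thoughts exist (e.g., creative writing plans)."""
    return llm(state + "\nWrite one short next step toward the goal.", n=k)

def propose_thoughts(state: str, k: int) -> list[str]:
    """Proposing: one call that must return k *different* candidates. Works
    well when only a few next thoughts are viable (e.g., crossword entries)."""
    reply = llm(state + f"\nList {k} different possible next steps, one per line.")[0]
    return reply.strip().splitlines()[:k]
```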

Taking Stock (of Thoughts)

Next, LLMs need some way to weigh candidate thoughts' utility toward their overall goal. To do this, Yao et al.'s "state evaluator" (i.e., thought evaluator) estimates how well a thought might carry an LLM toward a solution by, well, asking the LLM to evaluate its own thoughts, acting as a heuristic for the search algorithm that is prompted rather than programmed or learned.

It might sound odd to ask an LLM to evaluate the same thoughts that it just generated, but we too generate thoughts and evaluate those thoughts (with our own thoughts)—all with the same brain—and ample research demonstrates that self-reflection and self-critiquing assist LLMs in their problem-solving capacity. So maybe it’s not so strange.

Yao et al.'s state evaluator employs the following two flavors of self-evaluation, either individually or together (a sketch of the first flavor follows this list):

  1. Value each thought candidate independently, meaning rating each thought (e.g., 1–10 or "sure/likely/impossible"). This doesn't need to be perfect; approximate values are enough to simulate lookahead or to prune away obviously bad paths.

  2. Vote across candidate thoughts, meaning pick the best answer. This is like casting numerous thoughts into a multiple-choice quiz and asking an LLM to guess the best answer. This method is better when there’s no clear answer or when it’s tough to assign a scalar value (e.g., when judging how coherent a written passage sounds, it’s easier to pick the clearest passage from a set of passages than it is to map a number to each passage).
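
Here's a minimal sketch of the first flavor, reusing the llm placeholder from earlier. The sure/likely/impossible labels come from the example above; the prompt wording and score mapping are ours, and the averaging is the repeated-sampling trick discussed just below:

```python
SCORES = {"sure": 1.0, "likely": 0.5, "impossible": 0.0}

def value_state(state: str, n_samples: int = 3) -> float:
    """Value one candidate thought independently: ask the LLM to label it,
    sample a few labels, and average to smooth out noisy judgments."""
    labels = llm("Can this partial solution still reach the goal? "
                 "Answer sure, likely, or impossible.\n" + state, n=n_samples)
    return sum(SCORES.get(label.strip().lower(), 0.0) for label in labels) / n_samples
```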

And if we're willing to sacrifice cost and compute for even more accurate results, we can ask LLMs to self-evaluate each state several times over and use the average. OK, now that we've seen how the state evaluator lets LLMs judge their own "thinking" at any step, the last ingredient needed is some strategy for bouncing from one thought to the next (and possibly back again).

Traversing Trees: Going out on Limbs and Shimmying Straight up Trunks

Though many established algorithms exist for exploring different types of tree data structures, Yao et al. tested their ToT concept with the two fundamental tree traversal algorithms we discussed earlier, Breadth-First Search (BFS) and Depth-First Search (DFS), leaving more advanced tree search algorithms like A* and Monte Carlo Tree Search for later studies.

The key part is that whatever tree search algorithm an LLM uses dictates how the LLM explores thoughts. In Yao et al.'s case, from any current node, BFS explores as widely as possible and DFS explores as deeply as possible.
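
Concretely, a BFS-flavored ToT loop in the spirit of the paper's pseudocode, reusing the hypothetical generate_thoughts and evaluate_state helpers sketched earlier, keeps only the b most promising partial solutions alive at each step:

```python
def tot_bfs(problem: str, steps: int = 3, k: int = 5, b: int = 5) -> str:
    """Breadth-first ToT: expand every surviving state into k candidate
    thoughts, score them all, and keep only the b most promising states."""
    frontier = [problem]
    for _ in range(steps):
        candidates = [state + "\n" + thought
                      for state in frontier
                      for thought in generate_thoughts(state, k)]
        # Prune: anything outside the top b is never explored further.
        frontier = sorted(candidates, key=evaluate_state, reverse=True)[:b]
    return frontier[0]
```

Now that we understand the modules necessary to make a ToT, let's see how Yao et al. tested their ToT paradigm.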

Three Problem-Solving Tests for LLMs

Yao et al. devised three challenges for testing ToT: the "Game of 24" test, a creative writing test, and a crossword puzzle. These three tasks were designed to collectively measure LLMs' ability to use "systematic planning or search" for deductive, mathematical, commonsense, and lexical reasoning. When using standard Input-Output (IO) or CoT prompting (Yao et al.'s two baselines), GPT-4, the only LLM Yao et al. tested, struggled at all three tests, suggesting that these tasks are difficult even for state-of-the-art (SOTA) LLMs. Spoiler alert: ToT prompting performed better than the baselines at all three tasks. We'll get into just how much better it did as we learn about each test below.

Image Source: Yao et al. (Overview of the three tasks that Yao et al. tested ToT with)


Game of 24

This first task tests LLMs' arithmetic reasoning in a way that demands some trial and error. The Game of 24 goes like this: given four numbers, find a set of arithmetic operations over those numbers that equals 24. Here are a couple of examples:

  • Initial numbers: {2 3 5 10}:

  • (2 + 10) x (5 - 3) = (12) x (2) = 24

  • Initial numbers: {1 2 3 4}:

  • (1 x 4) x (2 x 3) = (4) x (6) = 24

Notice how there’s no obvious strategy that would work for every set of four numbers; solving these requires some experimentation.

Baselines

For one baseline, Yao et al. asked GPT-4 to solve the Game of 24 with standard IO prompting (explaining the Game of 24, providing five example problems and their expected outputs with no intermediary steps, and then asking GPT-4 to solve each of the 100 test problems). For the other baseline, they employed CoT prompting (again asking GPT-4 to solve the Game of 24, but this time breaking each of the five example problems into three intermediate steps leading to a correct solution, similar to the above examples). Yao et al. also tested CoT with Self-Consistency (running several parallel chains of thought and picking the answer that showed up most often as the solution).
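
Self-consistency is simple to picture: sample many complete chains and majority-vote on the final answers. A minimal sketch under our usual llm placeholder (prompt wording ours):

```python
from collections import Counter

def cot_self_consistency(problem: str, n_chains: int = 10) -> str:
    """Sample several independent chains of thought, then return whichever
    final answer shows up most often across the chains."""
    chains = llm(problem + "\nThink step by step, then give your final "
                 "answer alone on the last line.", n=n_chains)
    answers = [chain.strip().splitlines()[-1] for chain in chains]
    return Counter(answers).most_common(1)[0][0]
```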

ToT Setup

For their Game of 24 ToT, Yao et al. used BFS with a "propose" thought generator and a "value" thought evaluator. The "propose" prompt asks the LLM to combine two of the remaining numbers with an arithmetic operation, replace those two numbers with the result, and update a "leftover numbers" set; repeating this three times whittles the initial four numbers down to a single value, which should be 24. Here's what that looks like for one of our above examples:

  • Initial numbers: {2 3 5 10}:

  • Idea: (2 + 10) = 12 ; leftover numbers: {3, 5, 12}

  • Idea: (5 - 3) = 2 ; leftover numbers: {2, 12}

  • Idea: (2 x 12) = 24; leftover numbers: { }

Rather than hardcoding anything, Yao et al. simply gave GPT-4 dissected examples like the one above, hoping the LLM would learn to decompose on its own. At each step, the LLM kept either the single best or the top five candidate thoughts (Yao et al. tested both breadth settings). Using the above example, (2 + 10), (2 x 10), (10 - 2), etc. could all be candidate thoughts at the first step.

Then, the evaluation step asks the LLM whether a candidate thought appears "sure," "maybe," or "impossible" to reach 24. Ideally, LLMs should be able to "look ahead" a few steps to see when a proposed path leads somewhere too big or too small, the way we can readily see that 100, 100, 100, 100 won't amount to 24, no matter how hard we try. Though Yao et al. didn't mention it, you'd expect more early thoughts to fall under the "maybe" category than "sure" or "impossible," with the "sures" and "impossibles" increasing as tree exploration deepens. Here's what the thought generation and evaluation stages look like combined:

Image Source: Yao et al. (ToT Setup for the “Game of 24” Task; a = thought generation via left-over-number logic; b = thought self-evaluation (“sure/maybe/impossible”))
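
In code, those two prompts might look something like the sketch below. The wording is our approximation (Yao et al. published their exact prompts alongside the paper), and it reuses our llm placeholder:

```python
PROPOSE = """Remaining numbers: {numbers}
Combine two of them with one operation (+ - * /) and report the new
leftover numbers, e.g., "2 + 10 = 12 (leftover: 3 5 12)".
One possible next step:"""

VALUE = """Proposed step: {step}
Can the leftover numbers still reach 24? Answer sure, maybe, or impossible:"""

def game24_step(numbers: str, k: int = 5) -> list[tuple[str, str]]:
    """One ToT step for the Game of 24: propose k candidate moves, then have
    the LLM judge each move's chances of eventually reaching 24."""
    proposals = llm(PROPOSE.format(numbers=numbers), n=k)
    verdicts = [llm(VALUE.format(step=p))[0] for p in proposals]
    return list(zip(proposals, verdicts))
```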


Results

In the Game of 24 test, IO, CoT, and CoT with Self-Consistency prompting achieved 7.3%, 4%, and 9% solve rates, respectively. Not great, to say the least. ToT, however, with a breadth of one (keeping only the single most promising thought at each step), achieved a 45% solve rate; widening to a breadth of five boosted the solve rate all the way up to 74%.

Image Source: Yao et al. (“Game of 24” results)


Also impressive: ToT achieved superior performance to IO and CoT prompting while visiting fewer nodes (i.e., thinking fewer thoughts).

Image Source: Yao et al. (“Game of 24” success rate compared to nodes visited (i.e., thoughts thought))


Why is CoT so bad (in comparison to ToT) at the Game of 24? Yao et al. discovered through error analysis that slightly more than 60% of CoT candidate “thoughts” failed after their first step (meaning they set out on the wrong path with no chance of finding their way because chains don’t allow “turning around” to try a different fork in the path). This implies that something about a tree structure’s ability to explore multiple thought paths and back up when necessary grants LLMs superior reasoning abilities (in comparison to chain structures).

Image Source: Yao et al. (“Game of 24” samples failed at each step. Note how CoT often gets it wrong from the start.)


Creative Writing

Yao et al. also devised a creative writing test to check how ToT affects LLMs' creativity (not the first attempt to measure it, and likely not the last). The test goes like this: given four randomly generated sentences, ask an LLM to craft a single coherent passage composed of four paragraphs, where the last sentence of each paragraph must be one of the four input sentences, in the order they were given. Because the sentences came from randomwordgenerator.com and lacked any accompanying ground-truth text, Yao et al. minimized the possibility that GPT-4 had trained on these sentences and/or their contexts.

Baselines

Yao et al. again used IO and CoT for the creative writing task baselines. For the IO baseline, they explained the creative writing task to GPT-4, gave GPT-4 the four random sentences, and then asked GPT-4 to generate a coherent set of paragraphs. For the CoT baseline, they asked GPT-4 to first generate a plan (something like an outline) and then generate a coherent set of paragraphs from that plan.

ToT Setup

GPT-4 first generated five different plans for threading the four input sentences into a coherent passage, then voted on which of the five plans was best (this "vote across" evaluation is like casting options into a multiple-choice quiz). Following the winning plan, GPT-4 then generated five candidate passages (each a set of four paragraphs ending in the four input sentences) and voted again, returning the passage it deemed most coherent.
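
That two-stage, generate-then-vote pipeline is short enough to sketch end to end (still leaning on our hypothetical llm helper; the prompts are ours):

```python
from collections import Counter

def vote(candidates: list[str], question: str, n_votes: int = 5) -> str:
    """Show the LLM all candidates at once and have it vote several times;
    return the candidate that collects the most votes."""
    ballot = "\n\n".join(f"Choice {i}:\n{c}" for i, c in enumerate(candidates))
    replies = llm(question + "\n" + ballot +
                  "\nAnswer with the number of the best choice only.", n=n_votes)
    tally = Counter(int(r.strip()) for r in replies)  # assumes clean numeric replies
    return candidates[tally.most_common(1)[0][0]]

def creative_writing_tot(sentences: list[str]) -> str:
    task = ("Write a coherent passage of 4 short paragraphs whose paragraphs "
            "end, in order, with these sentences:\n" + "\n".join(sentences))
    plans = llm("Make a brief writing plan for this task.\n" + task, n=5)
    plan = vote(plans, "Which plan will yield the most coherent passage?")
    passages = llm(task + "\nFollow this plan:\n" + plan, n=5)
    return vote(passages, "Which passage is the most coherent?")
```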

Image Source: Yao et al. (ToT Setup for the “Creative Writing Task.” a = four input sentences that the LLM must generate a coherent passage from; b = five LLM-generated plans; c = five LLM-generated votes, the highest voted plan used to write the passage)


Results

To gauge the coherence of crafted paragraphs, Yao et al. employed two methods:

  1. Machine Judges: asking GPT-4 to score the passages five times each, averaging these together

  2. Human Judges: having human judges, blind to the study, compare 100 randomly ordered pairs of CoT- and ToT-generated passages and judge which of each pair was better

Image Source: Yao et al. “Creative Writing” task coherency scores as judged by machines (left) and by humans (right).


Machine-generated scores suggest that ToT isn't as advantageous for creative writing as it is for the Game of 24. Machines scored ToT at 7.56, CoT at 6.93, and IO at 6.19. Iterative refinement (asking the LLM to evaluate and improve its own output several times) brought IO performance close to ToT's, which calls into question whether it's the ToT setup or the extra compute (accompanying both iterative refinement and ToT) that helps GPT-4 at the creative writing task.

Similarly, human judges didn't find ToT exceedingly better than CoT: they rated ToT-generated passages more coherent than CoT-generated passages for 41 of 100 pairs and CoT more coherent than ToT for 21 of 100 pairs, finding the remaining 38 pairs similarly coherent.

Crossword Puzzles

Since the “Game of 24” and “Creative Writing” tasks only require GPT-4 to traverse three or fewer thought steps before spitting out its final answers, Yao et al. wanted a way to test deeper tree explorations; five-by-five mini crosswords serve this purpose.

For their final test, Yao et al. split mini crossword puzzles scraped from GooBix.com into a test set (of 20 games) and a train set (of 5 similar games used for prompting), each game providing five vertical and five horizontal clues as inputs and expecting a 5 x 5 grid filled with letters (making up words that satisfy the 10 clues) as output. Yao et al. measured success as the proportion of correctly answered (a scoring sketch follows this list):

  1. letters (max of 25 per game)

  2. words (max of 10 per game)

  3. games won
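
Scoring is mechanical. Here's a sketch, assuming each game's answer key is ten 5-letter words (five across, then five down) over the 5 x 5 grid:

```python
def score_game(pred: list[str], gold: list[str]) -> tuple[float, float, bool]:
    """Return (letter accuracy, word accuracy, game solved?) for one game.
    pred/gold: ten 5-letter words, five across followed by five down."""
    # The five across words alone cover all 25 grid cells exactly once.
    letters = sum(p == g
                  for pred_word, gold_word in zip(pred[:5], gold[:5])
                  for p, g in zip(pred_word, gold_word))
    words = sum(pw == gw for pw, gw in zip(pred, gold))
    return letters / 25, words / 10, words == 10
```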

Baselines

As the crossword puzzle baselines, IO prompting enjoyed five pairs of example crossword puzzles (just clues and the words satisfying those clues), while CoT prompts had the same plus intermediate words ordered h1 (for horizontal_1) through h5 and v1 (for vertical_1) through v5, nudging GPT-4 to solve crossword puzzles one clue at a time. Both baselines averaged the results of 10 samples.

ToT Setup

Using DFS, GPT-4 was asked to generate five candidates for what the next word should be and where it ought to go. The LLM was then asked to rate how confident it is in each candidate word (e.g., "sure," "low," "high") and to order the candidates based on that evaluation.

From any state, Yao et al. translated the LLMs’ previous thoughts into letter constraints. At any point, if the LLM decides that, given the letter constraints imposed by past choices, no possible word can satisfy some clue, the LLM can prune that subtree and backtrack to the parent node to explore a sibling branch (an unexplored thought).
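
A hedged sketch of that DFS loop, reusing the propose_thoughts and llm helpers from earlier (the board bookkeeping and prompt wording are elided or ours):

```python
def crossword_dfs(state: str, depth: int = 0) -> "str | None":
    """Depth-first ToT: commit to the most confident candidate first; if the
    LLM judges a clue unfillable under current letters, prune and backtrack."""
    if depth == 10:                                 # all 10 clues filled in
        return state
    for candidate in propose_thoughts(state, k=5):  # word + position candidates
        next_state = state + "\n" + candidate
        verdict = llm("Given the letter constraints so far, can every "
                      "remaining clue still be filled? Answer yes or no.\n"
                      + next_state)[0]
        if verdict.strip().lower().startswith("no"):
            continue                                # prune this subtree
        solution = crossword_dfs(next_state, depth + 1)
        if solution is not None:
            return solution
    return None                                     # dead end: backtrack
```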

Image Source: Yao et al. (ToT setup for crosswords. a = proposing and ordering candidate solutions; b = evaluating candidate solutions based on previous choices. If a subtree is deemed infeasible, it’s skipped)


Results

IO and CoT solved around 16% of the words, compared to ToT's 60% word solve rate, consistent with what we'd expect since IO and CoT can't explore different options, backtrack, or "change their mind." Intuitively, it seems that the backtracking in tree search ought to be useful for solving crosswords (most of us prefer a pencil and eraser to a pen when solving them), but ToT only solved 4 of 20 crosswords outright. What's with ToT's less-than-stellar game solve rate?

One contributing factor is that mini crosswords often contain rare words that GPT-4 might not have encountered during its training (out-of-vocabulary words). In such cases, ToT's state evaluator may have erroneously judged a rare word as impossible and hacked off the branch containing that word.

Investigating this possibility, Yao et al. ablated away the pruning mechanism, but this resulted in poorer performance, leading them to conclude that a better decision-making process for pruning was in order. Yao et al. also ablated the ability to backtrack by making a DFS variant that always processes the clue it deems most promising, overwriting past choices when necessary. This approach fared worse, suggesting backtracking helps ToT (at crossword puzzles at least).

Image Source: Yao et al. (Crossword Puzzle Results. -prune and -backtrack = ablation studies)


ToT’s Shortcomings and Contributions

Though helpful for some types of problems, ToT prompting is not without its downsides. First, Yao et al. admit that ToT’s intentional, deliberate approach costs more than IO or CoT prompting since ToT often requires many LLM queries to solve a single problem (remember, ToT uses LLM calls to generate candidate thoughts and to evaluate each generated thought—potentially across many branches of a tree—burning through tokens fast).

Machine learning researcher and engineer Yannic Kilcher points out another important shortcoming of ToT: ToT (at least as implemented by Yao et al.) requires us to hold an LLM's hand quite a bit, spelling out examples that show how to decompose specific types of problems (much as CoT requires) and tailoring idea generation, evaluation, and tree search toward specific problems. Ideally, we'd like prompt engineering techniques with fewer training wheels. Despite these shortcomings, ToT still contributes a lot.

An obvious advantage that ToT enjoys over CoT, for example, is gut-checking different thought options before committing to them by simulating "planning, lookahead, or backtracking," akin to how we humans rule out certain options to limit our search space rather than exploring every possible one. Additionally, ToT offers an interpretable approach to LLM problem-solving, since we can literally read each LLM "thinking" step instead of combing through non-intuitive embeddings and barely walking away with a sense of why an LLM made the decision it did.

Its biggest contribution, though, is that ToT's approach—amalgamating LLMs with established classical data structures and algorithms—isn't limited to tree search and will likely inspire people to experiment far and wide, fusing LLMs with more classical algorithms in novel ways, which is precisely what our next {X}-of-Thought article will be about—combining graph theory with LLMs, so stay tuned!
