API-Bank: Benchmarking Language Models’ Tool Use
Foundational Large Language Models (LLMs) are very much “jack-of-all-trades, master-of-none” entities; they can do a lot decently but often stumble on tasks requiring specialized or obscure knowledge.
Fine-tuning foundational LLMs can, of course, help improve their ability to handle esoteric tasks, but fine-tuning foundational LLMs for every narrow task under the sun would be costly, and we’d still need some sort of “brain” logic to decide which fine-tuned model to use in which circumstance. To become more useful to us (and more efficient), LLMs need to use tools. Often taking the form of APIs, tools allow LLMs to do things like call other models, retrieve information, interpret code, act on the physical world, and more.
Shortly after ChatGPT brought LLMs’ potential (and shortcomings) mainstream, researchers and the open source community began investigating how to extend the capabilities of LLMs with external tools. A few such efforts are:
OpenAI’s addition of plugins to ChatGPT
The LangChain ecosystem’s many options for augmenting LLMs with external tools
LLM agent frameworks designed for LLMs to semi-autonomously employ tools to accomplish human-defined tasks
Toolformer, which found that LLMs could teach themselves to use external APIs
Berkeley and Microsoft Research’s Gorilla, an LLM trained specifically to call APIs
ToolkenGPT, which turned tools themselves into tokens (tool + token == “toolken”), embedding tools similarly to how we tokenize words and subwords in LLMs
Needless to say, excitement about improving LLMs’ abilities via tool augmentation is brimming. With all this experimentation, it’d be nice, though, to test how well (or poorly) LLMs actually employ tools, which is why, in April 2023, Li et al. created API-Bank, the first benchmark testing tool-augmented LLMs. Specifically, API-Bank tests LLMs’ abilities to find APIs relevant to a user-defined goal and then plan and execute API calls to accomplish that goal.
LLM Tool-Augmentation Approaches
There are two main approaches toward augmenting LLMs with tools:
In-context learning (i.e., showing a pre-trained model examples of how to use tools)
Fine-tuning (i.e., feeding a pre-trained LLM annotated data relating to tools (e.g., API documentation))
A shortcoming of in-context learning approaches is that examples must remain within an LLM’s context window, which may be too short to provide sufficient examples. Aware of this weakness, Li et al. designed API-Bank as an in-context learning tool-augmentation approach that overcomes context length limitations via the following three key components:
An “API Pool” of APIs for various tasks
A keyword-based search engine, “ToolSearcher”, that retrieves relevant APIs from the API Pool
Prompts explaining to an LLM a user-defined task and how to employ ToolSearcher
Is a Tool Even Necessary?
Here’s how these components interact: Given a user request (e.g., book me a trip somewhere sunny), an LLM’s first step is determining if it can fulfill that request with its own internal knowledge or if calling APIs would be a better approach, meaning the LLM needs some sense of its own “known unknowns.”
At this stage, the LLM has three options:
It can respond using its internal knowledge
It can ask the user for additional clarification
It can go ahead with an API call (other LLM-tool forms exist, like programming a function from scratch, but API-Bank, as its name suggests, only tests LLMs’ API utilization abilities).
Digging into the Toolbox
An LLM’s next step is finding the right tool for the right job (i.e., the most suitable API for the user’s request). To do this, before making any API calls, the LLM summarizes a user request into a handful of keywords and then inputs these keywords into ToolSearcher (the API search engine), which queries the API Pool (the collection of available APIs) to find the most relevant API. After that, the LLM receives the candidate API’s documentation (the API’s function description and its input and output parameters).
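To make the keyword search step concrete, here is a minimal sketch of a toy API pool and a keyword-overlap search over it. Everything in it is an assumption for illustration: the `ApiDoc` fields, the three example APIs, and the scoring rule are hypothetical and are not taken from Li et al.'s implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ApiDoc:
    """Documentation an LLM would see for one API (hypothetical fields)."""
    name: str
    description: str
    input_params: dict = field(default_factory=dict)
    output_params: dict = field(default_factory=dict)


# A toy pool with made-up entries; the benchmark's real APIs are not reproduced here.
API_POOL = [
    ApiDoc("GetWeather", "look up the weather forecast for a city",
           {"city": "str"}, {"forecast": "str"}),
    ApiDoc("BookHotel", "reserve a hotel room in a city",
           {"city": "str", "nights": "int"}, {"confirmation": "str"}),
    ApiDoc("AddMeeting", "add a meeting to the calendar",
           {"title": "str", "time": "str"}, {"status": "str"}),
]


def tool_search(keywords, pool=API_POOL):
    """Return the API doc sharing the most keywords with its name and description, or None."""
    def score(doc):
        words = set((doc.name + " " + doc.description).lower().split())
        return sum(1 for kw in keywords if kw.lower() in words)

    best = max(pool, key=score)
    # No overlap at all means the search found nothing relevant.
    return best if score(best) > 0 else None


# The LLM would produce these keywords from the user's request.
candidate = tool_search(["hotel", "reserve"])
print(candidate.name, candidate.input_params)
```

A real search engine would likely use stemming or embeddings rather than exact word overlap; the point here is only the shape of the interaction: keywords in, one candidate API's documentation out.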
Evaluating Tools
With an API’s documentation, the LLM then decides if that API looks worth a try or if it’s back to the drawing board (i.e., tweak the keywords a bit, search for a new API, check its documentation, and repeat the cycle). The LLM has the option of throwing in the towel here, giving the user an LLM version of the blue screen of death (e.g., "As a large language model, I cannot…”), but if the LLM finds an API suited for the user’s task, it calls that API, and the decision tree branches out a few more times.
Calibrating Tools
An API might return results relevant to the user’s request. In this case, the LLM would pass those on to the user. But an API might also return an exception (i.e., an error message). When encountering an exception, an LLM ideally attempts to use that exception message to modify the API call and try again. Should that fail, the LLM can inform the user that it can’t solve the task with the available APIs. Below is a diagram of all API-Bank’s components and decision flow:
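The exception-handling branch described above can be sketched as a small retry loop. This is an illustrative sketch only: `call_with_retry`, `book_hotel`, and `fix_nights` are hypothetical names, and the `repair` callback merely stands in for the LLM reading the error message and revising its call.

```python
def call_with_retry(api, params, repair, max_attempts=2):
    """Call api(**params); on an exception, let `repair` revise the params and retry.

    `repair(params, error_message)` is a stand-in for the LLM using the
    exception message to produce a corrected call.
    """
    for _ in range(max_attempts):
        try:
            return {"ok": True, "result": api(**params)}
        except Exception as exc:
            params = repair(params, str(exc))
    # All attempts failed: report back to the user instead of looping forever.
    return {"ok": False, "result": "I can't solve this task with the available APIs."}


# Hypothetical API that insists on an integer number of nights.
def book_hotel(city, nights):
    if not isinstance(nights, int):
        raise TypeError("nights must be an int")
    return f"Booked {nights} night(s) in {city}"


def fix_nights(params, error_message):
    # Stand-in for the LLM's correction step: coerce the field the error names.
    if "nights" in error_message:
        return {**params, "nights": int(params["nights"])}
    return params


outcome = call_with_retry(book_hotel, {"city": "Lisbon", "nights": "3"}, fix_nights)
print(outcome)
```

Capping the number of attempts mirrors the final branch of the flow: if the corrected call still fails, the model tells the user it cannot complete the task.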
Now that we have an idea of API-Bank’s components and processes, let’s take a quick look at a few of its implementation details before diving into the evaluation process.
API-Bank’s Implementation
Li et al. constructed 53 APIs for the API-Bank benchmark, spanning search engine, calendar, smart home control, and hotel reservation APIs, as well as other artificial intelligence (AI) models like image captioning, speech recognition, translation, and document question-answering models. API-Bank ignores each of these APIs’ internals, focusing only on their interfaces (i.e., LLMs must use APIs’ names, function descriptions, and input and output parameters to infer things about the API).
Three databases were pre-initialized with random relevant data because their associated APIs needed to modify already-populated databases to work correctly. API-Bank also mimicked user authentication for APIs that required authentication by retrieving a token from an “Account” database and then passing that token to APIs that required authentication.
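As a rough illustration of the mocked authentication flow, the sketch below fetches a token from a toy account store and passes it to an API that requires one. The account data, the function names (`get_user_token`, `add_agenda`), and the error messages are all hypothetical and are not drawn from the benchmark's code.

```python
# Toy "Account" database with one made-up user.
ACCOUNT_DB = {
    "alice": {"password": "correct-horse", "token": "tok-alice-001"},
}


def get_user_token(username, password):
    """Return the stored token for valid credentials, else raise."""
    account = ACCOUNT_DB.get(username)
    if account is None or account["password"] != password:
        raise PermissionError("unknown user or wrong password")
    return account["token"]


def add_agenda(token, title):
    """A hypothetical API that only works with a token found in the account store."""
    valid_tokens = {account["token"] for account in ACCOUNT_DB.values()}
    if token not in valid_tokens:
        raise PermissionError("invalid or missing token")
    return {"status": "added", "title": title}


# Correct order of operations: authenticate first, then call the protected API.
token = get_user_token("alice", "correct-horse")
result = add_agenda(token, "Dentist appointment")
print(result)
```

In this setup a model that skips the first step, or invents a token, gets an exception back rather than a result.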
A “ToolManager” assisted APIs with accessing and modifying accompanying databases and allowed LLMs to interact with operating system interfaces when doing so was necessary for an LLM to manipulate applications.
Finally, to ensure consistent testing conditions, ToolManager wiped the initial test environment so that it was the same before each turn and ensured database changes persisted within one round of turns (i.e., until the LLM returned a final answer to the user).
API-Bank’s Three-Tier Evaluation
API-Bank used three layers of automatic and manual evaluation across 264 annotated dialogues that included 568 API calls. The evaluation prompts mimic dialogues that you’d expect between a human and ChatGPT, for example. You’ll see what these look like as we review each evaluation tier.
To Call or Not to Call: Level-1 Evaluation
Level-1 evaluation tests how well LLMs decide to call an API and whether they call APIs correctly. To do this, LLMs are fed an explicit hint to call an API and that API’s description within a prompt. Inferring APIs’ capabilities from their description in the prompt, LLMs should decide whether to call an API and, if so, pass the parameters to do so.
To evaluate the API call itself, API-Bank then parses these parameters and sends them to the appropriate API, verifying their correctness.
Finding the Right Tool: Level-2 Evaluation
Instead of spoon-feeding an LLM the exact API suitable for a user’s request, Level-2 evaluation checks an LLM’s ability to find the right tool for the right job absent any API documentation in the prompt other than that for ToolSearcher (i.e., it tests how well LLMs identify suitable APIs by harnessing an API search engine).
For Levels 1 and 2, API-Bank compared the LLMs’ responses that contain APIs’ returned results with ground-truth results. Li et al. admit this isn’t a perfect method since there are often numerous acceptable responses, but measuring similarity to one ground truth answer was an obvious way to measure LLMs’ ability to call APIs at scale. To quantify LLMs’ total accuracy, API-Bank simply divided the number of correct predictions by the total number of predictions.
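The accuracy figure is a plain ratio, which a few lines make explicit. In this sketch, representing each API call as a string and judging correctness by exact match are simplifying assumptions, not a description of how Li et al. compared responses.

```python
def accuracy(predictions, ground_truth):
    """Fraction of predicted API calls that exactly match the ground truth."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("need equal-length, non-empty lists")
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Made-up example: two of four predicted calls match.
predicted = ["GetWeather(city='Oslo')", "BookHotel(city='Oslo')", "none", "AddMeeting(title='1:1')"]
expected = ["GetWeather(city='Oslo')", "BookHotel(city='Rome')", "GetWeather(city='Rome')", "AddMeeting(title='1:1')"]
print(accuracy(predicted, expected))
```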
Employing the Entire Toolbox: Level-3 Evaluation
Level-3 checks how well LLMs employ multiple APIs (and often multiple calls to them) to meet broad user requests. A level-3 user request, for example, might be:
Ideally, the LLM would then use the appropriate APIs to check weather, flights, trains, hotels, calendars, etc.
Level-3 evaluations involved human testers—guided by the below prompt—answering LLMs’ questions until the broad task was solved. Quantifying LLMs’ planning performance involved tallying up the dialogue turns it took an LLM to complete API calls and comparing that number to how many turns humans—blind to the test and given the exact same details as the LLM—took to complete the same multi-pronged task by invoking API calls.
For Level-3’s evaluation, the fewer turns it takes to fulfill a user’s goal, the better. So an LLM that planned your vacation in four turns performed better than an LLM that took 100 turns. An LLM was considered to have completed a user-defined goal when the LLM, with the help of human testers’ guidance, called an API with the same parameters as the ground truth (that human testers defined). Below is an example task and its coinciding ground truth API calls.
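One way to picture the completion criterion is as a search through a dialogue for the first turn whose API call matches the ground truth. The sketch below assumes a hypothetical representation: each turn is either `None` (no API call that turn) or a dict with `api` and `params` keys. The benchmark's actual data format may differ.

```python
def matches_ground_truth(call, truth):
    """True when the API name and every parameter equal the ground truth."""
    return call["api"] == truth["api"] and call["params"] == truth["params"]


def turns_to_complete(dialogue_calls, truth):
    """Return the 1-based turn at which the ground-truth call was made, or None."""
    for turn, call in enumerate(dialogue_calls, start=1):
        if call is not None and matches_ground_truth(call, truth):
            return turn
    return None


# Made-up dialogue: a clarifying question, a wrong call, then the right call.
truth = {"api": "BookHotel", "params": {"city": "Lisbon", "nights": 3}}
dialogue = [
    None,
    {"api": "BookHotel", "params": {"city": "Lisbon", "nights": 2}},
    {"api": "BookHotel", "params": {"city": "Lisbon", "nights": 3}},
]
print(turns_to_complete(dialogue, truth))
```

Fewer turns is better under this metric, so a model finishing at turn 3 would outscore one finishing at turn 10 on the same task.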
Results
Li et al. tested GPT-3 DaVinci and GPT-3.5 Turbo models (they didn’t yet have access to GPT-4’s API) on Levels 1 and 2 evaluations and tested GPT-3.5 Turbo, GPT-4, and humans on Level-3 evaluations.
GPT-3.5 Turbo fared alright, accurately calling APIs 50% of the time during the Level-1 evaluations, whereas GPT-3 DaVinci rarely ever called APIs successfully. For Levels 1 and 2, the tasks that GPT-3.5 Turbo tripped over the most were ones requiring multiple rounds of interdependent API calls. Additionally, Li et al. found that LLMs often failed to call APIs even when prompts blatantly instructed them to do so.
Other frequent LLM mistakes were not calling the right API for the job—often stemming from hallucinations (i.e., instead of searching for a relevant API with ToolSearcher, the LLM just made up an API)—or the LLM forgetting to authenticate APIs that required credentials.
Unsurprisingly, GPT-4 outdid GPT-3.5 Turbo on most of the Level-3 planning-oriented evaluations. GPT-4 solved tasks in 38% fewer turns on average than GPT-3.5 Turbo, and humans, besting both models, completed tasks in 35% fewer turns on average than GPT-4.
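Because the two percentages use different baselines, they compound rather than add. The quick calculation below takes the two averages at face value and uses a hypothetical baseline of 100 turns for GPT-3.5 Turbo to show roughly how humans compare with GPT-3.5 Turbo directly. It is back-of-the-envelope arithmetic, not a figure reported by Li et al.

```python
gpt35_turns = 100.0                       # hypothetical baseline, chosen for easy reading
gpt4_turns = gpt35_turns * (1 - 0.38)     # 38% fewer turns than GPT-3.5 Turbo
human_turns = gpt4_turns * (1 - 0.35)     # 35% fewer turns than GPT-4

# Reduction of humans relative to GPT-3.5 Turbo: 1 - 0.62 * 0.65, about 0.597.
human_vs_gpt35 = 1 - human_turns / gpt35_turns
print(f"GPT-4: {gpt4_turns:.1f} turns, humans: {human_turns:.1f} turns")
print(f"Humans need about {human_vs_gpt35:.0%} fewer turns than GPT-3.5 Turbo")
```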
For Level-3 tests, GPT-3.5 Turbo performed well at a few things, including:
Modifying the queries it sent to ToolSearcher when ToolSearcher initially suggested an API irrelevant to the user’s task
Avoiding repeatedly asking the user for the information required to call an API when such information was mentioned earlier
Converting information into the proper format for an API call
GPT-3.5 Turbo showed some flaws too, though. It sometimes hallucinated APIs’ missing input parameters instead of asking the user for them. Additionally, since it often didn’t summarize users’ goals into keywords well, GPT-3.5 Turbo failed to fully utilize ToolSearcher. It also sometimes didn’t quite get that it could call APIs itself, asking the user to do so instead.
Another flaw of GPT-3.5 Turbo is that it sometimes defied human logic. On one occasion, for example, GPT-3.5 Turbo lacked a user authentication token for accessing a database. Rather than requesting credentials from the user, GPT-3.5 Turbo just made up its own. Then, when its hallucinated authentication token threw an error, GPT-3.5 Turbo generated a new entry in the database to match the token it wanted to use. It then solved the rest of the task (booking a calendar appointment) with the fictional user and the credentials it had just hallucinated, wiped its hands, and called it a day.
API-Bank’s Contributions
Though Li et al. have yet to open source their code, as the first LLM-tool benchmark, API-Bank is a useful proof-of-concept for people to build follow-on LLM-tool benchmarks. By identifying the decision tree that LLMs typically take when calling APIs, API-Bank identified and evaluated several important tool-use aspects, including the decision to call APIs, how accurately APIs are called, how relevant APIs are to user-defined tasks, and the ability to efficiently call a combination of APIs to solve a multistep task. It’ll be interesting to watch how LLM-tool benchmarks are refined and expanded since there are millions of APIs that LLMs could potentially tap as tools (many more than the 53 that API-Bank tested) and since LLMs are not limited to premade APIs as their sole form of tools (they can code and test their own functions too). Future LLM-tool benchmarks that test LLMs’ capacity to employ a wider variety of APIs or benchmarks that test LLMs’ capacity to produce their own tools will reveal much more, but API-Bank is a great starting point.