Article·AI & Engineering·Apr 23, 2024
18 min read

Implementing a local LLM chatbot that can run code and searches

18 min read
Zian (Andy) Wang
By Zian (Andy) Wang
PublishedApr 23, 2024
UpdatedJun 27, 2024

In the evolving landscape of artificial intelligence, Large Language Models (LLMs) such as OpenAI’s GPT-4 have not only revolutionized the way we interact with data but also expanded the horizon of what intelligent agents can accomplish. These models, trained on diverse internet text, excel in understanding and generating human-like text based on the input they receive. However, the real magic begins when these models hold more power than generating plain text. GPT-4’s ability to run code, search the internet, and integrate with hundreds of plugins enables the automation of many tasks and expands the capabilities of these LLMs to new heights.

This tutorial is designed to lead you through a step by step guide on how to build a completely local LLM chatbot with some of these “plugin” capabilities. Specifically, the “agent” will be able to execute arbitrary Python code and have access to a basic search tool. With some intricate prompting and careful crafting of the system message, the model should be able to perform a wide variety of tasks with access to these tools.

It should be noted that there are libraries and frameworks specifically designed to construct LLM-based agents such as LangChain. However, with the plethora of options and configurations that LangChain offers, building a fully-fledged agent would be a rather tedious task and is outside of the scope of the article.

Installations and Setup

For this tutorial, we are going to use ollama to run LLMs in our local environment. It offers a simple CLI and local API for response generation. To work with its APIs, we could use Python’s request library but ollama’s Python library offers a much cleaner, easier solution. Installation instructions for ollama are available on their Mac and Windows download pages, and for Linux users:

curl -fsSL https://ollama.com/install.sh | sh

To install the Python library for ollama, type:

pip install ollama

In terms of Large Language Models, ollama makes it incredibly simple to run models that are quantized in the GGUF format, only requiring a few steps to start chatting with the LLM.

Installing the LLM

As mentioned previously, ollama utilizes the GGUF format for running LLMs. There are thousands of LLMs on HuggingFace that are free to download and use with a majority of them having multiple quantized versions of different sizes in the GGUF format. For this article, we are going to use a fine-tune of Mistral 7B, the Mistral Dolphin but any GGUF model should work if the memory constraints of the environment are sufficient for the LLM.

The mistral dolphin, from Cognitive Computations, is a 7 billion parameter model that is exceptionally excellent at coding and conversation compared to other models of the same size while being uncensored. The model can be run on most GPUs with at least 8 gigabytes of VRAM and is usable even without GPU acceleration with a decent CPU.

To setup the model for ollama, create a file named “Modelfile” without any extensions. The “Modelfile” contains the path to the GGUF file and any additional configurations that the user may wish to tinker.

Every model file must contain the “FROM” instruction indicating the GGUF model that the file is pointing to.

FROM /path/to/gguf/file/Dolphin 2.6 Mistral 7B.gguf

Next, the “TEMPLATE” instruction specifies the prompt template that the model is using. Mistral Dolphin uses the ChatML prompt template, which can be written in the Modelfile as such:

TEMPLATE """{{ if .System }}<|im_start|>system

{{ .System }}<|im_end|>

{{ end }}{{ if .Prompt }}<|im_start|>user

{{ .Prompt }}<|im_end|>

{{ end }}<|im_start|>assistant

"""

The final parameter that is commonly specified in the Modelfile is the system message. However, we will leave it blank for our project as it can be defined before starting each conversation through the ollama Python library. In order to distribute changes in the Modelfile to the actual LLM, ollama has to “recompile” the model every time, which is not only a waste of resources but also precious time.

To actually “link” the Modelfile to the GGUF model that we downloaded, we need to enter the following command into the terminal:

ollama create model-name -f /path/to/Modelfile

The parameter “model-name” can be any name that you wish the model to take on. This information does not affect the behavior of the model but rather provides a named reference to ollama and the user when running the model.

Setting Up a Command Line Interface

Before integrating any fancy features into the LLM, we need a basic chat interface to interact with the model. For simplicity, we will be creating a bare-bones command line interface to query the model and receive responses. However, if you are seeking for a more user-friendly interface, checkout this article on how to implement an end-to-end chatbot with audio capabilities using streamlit.

The Python ollama library provides a few helpful functions to interact with ollama’s API with ease.

The “chat” method will invoke the model specified by the “model” parameter and feed it the list of messages provided in the call. The message can contain a single user query or a list of past conversations, with each distinct response/query contained in a dictionary.

We can also stream the model’s response by changing stream to True. The method will then return a generator in which we can iterate through to receive response token by token. At each iteration, the content of the model response will be nested within a json object. To access the string representation of the returned token, we can do the following

With the above information in mind, we can create a basic chat interface that takes user input, streams the model response, then appends the exchange to a message history and continues until the user terminates it.

Now with the basics out of the way, we can start building some advanced functionality to equip the model with more than just plain text.

Executing Python Code

To execute code produced by the LLM, it is essential to accurately extract and implement the code directly from the model’s responses. Typically, LLMs output code in markdown format, encapsulated within triple backticks. While we could naively use regular expressions to identify all code blocks with the Python language in a model response and subsequently execute them, this method presents several challenges.

First, depending on the context and the user’s query, not all Python code may be meant to be executed. For example, the user could be asking for an example snippet for a specific framework and does not require an output or it’s impossible to execute without previous context and the correct environment.

Second, allowing the model to continue responding after writing code that is intended to be executed could cause confusion and even hallucination. For example, the code interpreter plugin in GPT-4 executes the code as soon as the model finishes writing, to avoid potential confusion and responses that may hallucinate the execution results.

To tackle the first challenge, we will limit code execution to blocks tagged with a specific identifier, such as “Python-exe” for Python code. This tag will be communicated to the model via the system message, ensuring that only the intended code is executed.

In markdown syntax, the language of a code snippet is specified immediately after the opening triple backticks. For example, C++ code is denoted as follows:

Addressing the second issue involves a more straightforward approach. Throughout the model’s response, we will search for the sequence "Python-exe" along with the delimiting triple backticks. Once we identify both the opening triple backticks tagged with “Python-exe” and the corresponding closing backticks, we will terminate the model’s response and extract the encapsulated code.

To find the triple backtick pattern described above, we can use Python’s regular expression module like so:

match = re.search(r'```Python-exe\n(.*?)```', full_response, re.DOTALL)

To actually extract the content that was matched, we will do the following:

To execute the code, we will use Python’s built-in function “exec”, which executes code inputted as a string. Note that using “exec” to run arbitrary code is not a safe practice by any means and should not be used in production as “exec” can run any code that is able to run on a normal machine with a full Python installation.

After the execution, we will redirect the output and errors, if any, to a string using contextlib:

Finally, we will resume the model response by feeding it the output of the code as a user prompt, with its previous response that invoked the code execution included in the history.

To complete our setup, we need to implement a loop that manages the interaction between the model’s responses and the code execution process. Specifically, if the model’s response includes any executable code, this code will be run, and its outputs will be fed back to the model. Should there be any errors, the model is likely to revise the code accordingly.

This process repeats until the model’s response contains no executable code, indicating that it has successfully processed the outputs. At this point, the model will typically incorporate these outputs into its final response, thus breaking the loop.

Putting everything together, we will encapsulate the code execution logic into a interact_with_model function, which takes a list of message history and handles the interaction we described above.

There is one last step that we need to complete in order for everything to work: the model’s system prompt. In addition to typical system prompts that instruct the LLM to be a helpful assistant and answer all questions to the best of its abilities, we also need to inform the model of its “extra” capabilities and “teach” the LLM how to use them.

The prompting part is actually much trickier than what one might expect, especially for smaller LLMs such as the Mistral 7B. Without clear instructions telling the model exactly what to do, the LLM may attempt to execute code in other languages, or write code in the execution “Python-exe” language without intending it. After many tries, the below prompt is what I found to be the best:

Putting everything together, the main function would look like this:

When run, the LLM will be able to use Python to assist with its responses whenever it deems necessary. Remember that the model has the same access to packages and other tools as you do in the environment that the code is running in, which means that some packages that the model imports may not be available.

Here’s the LLM successfully calculating the square root of 2, rounded to 10 decimal places using Python.

Being able to execute Python code is a step forward for the LLM, but we can take this further.

Modularizing Tools for LLM

We can expand the functionality of the Python code execution by equipping the LLM with additional tools. A significant limitation of current LLMs is their inability to access information beyond their training cut-off. To address this, many chatbots include a browsing feature that allows the model to search the internet for information outside of its built-in knowledge base.

To integrate this capability into our LLM, we will organize the execution of each tool into separate functions. A “tool selection” function will determine the appropriate tool to use based on the LLM’s response.

For example, to perform a Google search, the model will formulate its query in a syntax similar to executing Python code, enclosing the search terms within code blocks marked by the “google-search” language.

During the response parsing phase, we will employ the “tool selection” function that identifies the intended tool by checking the language specified in the code block. This function will then facilitate the use of the correct tool. Moreover, to streamline the process and minimize confusion, the system prompt will restrict the model to using only one tool at a time.

We can modify the regular expression and the matching logic as follows:

The “execute_tool” function will then call the corresponding tools to process the request.

Specifically, the “execute_tool” function is as follows:

The “execute_Python_code” function remains the same while the “perform_google_search” function will return the search results based on the query and is not implemented yet.

To implement a google search function without being overly complex, we will return the top 5 search results including their title and corresponding descriptions. This will contain much less functionality than giving the model full access to a browser but will serve our purposes just fine with an incredibly simple implementation.

To perform a google search, we will use the Python library "googlesearch", which uses requests and BeautifulSoup4 to scrape Google. To install it, type

pip install googlesearch-Python

In the “perform_google_search” function, we will limit the result to 5 items and set the “advanced” parameter to true in order to receive additional properties about the result other than the url.

We will also modify the “interact_with_model” function to include the tool type in the execution result to further avoid confusions:

Finally, we will modify the system prompt to inform the model of its newfound search abilities.

The main function will remain unmodified.

By structuring the use of various tools into modular functions and encapsulating the request processing and execution functionalities separately, we significantly simplify the integration of new tools.

To include an additional tool, we simply extend the regular expression used to identify the tool requests in the model’s output. For instance:

r'```(Python-exe|google-search|another-tool)\n(.*?)```'

Adding the support for a new tool involves inserting an “elif” statement in the execute_tool function and defining a corresponding function to handle the tool’s specific requests. This modular approach allows the model to seamlessly incorporate new capabilities to better assist users (such as adding a tool to perform TTS using Deepgram’s API!).

Of course, it’s crucial to update the system message with descriptions of the newly supported tools, so the model clearly understands the tools at its disposal.

It is important to note that with smaller models, a limited number of tools—typically two to three—are advisable. More tools might overwhelm the model, leading to confusion, hallucinations, or incorrect decisions about which tool to use in a given context.

Conclusion

The integration of Python execution and simple web search functionalities into a local LLM, such as the Mistral-based Dolphin, significantly enhances the utility of these models. By modularizing these capabilities into distinct functions and controlling their access through a clear and concise system message, we provide a structured and efficient way to expand the model’s capabilities.

This approach not only streamlines the implementation of additional tools but also helps in maintaining clarity and reducing the cognitive load on smaller models, which might otherwise struggle with multiple concurrent functionalities.

To explore more advanced functionalities, LangChain’s integration with the "ReAct" prompting strategies will present the LLM with a crystal clear approach to using tools and calling outside APIs. However, it is much more of a hassle to set up with a local LLM along with including conversation history.

Note: If you like this content and would like to learn more, click here! If you want to see a completely comprehensive AI Glossary, click here.

Unlock language AI at scale with an API call.

Get conversational intelligence with transcription and understanding on the world's best speech AI platform.