LLMs have taken the world by storm, and anyone with access to a decent laptop or desktop can now run their own model. Today, we are going to build an end-to-end chatbot with audio capabilities, with a bonus tutorial on how you can make Siri on your Apple devices respond with the intelligence of an LLM!

There are tens if not hundreds of APIs, libraries, and UI options for building your own LLM applications. They range from full-featured chat interfaces with multimodal abilities to bare-bones command line tools. Today, we’re going to focus on building a simple chat interface with Streamlit that has speech-to-text and text-to-speech capabilities, running on a local LLM.

Installing Dependencies

For this project, we are going to use Ollama to run LLMs, serving as the “backend” for our chat interface. Ollama comes with a ready-to-go, bare-bones command line interface for chatting with local LLMs, as well as a local API that we can call from our script to generate responses. It’s extremely simple to install, set up, and use, with more than enough customization options for a casual user.

To install Ollama on a Mac, go to their downloads page and follow the instructions. For Linux users, simply run this command:

curl -fsSL https://ollama.com/install.sh | sh

There is also a preview version for Windows, which can also be found on their downloads page.

Streamlit, the library that we are using for the frontend UI, is an open-source Python tool for creating clean, customizable, and beautiful UIs for machine learning and data science applications. Streamlit has many pre-built components that can be accessed through simple Python functions. It can be installed via pip:

pip install streamlit

Finally, for audio transcription and generation, we are going to use Deepgram. Deepgram offers a plethora of tools for analyzing audio data at lightning-fast speeds, with some of the cheapest prices among its competitors. In our case, we are going to use their speech-to-text and text-to-speech models.

To get started, sign up on Deepgram’s website here. Upon successfully registering, you are automatically given $200 worth of free credits without having to enter any payment information, which is more than enough for our simple project.

Since we are going to be using Python for this project, we will also install Deepgram’s Python SDK through pip:

pip install deepgram-sdk

Choosing an LLM

There are thousands of open-source LLMs on HuggingFace available for download. For this project, any model that is downloadable in GGUF format should work. The user "TheBloke" on HuggingFace has more than 1,000 models converted to the GGUF format at different levels of quantization. There are a few things to note when downloading LLMs:

  1. Make sure your local machine has enough memory. A rule of thumb is that your computer should have enough free memory to hold the model file plus roughly a third of the model’s size on top of that. For example, a 4-bit quantized 7B GGUF file is roughly 4 GB, so you’d want around 5.5 GB of free memory.

  2. Make sure your local machine can run the model at a decent speed. This usually isn’t a problem: most modern NVIDIA GPUs and almost all Apple silicon Macs can run 7-billion- to 13-billion-parameter models at more than 1 token per second. And don’t worry if you don’t have a fancy GPU; most 7-billion-parameter models run smoothly on mid- to high-end CPUs.

  3. Take note of the model’s prompt template; we are going to use it later with Ollama.

If a model isn’t available in GGUF format but can be downloaded as safetensors files (for example, many custom models on the OpenLLM Leaderboard), it can be converted to GGUF with llama.cpp, as sketched below. For more details, take a look at this Reddit thread.
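As a rough sketch of that conversion (the script and binary names have changed across llama.cpp versions, so check the repository’s README for your checkout), the steps look something like this:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# convert the safetensors checkpoint into a full-precision GGUF file
# (depending on the version, the script may be named convert.py or convert-hf-to-gguf.py)
python convert-hf-to-gguf.py /path/to/hf/model --outfile model-f16.gguf --outtype f16

# optionally, build llama.cpp (e.g. with make) and quantize the file down to 4 bits
# (the binary is called llama-quantize in newer releases)
./quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M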

For this example, we are going to use Dolphin Mistral, a fine-tuned, uncensored model based on the Mistral 7B model.

Setting up a Chat Interface

Streamlit provides two UI components specifically designed for chatbots: chat_message and chat_input. We can use the following code to write a single chat message:

import streamlit as st 
with st.chat_message("user"): 
    st.write("Hello World!")

To run the app, type the following command into your terminal:

streamlit run path_to_your_file.py

Streamlit will automatically open up a browser window for your app.

Similarly, we can set the role of the message to "assistant" to display it with a different avatar representing the AI. For the message content, instead of st.write we will use st.markdown, since most LLMs are trained to output Markdown.

with st.chat_message("assistant"):
    st.markdown("I'm a LLM")

We can also save our conversation history in the app’s session state. After initializing the session state with an empty list, we can append a dictionary for each of the assistant’s responses and the user’s messages.

# Initialize chat history
if "messages" not in st.session_state:
    st.session_state.messages = []

# user prompt
st.session_state.messages.append({"role": "user", "content": "sample prompt"})
# assistant response
st.session_state.messages.append({"role": "assistant", "content": "sample response"})

Finally, we can handle user input through st.chat_input like so:

if prompt := st.chat_input("How can I help you?"):
    # Display user message in chat message container
    st.chat_message("user").markdown(prompt)

Putting them all together, we have a basic chat interface.

import streamlit as st

# Initialize chat history
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat messages from history on app rerun

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# React to user input
if prompt := st.chat_input("What is up?"):
    # Display user message in chat message container
    st.chat_message("user").markdown(prompt)

    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})

    # For now, we will set the response to mirror the prompt
    response = f"{prompt}"
    # Display assistant response in chat message container
    with st.chat_message("assistant"):
        st.markdown(response)
    # Add assistant response to chat history
    st.session_state.messages.append({"role": "assistant", "content": response})

Integrating a Local LLM

An agent that simply repeats the user’s request isn’t very helpful; we need to integrate an actual LLM to do the job.

First, we’ll create a model in Ollama from the GGUF file we downloaded. Then, we’ll integrate the model into our script by calling its API to generate responses.

In order to run the model, Ollama requires the user to create a “Modelfile”, which contains the path to the GGUF file and any additional configuration the user may wish to tinker with.

Create a file named “Modelfile” without any extension. Every Modelfile must have a “FROM” instruction specifying the base model to run from. For example, in the case of our Dolphin Mistral model, the first line of the Modelfile would look like this:

FROM /path/to/gguf/file/Dolphin 2.6 Mistral 7B.gguf

Then, the “TEMPLATE” instruction specifies the prompt template the model uses, written with Go template syntax. The Dolphin Mistral model uses the ChatML prompt template, so our “TEMPLATE” instruction would be the following:

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""

Depending on the model’s prompt template, this instruction may look different.
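For reference, once Ollama fills in the template, the prompt the model actually sees looks roughly like this (with a hypothetical system message and user prompt):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, who are you?<|im_end|>
<|im_start|>assistant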

Finally, the user can optionally provide a system message for the model with the “SYSTEM” instruction followed by a message describing the behavior of the model.

SYSTEM system message here
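Putting the three instructions together, a complete Modelfile for our Dolphin Mistral example might look like the following (the path and the system message are placeholders to adjust for your setup):

FROM /path/to/gguf/file/Dolphin 2.6 Mistral 7B.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""

SYSTEM You are Dolphin, a helpful and friendly AI assistant.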

Heading over to the command line, we will have Ollama create the model based on our configuration with the following command:

ollama create model-name -f path/to/Modelfile

Alternatively, we can do this directly in our code by using Ollama’s API to create a model:

import requests 
url = "http://localhost:11434/api/create" 
data = { 
        "name": "model-name", 
        "modelfile": "FROM /path/to/gguf/file/Dolphin 2.6 Mistral 7B.gguf TEMPLATE..." 
        } 
response = requests.post(url, json=data)

We can actually use ollama’s built-in command line interface to run the model by typing

ollama run model-name

But that’s not what we’re here for; it’s incredibly simple to make an API call to Ollama locally from our code. By default, Ollama streams its response as JSON objects, token by token. We can build a Python function that handles the response with a generator:

import json
import requests

def stream_content(url, data):
    """
    Make a streaming POST request and yield the 'content' part of each JSON response as they arrive.
    :param url: The URL to which the POST request is made.
    :param data: A dictionary or JSON string to be sent in the body of the POST request.
    """
    
    headers = {'Content-Type': 'application/json'}
    
    # If data is a dictionary, convert it to a JSON string
    if isinstance(data, dict):
        data = json.dumps(data)
    
    # Make a streaming POST request
    with requests.post(url, data=data, headers=headers, stream=True) as response:
    
        # Raise an error for bad responses
        response.raise_for_status()
        
        # Process the stream
        for chunk in response.iter_lines():
            if chunk:
                # Decode chunk from bytes to string
                decoded_chunk = chunk.decode('utf-8')
                
                # Convert string to JSON
                json_chunk = json.loads(decoded_chunk)
                
                # if the model is done generating, return
                if json_chunk["done"] == True:
                    return
                
                # Yield the 'content' part
                yield json_chunk["message"]["content"]

In the above code, we make a POST request to the provided “url”, and as content is streamed in, we parse each line and decode it into JSON. The returned JSON object contains various fields, with the actual response generated by the model under the “message” key’s “content” key. Finally, we check whether the model has finished generating its response; if so, the function returns, marking the completion of the request.

On Ollama’s side, the chat API is structured as follows:

curl http://localhost:11434/api/chat -d '{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "prompt"
    }
  ]
}'

The messages field contains the user’s prompt, with the option of passing in the entire chat history, which makes multi-turn conversations possible.
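For example, a multi-turn request could look like the following sketch, where the earlier assistant message is simply whatever the model replied previously:

request_data = {
    "model": "model-name",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "Roughly how many people live there?"},
    ],
}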

By reusing our code for the basic chatbot interface, we can replace the mirroring response with our “stream_content” function and pass in our session state’s messages so the model has the complete history of the conversation. Putting everything together, our simple chatbot can be written in around 60 lines.

import requests
import json
import streamlit as st

def stream_content(url, data):
    """
    Make a streaming POST request and yield the 'content' part of each JSON response as they arrive.

    :param url: The URL to which the POST request is made.
    :param data: A dictionary or JSON string to be sent in the body of the POST request.
    """
    headers = {'Content-Type': 'application/json'}
    
    # If data is a dictionary, convert it to a JSON string
    if isinstance(data, dict):
        data = json.dumps(data)
    
    # Make a streaming POST request
    with requests.post(url, data=data, headers=headers, stream=True) as response:
        # Raise an error for bad responses
        response.raise_for_status()
        
        # Process the stream
        for chunk in response.iter_lines():
            if chunk:
                # Decode chunk from bytes to string
                decoded_chunk = chunk.decode('utf-8')
                
                # Convert string to JSON
                json_chunk = json.loads(decoded_chunk)
                
                # if the model is done generating, return
                if json_chunk["done"] == True:
                    return
                # Yield the 'content' part
                yield json_chunk["message"]["content"]

# Initialize chat history
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat messages from history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

request_data = {
    # the name of the model we created earlier with `ollama create`
    "model": "dolphin",
    "messages": []
}

# Accept user input
if prompt := st.chat_input("What is up?"):
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})
    # Display user message in chat message container
    with st.chat_message("user"):
        st.markdown(prompt)

    # Make streaming request to ollama 
    with st.chat_message("assistant"):
        request_data["messages"] = st.session_state.messages
        response = st.write_stream(stream_content("http://localhost:11434/api/chat", request_data))
    # Add assistant response to chat history
    st.session_state.messages.append({"role": "assistant", "content": response})

When run, we are presented with a simple, elegant interface to chat with.

Integrating Audio Capabilities

Now, chatting with text is cool and all, but we could’ve accomplished that with Ollama’s command line interface, albeit sacrificing some aesthetics. We can truly bring our app to the next level by integrating audio capabilities using Deepgram’s API.

In order to use Deepgram’s API, we must create an API key from our dashboard page. On the left sidebar, click on “API Keys”, then “Create a New API Key”. Name the key however you’d like, but make sure to record it somewhere before finishing the creation process; once you click out of the key creation window, the key’s contents can no longer be viewed.
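As a small, optional safeguard (and since we’ll need the key in a couple of places), one approach is to export the key as an environment variable instead of hard-coding it in the script; the variable name below is just a convention, not something Deepgram requires:

import os

# assumes you've run something like: export DEEPGRAM_API_KEY="your-key-here"
DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]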

Since Streamlit does not have built-in audio recording capabilities, we need to install an external widget created by the community in order to receive audio input.

pip install streamlit-audiorecorder

In addition to audio, we also want to keep the ability to interact with the chatbot using plain text.

from audiorecorder import audiorecorder

audio = audiorecorder("Record", "Stop")
prompt = st.chat_input("What is up?")

The “Record” string will be the text displayed on the record button, while the “Stop” string will be displayed on the stop-recording button.

Upon receiving the audio, we will use Deepgram’s API to convert it into text and assign the transcribed text to the ‘final_prompt’ variable. If no audio was recorded but a text prompt was submitted, we will set ‘final_prompt’ to whatever was typed into the chat box.

final_prompt = None
if prompt:
    final_prompt = prompt
elif audio:
    # use deepgram to transcribe audio
    final_prompt = "whatever that was transcribed"

if final_prompt:
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": final_prompt})
    # the rest of the logic for LLM response

To transcribe the audio, we will first export it to a WAV file and then use Deepgram’s Python SDK to transcribe the local file.
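The snippet below (as well as the text-to-speech code later on) assumes the following imports sit at the top of the script; they match version 3 of Deepgram’s Python SDK:

import time

import httpx
from deepgram import (
    DeepgramClient,
    DeepgramClientOptions,
    PrerecordedOptions,
    FileSource,
    SpeakOptions,  # used later for text-to-speech
)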

elif audio:
    # adds a unique id to the audio file in case the user wants to keep them
    id_ = str(time.time())
    audio.export(f"audio{id_}.wav", format="wav")

    with open(f"audio{id_}.wav", "rb") as file:
        buffer_data = file.read()

    # deepgram config and api keys
    config: DeepgramClientOptions = DeepgramClientOptions()
    deepgram: DeepgramClient = DeepgramClient("YOUR API KEY", config)

    payload: FileSource = {
        "buffer": buffer_data,
    }

    options: PrerecordedOptions = PrerecordedOptions(
        model="nova-2",
        smart_format=True,
    )

    response = deepgram.listen.prerecorded.v("1").transcribe_file(
        payload, options, timeout=httpx.Timeout(300.0, connect=10.0)
    )

    # retrieves the fully punctuated transcript from the raw response
    final_prompt = str(response["results"]["channels"][0]["alternatives"][0]["transcript"])

And that’s it for the transcription! We can run the app now, record some audio or type a prompt into the input box, and receive a generated response from the model.

However, notice that the default placement of the record button is at the top of the screen, which can cause styling issues once the number of messages stacks up and overlaps with the button. We can inject some CSS “hacks” directly into Streamlit to target our record button, an “iframe” element, and place it at the bottom of the page, preventing it from interfering with the messaging interface.

To do so, we add the following code right below the definition of our “stream_content” function.

style = """
<style>
iframe{
    position: fixed;
    bottom: -25px;
    height: 70px;
    z-index: 9;
}
</style>
"""

st.markdown(style, unsafe_allow_html=True)

Now, onto the text-to-speech part. We can convert the generated text to audio using Deepgram’s Python SDK and then play it back using Streamlit’s audio component. We will initialize another instance of Deepgram’s client using the same API key and create a configuration object. Then, we will generate an audio response and save it to a .wav file, which will be played through Streamlit’s audio player.

if final_prompt:
    # Previous code remain the same 
    # Make streaming request to ollama 
    with st.chat_message("assistant"):
        request_data["messages"] = st.session_state.messages
        # Generate a text response first
        response = st.write_stream(stream_content("http://localhost:11434/api/chat", request_data)) 
        deepgram: DeepgramClient = DeepgramClient("YOUR API KEY")
        # Configure the TTS
        options = SpeakOptions(
            model="aura-luna-en",
            encoding="linear16",
            container="wav"
        )
        # Unique identifier for the audio response file
        id_ = str(time.time())
        # Call the save method on the speak property
        _ = deepgram.speak.v("1").save(f"audio_response{id_}.wav", {"text": response}, options)
        # Display audio using streamlit
        with open(f"audio_response{id_}.wav", "rb") as file:
            audio_bytes = file.read()

        st.audio(audio_bytes, format='audio/wav')
    # Add assistant response to chat history
    st.session_state.messages.append({"role": "assistant", "content": response})

Deepgram offers 12 different voices to choose from, and their corresponding identifiers can be found on the documentation page. To use a different voice, simply replace the model field in the “SpeakOptions” configuration.
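For instance, to switch from the “luna” voice to the “asteria” voice (one of the other Aura voices listed in Deepgram’s docs), the configuration would become:

options = SpeakOptions(
    model="aura-asteria-en",  # a different Aura voice; see Deepgram's docs for the full list
    encoding="linear16",
    container="wav"
)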

Congratulations on making it to the end of this journey! By now, you’ve armed yourself with a chatbot that not only understands and speaks but does so with a flair unique to you.

Running your chatbot is the grand finale, the moment where all your hard work pays off. Just like before, open your terminal, navigate to the directory where your Streamlit app resides, and type in the magic command:

streamlit run your_app_name.py

Replace your_app_name.py with the actual name of your Python file. Press Enter, and voilà! Your default web browser will spring to life, presenting you with the user interface of your chatbot.

Interact with it, talk to it, and marvel at how it responds, both in text and voice.

Bonus: Turning Siri into an Intelligent LLM Assistant

For users with an Apple device that supports Siri, we can use Apple's built-in Shortcuts app to enhance Siri’s responses with a large language model. There’s an excellent shortcut created by Alex Kolchinski that invokes the OpenAI API to turn Siri into GPT-3.5. We can modify the shortcut to use the Ollama API and run LLMs locally for free.

To download the shortcut, head to this page and click “Get Shortcut”. The web page will automatically load the shortcut in the Shortcuts app. To edit it, right-click on the shortcut icon and select “Edit”.

Within the editor, the only action we need to modify is the API call, changing it from the OpenAI API to the Ollama API. We also need to adjust the message-extraction actions to conform to Ollama’s response format.

In the above action, we modified the request URL to point to the Ollama API and removed the authorization header, as the Ollama API does not require an API key. In the request body, make sure to change the value of the model key to a model that has been created through a Modelfile. We also need to set the “stream” key to false.
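For reference, the JSON body sent to http://localhost:11434/api/chat from the shortcut would look roughly like this, with the dictated prompt substituted in by the shortcut’s variables:

{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "the prompt dictated to Siri"
    }
  ],
  "stream": false
}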

Following the API request, the above actions extract the result from the returned JSON object. In some cases, removing the middle action that gets the “choices” key may disconnect the variables used afterwards. To ensure everything works properly, it’s best to remove the last actions and re-enter them with the correct keys and values.
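This is because, with “stream” set to false, Ollama returns a single JSON object whose generated text lives under the “message” key’s “content” field rather than OpenAI’s “choices” array; the response looks roughly like this:

{
  "model": "model-name",
  "created_at": "2024-01-01T00:00:00Z",
  "message": {
    "role": "assistant",
    "content": "the generated reply"
  },
  "done": true
}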

To use the shortcut, activate Siri and say the name of the shortcut, “GPT Mode” or whatever you renamed it to. And there you have it! A free upgrade to Siri in minutes.

As we wrap up this tutorial, it’s clear that the possibilities are as vast as our imagination. Whether you’re a seasoned developer or a curious newcomer, the journey doesn’t end here. Experiment with different LLMs, tweak the UI, or explore new APIs. Local LLMs are easier to access and use than ever, opening the door to all kinds of exploration and experimentation.
