Improvement or Stagnation? Llama 3.1 and Mistral NeMo

By Zian (Andy) Wang
Published Aug 28, 2024
Updated Aug 27, 2024

With the rapid development of Large Language Models (LLMs), new models boasting improved performance have become the norm. Two well-known open-source series, Llama and Mistral, have recently introduced new stars to their lineups: Llama 3.1 and Mistral NeMo.

Llama 3.1

Released on July 23rd, Meta’s Llama 3.1 aims to succeed Llama 3 while expanding its existing line of models. The Llama 3.1 family includes six models:

  • Llama 8B

  • Llama 8B Instruct

  • Llama 70B

  • Llama 70B Instruct

  • Llama 405B

  • Llama 405B Instruct

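The numbers indicate the parameter count. The Instruct versions are further fine-tuned for helpfulness and safety using Supervised Fine-Tuning (SFT) and preference-based alignment; Meta describes rejection sampling and Direct Preference Optimization (DPO) for this stage rather than classic Reinforcement Learning from Human Feedback (RLHF).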

Interestingly, unlike many recent models of similar size, Llama 3.1 405B is not a mixture-of-experts model but a single dense, decoder-only transformer.

Simplified Architecture of the Llama 3.1 Models

As usual, since Llama models are free to download, use, and fine-tune with few limitations, there are “uncensored” versions of Llama 3.1 models fine-tuned without the consideration of safeguards and censorship. The most notable one is the Dolphin fine-tune from Cognitive Computations.

All the models were trained on more than 15 trillion tokens, and the out-of-the-box context length sits at an impressive 128 thousand tokens.

To run the models locally, the most convenient method is to use Ollama and download the GGUF version of the respective model from Hugging Face. Here’s a handy tutorial on how to do it, along with some bonus functionality and a UI.
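Once a model is available locally, Ollama also serves it over a local HTTP API, which is handy for scripting the kinds of tests later in this article. The following is a minimal sketch rather than a definitive client: it assumes a default Ollama install listening on `localhost:11434`, and the model tag in the usage comment is an assumption that should be swapped for whatever tag you actually pulled.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> dict:
    """Build a non-streaming payload for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask(model: str, prompt: str, url: str = OLLAMA_URL) -> str:
    """Send one prompt to the local server and return the generated text."""
    data = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Usage (requires the Ollama server to be running and the model pulled):
# print(ask("llama3.1:8b", "How many e's are in the word ketchup?"))
```

Setting `stream` to `False` returns the whole answer in a single JSON object, which keeps the sketch short; interactive use would normally stream tokens instead.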

All models in the Llama 3.1 family can be used with Meta AI, integrated into Instagram, WhatsApp, Facebook, Messenger, and Meta’s webapp. The models are multilingual, and those on Meta’s website are also multimodal, processing images, analyzing files, and generating images with an integrated diffusion model.

Meta claims that Llama 3.1 405B rivals top AI models “when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation”.

Mistral NeMo

Not to be outdone, Mistral, the other giant in open-source LLMs, also released a new iteration of its model. In collaboration with NVIDIA, the Mistral team released Mistral NeMo on July 18th.

With increased demand for small to mid-range models, this 12B parameter model is designed to be a drop-in replacement and improvement for the previous Mistral 7B model. Due to the increase in parameter count, the 12B model is expected to be fundamentally better than its 7 billion counterpart.

This 12B parameter model fills the gap between small 8B models and large 70B+ models, offering a powerful yet manageable solution for those who find 8B insufficient but 70B too resource-intensive.

Like Llama, Mistral NeMo’s pre-trained base and instruction-tuned models are freely available to the public; Mistral releases them under the Apache 2.0 license. This multilingual model boasts an impressive 128k token context window, making it capable of handling more complex and lengthy tasks.

One of the most interesting aspects of Mistral NeMo is its use of a new tokenizer called Tekken. This tokenizer compresses text more efficiently than its predecessors, particularly for languages like Korean and Arabic, as well as source code. According to Mistral, Tekken is roughly two times more efficient at compressing Korean and three times more efficient for Arabic than the previous Mistral tokenizer, with a smaller gain of about 30% for source code.
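“Compression” here simply means needing fewer tokens to represent the same text, which in turn lets more text fit into the same context window. A rough sketch of how one might quantify that from two token counts; the function name and the counts in the usage lines are illustrative placeholders, not measurements from either tokenizer.

```python
def compression_ratio(old_token_count: int, new_token_count: int) -> float:
    """How many times fewer tokens the new tokenizer needs for the same text."""
    if old_token_count <= 0 or new_token_count <= 0:
        raise ValueError("token counts must be positive")
    return old_token_count / new_token_count


# Hypothetical counts for one passage encoded by two tokenizers.
old_count, new_count = 300, 100
print(compression_ratio(old_count, new_count))  # 3.0
```

A ratio of 3.0 would mean the same passage costs a third of the tokens, so roughly three times as much of that kind of text fits into a fixed context window.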

Mistral NeMo is integrated with the NVIDIA ecosystem and can be deployed using NVIDIA NIM. Like Llama, the model can be downloaded from Hugging Face, along with uncensored versions from Dolphin and other sources.

Llama 3.1 and Mistral NeMo: The Numbers

Although benchmarks and metrics never tell the whole story, they still offer an accessible way to gauge the performance of an LLM at a glance. With that said, let’s break down the numbers from Llama 3.1 and Mistral NeMo.

Llama 3.1’s metrics rival those of other models of similar size. However, Meta’s comparison for the 8B model did not include Mistral NeMo.

Performance metrics for Llama 3.1 8B and 70B

The medium-sized model in the family, Llama 3.1 70B, blows GPT-3.5 Turbo and Mixtral 8x22B out of the water across the reported benchmarks while using significantly fewer parameters than Mixtral 8x22B.

On the other hand, Mistral NeMo also boasts impressive numbers but again lacks a comparison with Llama 3.1. To be fair, Llama 3.1 had not yet been released when Mistral NeMo came out.

Performance metrics for Mistral NeMo 12B

From the official benchmarks, there’s only one metric in common: MMLU, a multiple-choice question set spanning 57 subjects that aims to test an LLM’s general knowledge.

Llama 3.1 8B, scoring 73%, comfortably tops Mistral NeMo, which scored only 68%. What’s more, Mistral NeMo’s MMLU score was obtained 5-shot, meaning the model was shown five worked examples in the prompt before each question, while Llama 3.1’s score was obtained zero-shot, with no examples at all.
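The difference between zero-shot and 5-shot comes down to what is placed in the prompt ahead of the test question. The sketch below illustrates the idea with a generic multiple-choice layout; the exact template is an assumption made for illustration and is not necessarily the one Meta or Mistral used in their evaluations.

```python
from typing import Optional


def format_question(question: str, choices: list,
                    answer: Optional[str] = None) -> str:
    """Render one multiple-choice item; the answer stays blank for the test item."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)


def build_prompt(test_item: tuple, examples: list) -> str:
    """Zero-shot when `examples` is empty; k-shot when it holds k solved items."""
    blocks = [format_question(q, c, a) for q, c, a in examples]
    blocks.append(format_question(*test_item))
    return "\n\n".join(blocks)


# Made-up items purely to show the shape of the prompt.
solved = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")] * 5
test_item = ("What is 3 + 3?", ["5", "6", "7", "8"])

zero_shot = build_prompt(test_item, [])
five_shot = build_prompt(test_item, solved)
```

In the 5-shot prompt the model sees five solved items before the real question, which mostly helps it lock onto the expected answer format, so scores obtained under the two settings are not directly comparable.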

That said, some users suggest that Mistral NeMo’s overall ability beats Llama 3.1’s. Again, metrics and numbers never tell the whole story and definitely fail to encompass all real-world use cases.

Llama 405B Shines

Although not the focus of the comparison, Llama 405B’s performance steals the stage from its “little brothers”. The benchmarks that Meta has given only compare it to subpar models, and Llama 405B has no issue beating all of them.

Interestingly, instead of raw numbers, Meta pitted the 405B model against the current state-of-the-art models and let humans evaluate its performance: whether it “won”, “tied”, or “lost”.

Llama 3.1 405B human evaluation results

The data shows that Llama 3.1 405B is on par with, if not better than, the current closed-source LLMs from both Anthropic and OpenAI. Llama 405B even slightly edges out Claude 3.5 Sonnet. Large parameter count aside, this is the first time that an open-source model has reached the level of the best of the best, regardless of parameter size.

Llama 405B is a huge win for open access to Large Language Models. Being open source does not mean anyone can run it on their laptop–the FP16 version of Llama 405B requires more than 800 gigabytes of VRAM–but the model isn’t behind closed walls either.
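The 800 GB figure follows from simple arithmetic: FP16 stores each parameter in 2 bytes, so the weights alone take about 405 billion × 2 bytes. A quick sketch of that estimate; the helper name is illustrative, and the result deliberately ignores the KV cache and activations, which add more on top.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


print(weight_memory_gb(405e9, 2))  # FP16 weights for 405B parameters: 810.0
print(weight_memory_gb(8e9, 2))    # FP16 weights for 8B parameters: 16.0
```

The same formula shows why the 8B model is practical on consumer hardware while the 405B model needs a multi-GPU server.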

With such technologies being open source, not only will their usage and inner workings be more transparent, but collaborative communities can also come together to further improve the model for many different use cases.

ChatGPT’s Killer Voice Feature

Llama 405B’s success does not mean the downfall of closed-source models like ChatGPT either. The impressive, life-like voice agents that were shown off at the release of GPT-4o are finally rolling out to early alpha testers.

GPT-4o provides users with a “voice” mode to chat with the model, and unlike plain text-to-speech, GPT-4o can express realistic emotions, handle real-time conversations, be interrupted, and much more. In fact, it was so realistic that OpenAI actually had to remove one of its voices for sounding too much like Scarlett Johansson’s role in the film “Her”.

Can LLMs Pass a Spelling Test?

As mentioned above, benchmarks and metrics can never tell the whole story, failing to encompass real-life use cases. To test how well Llama and Mistral actually perform, and which is better, I gathered a couple of prompts.

For all the testing, I will be running mistral-nemo and Llama 3.1 8B locally using Ollama, through the basic text interface in the terminal on my Mac.

To start, I decided to test if Llama and Mistral had the education of an elementary student, specifically, spelling tests.

Not long ago, I did a little spelling experiment to debunk a myth I heard: state-of-the-art LLMs can’t count the number of e’s in the word “ketchup”. Or I thought I was going to debunk it. Instead, to my surprise, and horror, both ChatGPT (3.5 at the time) and Bard consistently failed and hallucinated the spelling and the number of e’s in the word “ketchup”–claiming there to be none, or two instances of the letter.

Nearly a year later, both Google’s and OpenAI’s LLMs have gotten better at avoiding such silly mistakes, but that doesn’t mean these smaller models won’t trip on the same rock.
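For reference, the ground truth for this kind of question is trivial to compute outside the model, which is what makes it such a clean test. A small sketch, with a helper name chosen purely for illustration:

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("ketchup", "e"))  # 1
```

LLMs operate on tokens rather than individual characters, which is a commonly cited reason they stumble on a question that a one-line string operation answers correctly.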

Mistral NeMo answering the spelling question

Llama 3.1 8B answering the spelling question

Looks like Llama 3.1 passed without any hiccups but Mistral answered incorrectly first, then corrected itself immediately after. Not bad.

Do These LLMs Still Hallucinate?

Hallucinations are a common phenomenon in even the best LLMs, but just how bad can they get? I prompted both models with an absurd, obviously false scenario–that Oprah Winfrey sent personalized rubber ducks to every world leader in 2017–and tested whether the models would play along by asking why she did so.

Llama 3.1 8B responding to the absurd event

Llama had no problem identifying the unrealistic event, directly stating that it never happened with confidence. Across multiple trials, Llama stayed consistent with its answer without playing along and coming up with an unrealistic explanation.

On the other hand, Mistral NeMo wasn’t so smart responding to the “test”.

Mistral NeMo responding to the absurd event

Mistral NeMo immediately responded with confidence, affirming the occurrence of the event. It even connected the rubber ducks to the show Belief, a real documentary television series. What’s even funnier is that Belief was released in 2015, before the made-up rubber duck event. Yet the model hilariously stated that the rubber ducks distributed in 2017 were tied to the “launch of a new show called ‘Belief’”.

Since Llama had multiple attempts to prove its success, I gave Mistral NeMo’s hiccup the benefit of the doubt and asked the model again.

Mistral NeMo responding to the absurd event a second time

It looks like Mistral was actually able to recognize that the event never happened, but it then started hallucinating about the origins of the rumor. The Facebook post described by Mistral does not exist, and neither do the links on which it supposedly based its assumptions.

Counterintuitively, even though Mistral NeMo has more parameters than Llama 3.1 8B, it appears to be much more prone to hallucination.

Of course, this doesn’t mean Llama 3.1 isn’t prone to hallucinations. In fact, even the best models, open or closed source, hallucinate fairly often. However, from the simple test I conducted, we can at least say Llama 3.1 won’t get too crazy with its imaginations.
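Because a single response can be lucky or unlucky, it helps to repeat a false-premise prompt several times and look at how often the model pushes back. The sketch below shows one crude way to tally that. The keyword list is an assumption (a real evaluation would need human review or a stronger classifier), and the sample responses are invented for illustration rather than taken from either model.

```python
# Phrases assumed to signal that the model rejected the false premise.
REJECTION_MARKERS = (
    "never happened",
    "did not happen",
    "no record",
    "no evidence",
    "not aware of",
)


def rejects_premise(response: str, markers: tuple = REJECTION_MARKERS) -> bool:
    """Crude keyword check for whether a response disputes the premise."""
    text = response.lower()
    return any(marker in text for marker in markers)


def rejection_rate(responses: list) -> float:
    """Fraction of trials in which the model refused to play along."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(rejects_premise(r) for r in responses) / len(responses)


# Invented responses standing in for two trials of the same prompt.
trials = [
    "I have no record of Oprah sending rubber ducks to world leaders.",
    "Oprah sent the ducks to promote the launch of a new show.",
]
print(rejection_rate(trials))  # 0.5
```

A keyword check would still count Mistral NeMo’s second answer above as a rejection even though it went on to invent a source, so a tally like this is a starting point rather than a verdict.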

Can Mistral and Llama Write Some Code?

Finally, I put both models to work on one of the more common day-to-day use cases: programming. I wanted a simple way to verify the results from both models, one that would not require installing packages or setting up environments. Environment and package problems can drag down the test results in ways that do not reflect the abilities of the LLMs.

Thus, I resorted to tasking the LLMs with writing a basic but functional game of Pong in JavaScript.

Specifically, I wanted the game to be played by two players on the same keyboard, with player one using the w and a key to move the paddle and the second player using the up and down arrow keys to move their paddle. I also requested a scoring system and a restart button.

Here’s the prompt I used: “write a basic but functional and complete game of pong. This is a two player game where player one will control their paddle with the key w and the key while player two will control their paddle with the up and down arrow keys. There should be a score and a restart button. the game should be written in html and javascript.”

Here’s the result from Mistral NeMo:

Although the graphic is very basic and far from aesthetic, it looks like everything functions correctly. The player scores a point when the ball bounces on the wall behind their paddle, and both paddles can indeed be moved with the respective keys.

The code is merely 80 lines of HTML and JavaScript, and everything runs on the first go without having to adjust anything.

On the other hand, this time Llama 3.1 failed pretty miserably. The game still runs but is filled with bugs and weird behavior.

Instead of bouncing, the ball teleports to the other side when hitting the wall or the paddle. The scoring system is a little janky as well, giving the player a point when they successfully hit the ball with their paddle. In most versions of Pong, a player only scores a point when the other player misses the ball. Overall, the game from Llama 3.1 was barely functional (the code is not shown here since it was over 140 lines long!).
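To make the expected behavior concrete, here is a language-agnostic sketch of the two rules in question, written in Python for brevity. It is not the code either model produced; the names and the coordinate layout (paddles on the left and right edges, `y` growing downward) are assumptions for illustration. A bounce flips the sign of a velocity component, and a point is awarded only when a paddle misses.

```python
from dataclasses import dataclass


@dataclass
class Ball:
    x: float
    y: float
    vx: float
    vy: float


def step(ball: Ball, width: float, height: float,
         left_paddle_y: float, right_paddle_y: float,
         paddle_height: float, scores: dict) -> None:
    """Advance the ball by one tick, handling bounces and scoring."""
    ball.x += ball.vx
    ball.y += ball.vy

    # Top and bottom walls: reflect the vertical velocity (bounce, not teleport).
    if (ball.y <= 0 and ball.vy < 0) or (ball.y >= height and ball.vy > 0):
        ball.vy = -ball.vy

    # Left edge: bounce if the left paddle covers the ball, else right scores.
    if ball.x <= 0 and ball.vx < 0:
        if left_paddle_y <= ball.y <= left_paddle_y + paddle_height:
            ball.vx = -ball.vx
        else:
            scores["right"] += 1
            ball.x, ball.y = width / 2, height / 2
    # Right edge: mirror image of the left-edge rule.
    elif ball.x >= width and ball.vx > 0:
        if right_paddle_y <= ball.y <= right_paddle_y + paddle_height:
            ball.vx = -ball.vx
        else:
            scores["left"] += 1
            ball.x, ball.y = width / 2, height / 2
```

Checking the velocity direction alongside the position keeps the ball from flipping back and forth when it sits exactly on a boundary.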

The results are fairly consistent across various trials, with Mistral’s implementations almost always working on the first try; even when they didn’t, minimal modification was needed to get them working. In contrast, Llama 3.1’s implementation almost never worked.

Conclusion

The release of Llama 3.1 and Mistral NeMo marks another exciting chapter in the open-source LLM saga. While both models bring impressive capabilities to the table, our hands-on testing revealed some surprising strengths and weaknesses.

Llama 3.1 8B showed remarkable common sense and fact-checking abilities, easily spotting the absurd “Oprah rubber duck” scenario that tripped up Mistral NeMo. This resistance to hallucination is crucial for real-world applications where accuracy matters. However, Llama 3.1 stumbled when it came to the practical task of coding a simple Pong game, producing a buggy mess that barely functioned.

Mistral NeMo, on the other hand, proved to be the coding champ. Its clean, functional Pong implementation worked right out of the gate, showcasing its potential for developer-focused applications. But its tendency to confidently state falsehoods about non-existent Oprah events is concerning and highlights the ongoing challenge of reducing hallucinations in LLMs.

Despite the flaws, both Mistral NeMo and Llama 3.1 mark a huge leap for open-source models, especially at the small to medium sizes. The smaller models are very easy to use without absurd hardware or memory requirements. We’re in for a wild ride as these models continue to evolve.
