Codestral 22B, Qwen 2.5 Coder 7B, and DeepSeek V2 Coder: Which AI Coder Should You Choose?

By Zian (Andy) Wang
Published Oct 11, 2024
Updated Oct 10, 2024

As the open-source LLM space grows, more models are becoming specialized, with “code” LLMs becoming extremely popular. These LLMs are intended to be smaller than their “general knowledge” counterparts but aim to exceed the coding performance of larger, general-purpose models.

These models offer the capabilities of larger models at a fraction of their cost, further democratizing the local LLM space. In particular, three models in the smaller coding LLM space outshine their competition: Codestral 22B, DeepSeek Coder V2 Lite 16B, and Qwen 2.5 Coder 7B.

Codestral 22B was released on May 29th, 2024, as the first code-specific model from Mistral. It is said to be fluent in more than 80 programming languages and supports Fill-in-the-Middle, which lets it act as an assistant alongside the developer.

Qwen 2.5 Coder 7B was released on September 19th, 2024 by Alibaba Cloud. It is part of their Qwen series, with models ranging from 1.5B to 32B parameters, and targets performance close to that of closed-source models.

DeepSeek V2 Coder was released in June 2024 by DeepSeek AI. This model is an improved version of DeepSeek V1, trained with 1.17 trillion code-related tokens, and it focuses on enhanced code generation and math capabilities with support for Fill-in-the-Middle as well. Alongside the “base” model with 236 billion parameters, they also released a smaller “lite” version with 16 billion parameters.
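To make the Fill-in-the-Middle idea concrete, here is a minimal sketch of how an editor plugin might split a file at the cursor and ask a model for the missing piece. The sentinel strings `<PRE>`, `<SUF>`, and `<MID>` and the helper `build_fim_prompt` are hypothetical placeholders, not the actual tokens of Codestral or DeepSeek; each model defines its own special tokens and ordering, so check the model card before relying on any of this.

```python
# Hypothetical sentinel strings; real models define their own special tokens.
PREFIX_TOKEN = "<PRE>"
SUFFIX_TOKEN = "<SUF>"
MIDDLE_TOKEN = "<MID>"


def build_fim_prompt(source: str, cursor: int) -> str:
    """Split the source at the cursor and ask for the missing middle."""
    prefix, suffix = source[:cursor], source[cursor:]
    # Prefix-suffix-middle ordering: the model sees the code on both sides of
    # the gap and is expected to generate the text that belongs in between.
    return f"{PREFIX_TOKEN}{prefix}{SUFFIX_TOKEN}{suffix}{MIDDLE_TOKEN}"


code = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = code.index("return ") + len("return ")  # cursor sits right after "return "
prompt = build_fim_prompt(code, cursor)
print(prompt)
```

A model trained for this task would be expected to continue the prompt with something like `a + b`, which the editor then inserts at the cursor.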

Comparing the Numbers

All three models boast state-of-the-art performance in their respective parameter categories, and the numbers are fairly impressive. Let’s take a look at the performance of the models on the most popular programming benchmark: HumanEval.

Codestral and DeepSeek Coder V2 Lite both score 81.1%, while Qwen 2.5 Coder 7B boasts an 88.4% on the benchmark, surpassing both models despite being much smaller. For reference, the closed-source GPT-4 from OpenAI scores only 87.1%, while the improved GPT-4o sits at 90.2%, less than 2 percentage points above Qwen 2.5 Coder.
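To put those percentages in perspective, the sketch below converts them into approximate problem counts. It assumes the scores are pass rates on the standard 164-problem HumanEval set, which the article does not state explicitly; `problems_solved` is an illustrative helper.

```python
HUMANEVAL_PROBLEMS = 164  # size of the standard HumanEval set (assumed to apply here)

# Scores quoted in the article, in percent.
scores = {
    "Codestral 22B": 81.1,
    "DeepSeek Coder V2 Lite": 81.1,
    "Qwen 2.5 Coder 7B": 88.4,
    "GPT-4": 87.1,
    "GPT-4o": 90.2,
}


def problems_solved(score_percent: float, total: int = HUMANEVAL_PROBLEMS) -> int:
    """Approximate number of problems solved for a given pass rate."""
    return round(score_percent / 100 * total)


for model, score in scores.items():
    solved = problems_solved(score)
    print(f"{model}: {score}% is roughly {solved} of {HUMANEVAL_PROBLEMS} problems")

gap = problems_solved(scores["GPT-4o"]) - problems_solved(scores["Qwen 2.5 Coder 7B"])
print(f"GPT-4o leads Qwen 2.5 Coder 7B by about {gap} problems")
```

Under that assumption, the gap between Qwen 2.5 Coder 7B and GPT-4o comes down to about three problems, so small differences on this benchmark should be read with some caution.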

Another notable benchmark is Spider, which contains more than 10,000 questions paired with more than 5,000 complex, cross-domain SQL queries. This benchmark matters for anyone integrating LLMs with databases. Here, Qwen 2.5 Coder leads by a much larger margin, sitting at 82.0% while Codestral scores only 76.6%.
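Text-to-SQL benchmarks of this kind are typically scored by comparing the model's query against a reference query. The sketch below illustrates one execution-based way to do that with Python's built-in `sqlite3` module: both queries run against the same database and their results are compared. The `singer` table, its rows, and all three queries are made up for illustration, and this is not Spider's official evaluation script.

```python
import sqlite3


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True when both queries produce the same rows, ignoring row order."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted) == sorted(gold)


# Tiny in-memory database standing in for one benchmark schema (illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [("Ana", "Spain", 31), ("Bo", "France", 45), ("Cy", "Spain", 27)],
)

# Question: "How many singers are from each country?"
gold = "SELECT country, COUNT(*) FROM singer GROUP BY country"
good = "SELECT country, COUNT(name) FROM singer GROUP BY country ORDER BY country"
bad = "SELECT country, COUNT(*) FROM singers GROUP BY country"  # wrong table name

print(execution_match(conn, good, gold))  # True: different SQL, same result
print(execution_match(conn, bad, gold))   # False: the query does not run
```

The point of comparing results rather than query text is that two differently written queries can both be correct answers to the same question.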

Below is a table comparing the benchmarks of all three models, with GPT-4o on the side as a reference:

Note: Codestral’s benchmark numbers were taken from DeepSeek Coder’s comparison with Codestral, which reported higher numbers than the “official” ones from Mistral.

Going purely by the numbers, Qwen 2.5 Coder 7B outperforms both of the other open models and, in some cases, nearly matches the performance of GPT-4o.

Of course, numbers never tell the entire story. We need to test these models in real-world scenarios to get a sense of how they “perform” or “operate”.

For the experiments below, I will be running all the models locally on an M2 MacBook Air with 24GB of unified memory, using ollama with llama.cpp running under the hood. In terms of model optimizations, I will be using the Q6_K quantization for all models, which retains reasonable performance from the original model while allowing all of them to fit in my 24GB machine.

To get a sense of the model size and speed differences, here are their GGUF file sizes along with their tokens per second (t/s) running on my laptop:

  • Codestral 22B - ~18 GB (3.31 t/s)

  • DeepSeek Coder V2 Lite 16B - ~14 GB (8.35 t/s)

  • Qwen 2.5 Coder 7B - ~6.3 GB (10.31 t/s)
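To translate these throughput numbers into waiting time, here is a small sketch that estimates how long each model would take to generate a response of a given length. The 500-token response length is an arbitrary assumption chosen for illustration, and the estimate ignores prompt processing time.

```python
# File sizes (GB) and generation speeds (tokens/s) as measured in the article.
models = {
    "Codestral 22B": {"size_gb": 18.0, "tokens_per_s": 3.31},
    "DeepSeek Coder V2 Lite 16B": {"size_gb": 14.0, "tokens_per_s": 8.35},
    "Qwen 2.5 Coder 7B": {"size_gb": 6.3, "tokens_per_s": 10.31},
}

RESPONSE_TOKENS = 500  # assumed response length, chosen only for illustration


def generation_seconds(tokens: int, tokens_per_s: float) -> float:
    """Time to generate `tokens` tokens at a steady generation speed."""
    return tokens / tokens_per_s


baseline = models["Codestral 22B"]
for name, m in models.items():
    seconds = generation_seconds(RESPONSE_TOKENS, m["tokens_per_s"])
    speedup = m["tokens_per_s"] / baseline["tokens_per_s"]
    size_ratio = m["size_gb"] / baseline["size_gb"]
    print(
        f"{name}: ~{seconds:.0f} s for {RESPONSE_TOKENS} tokens, "
        f"{speedup:.1f}x Codestral's speed, {size_ratio:.0%} of its file size"
    )
```

At these rates, a 500-token answer takes roughly two and a half minutes on Codestral and under a minute on the other two models.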

Classic Games

Let’s start with a classic snake game in HTML, CSS, and JavaScript. I want to see the LLMs’ abilities in one-off coding tasks. For someone without any programming experience to guide the LLM or debug its outputs, how well can it serve to create something useful? Here’s the prompt that I used on all LLMs:

Write a basic, functional snake game in HTML, CSS, and JavaScript in one file. The player will control the snake using arrow keys.

Qwen Coder had absolutely no problem producing the code, completing the task perfectly. I even tried variations of the prompt multiple times to ensure that it wasn’t a fluke, but Qwen delivered every time.

Snake gameplay on Qwen’s implementation

Codestral, on the other hand, did successfully produce a working game but it was not without its own quirks and bugs. The collision detection isn’t exactly perfect, the speed of the snake is way too slow, and the growth of the snake is barely noticeable.

Snake gameplay on Codestral’s implementation

DeepSeek showed disappointing performance. Despite multiple turns of conversation, it was unable to produce a working implementation of a simple snake game. DeepSeek used the “addEventListener” method to listen for key presses, but they never registered in any of its implementations due to bugs in the movement code.

I also tried prompting the LLMs with other, more complex games such as 2048, Minesweeper, and Tetris. Nearly all LLMs failed at every one of these more “complex” prompts, providing code that was barely functional. The exception was Qwen 2.5 Coder 7B, the smallest of the three.

In my testing, Qwen struggled with complex games such as 2048 and Tetris, but the resulting code was usually half-functional, such as half of the movement working in 2048 or the falling blocks in Tetris working but nothing else.

For Minesweeper, however, Qwen was able to consistently produce near-perfect implementations. Below is one of the aesthetically better implementations and a (sped-up) playthrough by me.

Minesweeper gameplay on Qwen’s implementation

Here’s a summary of results:

Complex Code

Most of these models, particularly Codestral, excel at Python coding, surpassing their proficiency in other languages. I came up with a pretty challenging request, something I actually needed for my own Machine Learning projects.

Unlike a typical “one-off” task for an entire functional game, I just wanted two specific functions. Let’s see how they did.

Here’s the prompt that was given to all models:

"""

Create two Python functions:

1. `five_crop(images, masks)`:

    - Input: Image and mask batches, shape (b, 1, h, w)

    - Output: Cropped images and masks, shape (5*b, 1, h/2, w/2)

    - Perform five-crop to the image, cropping out the four corners and a center image

    - The size of the cropped image is always half the dimension of the entire image, assume the original image's dimension is divisible by 2

    - Assume all cropped and original image is square shaped where h == w

2. `reconstruct(predictions)`:

    - Input: Predictions for cropped masks, shape (5*b, 1, h/2, w/2)

    - Output: Reconstructed predictions, shape (b, 1, h, w)

    - Stitch crops back to original size

    - Average overlapping areas

Note: Both functions should handle arbitrary batch sizes and single-channel inputs.

"""

Expected output

Qwen 2.5

Qwen 2.5 Coder 7B started strong. Its five_crop function worked flawlessly but there were some troubles with the reconstruct function. Qwen seemed confused about the exact contents of the cropped masks despite multiple revisions.

The input to the reconstruct function, which is the output of the five_crop function, is a tensor containing the cropped images. Each consecutive block of b images along the first dimension holds one crop position for the whole batch: the block begins with the top-left portion of the first image, followed by the top-left portion of the second image, and so on. While Qwen appeared to grasp the logic behind image reconstruction, the indices used to extract the crops remained problematic despite repeated hints and revisions.

Qwen’s implementation

DeepSeek Lite

DeepSeek also nailed the five_crop function, but its implementation was more verbose and less clear; I much prefer the Qwen 2.5 implementation. For the reconstruct function, the code ran without errors but the result was far from a reconstruction, much like an incorrectly solved jigsaw puzzle. Additional hints and nudges provided no improvements.

Deepseek Lite’s implementation

Codestral

Codestral’s five_crop function was flawless and its readability surpasses Qwen’s, using torchvision’s crop function to crop out each region instead of the messy indexing that Qwen used. Similar to other models, it had some difficulty with the reconstruct function, but after clarifying what the leading dimension of the cropped images contains, Codestral gave a flawless implementation on the first try.

Codestral’s final implementation
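For reference, and independent of any model's output, here is a minimal sketch of what the prompt asks for. It uses NumPy rather than the PyTorch/torchvision code the models produced, and it assumes a crop order of top-left, top-right, bottom-left, bottom-right, center, stored crop-major along the first dimension as described above. The prompt itself does not fix the order of the corners, so that part is an assumption.

```python
import numpy as np


def _origins(h, w):
    """(row, col) origins of the five half-size crops.

    Assumed order: top-left, top-right, bottom-left, bottom-right, center.
    """
    ch, cw = h // 2, w // 2
    return [(0, 0), (0, w - cw), (h - ch, 0), (h - ch, w - cw),
            ((h - ch) // 2, (w - cw) // 2)]


def five_crop(images, masks):
    """Crop (b, 1, h, w) batches into crop-major (5*b, 1, h/2, w/2) batches."""
    _, _, h, w = images.shape
    ch, cw = h // 2, w // 2

    def crop_all(x):
        # The first b entries are the top-left crops of every image, and so on.
        return np.concatenate(
            [x[:, :, r:r + ch, c:c + cw] for r, c in _origins(h, w)], axis=0
        )

    return crop_all(images), crop_all(masks)


def reconstruct(predictions):
    """Stitch (5*b, 1, h/2, w/2) crops back to (b, 1, h, w), averaging overlaps."""
    n, channels, ch, cw = predictions.shape
    b, h, w = n // 5, 2 * ch, 2 * cw
    total = np.zeros((b, channels, h, w), dtype=float)
    count = np.zeros((1, 1, h, w), dtype=float)
    for k, (r, c) in enumerate(_origins(h, w)):
        # Block k holds the k-th crop position for the whole batch.
        total[:, :, r:r + ch, c:c + cw] += predictions[k * b:(k + 1) * b]
        count[:, :, r:r + ch, c:c + cw] += 1
    return total / count  # every pixel is covered by at least one corner crop


images = np.arange(2 * 1 * 4 * 4, dtype=float).reshape(2, 1, 4, 4)
crops, mask_crops = five_crop(images, images)
print(crops.shape)                                 # (10, 1, 2, 2)
print(np.allclose(reconstruct(crops), images))     # True
```

The round trip only reproduces the input exactly because the crops here are untouched; with real model predictions, the center region becomes the average of the corner and center predictions.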

“Fundamental” Knowledge

Along with longer, more complex coding tasks, I also prompted the LLMs with some basic, short questions on math and Python knowledge. These questions test fundamental concepts and assess the LLM’s ability to provide precise, accurate answers without unnecessary elaboration, skills that serve as building blocks for completing more complex requests. Moreover, they can reveal unexpected weaknesses in an LLM’s knowledge base or reasoning processes.

  1. Is 31793 a prime number?

     • Instead of directly answering, Qwen actually tried to “emulate” a Python interpreter, writing down some code and then hallucinating an incorrect “False” output.

     • DeepSeek answered incorrectly with a long string of tests checking divisibility number by number and said 31793 is divisible by 19.

     • Codestral tried to write some Python code to test the number but gave a direct, correct answer at the end of its response: “yes”.

  2. Write a Python function to reverse a string without using the built-in reverse() method.

     • Qwen wrote a functional piece of code, but instead of using the shortcut [::-1], it looped through the string in reverse. It did suggest the shortcut implementation as an “alternative”.

     • DeepSeek produced functional code as well but did not even suggest using the shortcut [::-1]. Instead, it cleverly looped through the string and prepended each character to the result, a more concise implementation than Qwen’s but not perfect.

     • Codestral used the [::-1] slicing trick and gave a concise, correct response.

  3. Calculate the area of a circle with a radius of 7.5 units. Round your answer to two decimal places.

     • Qwen calculated the area accurately to the hundredths place, 176.71, with an unnecessarily long explanation.

     • DeepSeek gave nearly the same result as Qwen: a long explanation but a correct answer of 176.71.

     • Codestral gave a concise, accurate answer: 176.71.

  4. What is the result of XORing the binary numbers 1010 and 1100?

     • Qwen answered incorrectly with “0100” (the correct result is 0110) and attempted to write Python code that it can’t run to solve the problem.

     • DeepSeek answered correctly with an unnecessarily long explanation.

     • Codestral answered correctly with a moderately long explanation.
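For readers who want to check the reference answers themselves, the four questions can be verified with a few lines of standard-library Python. The three string-reversal functions mirror the approaches described above (slicing, looping in reverse, and prepending); they are my own illustrative versions, not the models' actual output.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division by every integer from 2 up to the square root of n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))


def reverse_with_slice(text: str) -> str:
    return text[::-1]  # the idiomatic shortcut


def reverse_by_index(text: str) -> str:
    chars = []
    for i in range(len(text) - 1, -1, -1):  # walk the string back to front
        chars.append(text[i])
    return "".join(chars)


def reverse_by_prepending(text: str) -> str:
    result = ""
    for ch in text:
        result = ch + result  # each new character goes to the front
    return result


print(is_prime(31793))                      # True
print(31793 % 19)                           # 6, so 19 is not a divisor
print(reverse_with_slice("coder"))          # redoc
print(reverse_by_index("coder"))            # redoc
print(reverse_by_prepending("coder"))       # redoc
print(round(math.pi * 7.5 ** 2, 2))         # 176.71
print(format(0b1010 ^ 0b1100, "04b"))       # 0110
```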

Overall, it looks like the larger models are slightly more “intelligent”. The increased parameter size gives them more “raw” power to recall knowledge and understand context. The smallest model, Qwen, frequently wrote and pretended to run Python code instead of directly answering the question. Both Deepseek and Codestral were better at only writing code when called for. Surprisingly, only Codestral gave the most concise and widely-adopted implementation of reversing a string.

Here’s a summary of the results of all the tests done in this article across the three models:

Which One Should You Choose?

Without considering computing costs, I would recommend choosing Codestral for Python tasks and Qwen for other languages.

Although Codestral didn’t excel in the tests with browser games, its larger parameter count makes conversation feel “smoother”, as it understands language much better than smaller models. Talking to Codestral feels more like a conversation with ChatGPT, while both Qwen and DeepSeek felt like they only knew how to speak in “code”.

If memory constraints and speed are limiting factors, then Qwen is a no-brainer. It’s smarter than Codestral in some cases and requires less than half the memory, coming in at only 6.3 GB for the Q6_K quant. It can easily be run on most GPUs and all Apple Silicon Macs at a decent speed.

Of course, this probably won’t be the verdict for long, as Alibaba, the company behind Qwen, plans to release a 32B version of the coder that aims to match state-of-the-art closed-source performance. Lower quants of 32B models will fit in a machine that can run Codestral. The open-source coding LLM space is nothing less than electrifying.
