From Zero to Python Hero: AI-Fueled Coding Secrets Exposed with Gorilla, StarCoder, Copilot, ChatGPT
Jose Nicholas Francisco
If you’d rather watch than read, check out the video version of this blog here:
As they say on AI Twitter: “AI won’t replace you, but a person who knows how to use AI will.”
That’s right. The very engineers who forged AI from the fiery depths of linear algebra are now (at least tacitly) required to use their creation… lest they fall behind in the technological innovation race.
In other words, having an AI assistant help you code boosts your productivity immensely. Not only does the AI write quick, trivial functions for you, but such models can also help you debug and comment your code near-instantly.
That being said, there’s one problem: Numerous AI coding assistants exist, and we don’t know which one is the best. It’d be rather difficult and time-consuming to test them all. There are just too many. Not to mention, we need to take into consideration the cost of each one, both in terms of money and onboarding time. After all, the learning curve for these AI tools should be reasonably shallow. The more intuitive a tool is to use, the more attractive it becomes.
So today, I’ve tested some of the most popular AI coding assistants, so that you don’t have to! More specifically, I’m testing out StarCoder, GitHub Copilot, ChatGPT, and Gorilla.
At the end, I’ll give you my recommendation for which AI model (or combination of models 👀) is best. Though if you have thoughts of your own on the matter, I’d love to hear them! (Tweet at us @DeepgramAI 😄)
Note: This blog post is *not* sponsored by any of these models or their respective companies/labs. This is simply my opinion. I ran this experiment out of legitimate curiosity and deep personal intrigue.
How I tested these models
As a proud citizen of Silicon Valley, I decided to go down the classic “Coding Interview” path to test each of these models’ skills. Yes, coding interviews aren’t a perfect means of assessing someone’s programming ability. But hey, it’s the best we’ve got.
So I asked each of the AI models the same two coding questions. Both in Python. The first question is a classic logic problem that requires no prerequisite knowledge of libraries or fancy algorithms. Meanwhile, the second question tests each model’s ability to use an API.
In particular, the two questions I asked are the following:
1. Write a function that removes all the vowels from a string that the user inputs. Your function must ask the user to type the input string. Assume the string is at most 2000 characters long, but otherwise has no restrictions.
2. Write a function that takes as input the name of an audio file and uses the Deepgram API (with help from the Deepgram Python SDK) to transcribe it. This function should print the JSON output to the console.
For reference, here are my solutions to the problems. First, to remove vowels from a user-input string:
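The snippet itself isn't reproduced in this text, so what follows is only a minimal sketch of a solution along the lines this post describes: prompt the user, then keep every character that isn't in the "aeiouAEIOU" string. The function name and the `read` parameter are assumptions made for illustration; `read` exists solely so the function can be exercised without a keyboard.

```python
def remove_vowels_from_input(read=input):
    """Ask the user for a string and return it with all vowels removed.

    `read` is a hypothetical hook (defaulting to the built-in input) that
    makes the function easy to test; it is not part of the original question.
    """
    vowels = "aeiouAEIOU"
    text = read("Enter a string: ")
    # Keep only the characters that are not vowels.
    result = "".join(ch for ch in text if ch not in vowels)
    print(result)
    return result
```

Calling `remove_vowels_from_input()` with no arguments prompts on the console, which is what the question asks for.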
And second, to transcribe a pre-recorded audio file:
You can find that ^ code snippet in the Deepgram SDK, by the way 😉.
Alright, now that we have our questions and answer key, let’s see how each AI coder performs:
StarCoder caught the eye of the AI and developer communities by outperforming all other open-source LLMs, boasting a score of 40.8 percent on the HumanEval benchmark, which is higher than even some bigger models.
Whichever method you choose, StarCoder works in the same way. It simply auto-completes any code you type. In fact, all I did to test StarCoder was write the following comment in VSCode:
# A function that removes all the vowels from a string that the user inputs
And after a few seconds, the model auto-completed with the following code:
… And I’ll admit that StarCoder’s solution is better than mine.
Yes, our functions do the exact same thing and follow the exact same logic, even down to the “aeiouAEIOU” string. However, I believe StarCoder’s code is better since it uses more descriptive variable names than I do, and it includes line-by-line comments that help programmers of all levels understand the function’s flow.
Long story short, if StarCoder were interviewing with me for a SWE role, I’d give it the job. (Or at least let it move onto the next round of interviews.)
And speaking of the next round of interviews, the image below reveals how StarCoder performed on the second coding question. Once again, all I did to elicit this response from the AI was write the big green comment. All other in-line comments were written by StarCoder itself.
Yes, it’s quite a bit of code. And if you don’t already know how the Deepgram API works, you’d have to check this output against some documentation. However, let me spare you the effort: StarCoder’s code won’t work. The most apparent bug here is the fact that it doesn’t use Python’s built-in open() function to read the file specified by file_name.
Nevertheless, the model does provide a good starting point for developers who are new to the Deepgram API. So instead of writing the entire function from scratch, a developer will simply need to do some minor debugging to make StarCoder’s code fully functional. And a particularly crafty developer would ask another AI to debug this code for them. But more on that later.
If you’re curious what this debugging would look like, by the way, see the image below. The fix simply boils down to using open() and pretty-printing the JSON.
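In code, that fix might look something like the sketch below. It assumes the v2 Deepgram Python SDK (the version that exposes `deepgram.transcription.sync_prerecorded()`), a client created elsewhere with `Deepgram(API_KEY)`, and a WAV file. The function name, the `mimetype` default, and the `punctuate` option are illustrative choices, not a reproduction of the original screenshot.

```python
import json


def transcribe_file(file_name, deepgram, mimetype="audio/wav"):
    """Transcribe a local audio file and pretty-print the JSON response.

    `deepgram` is assumed to be a client from the v2 Deepgram Python SDK,
    e.g. `Deepgram(API_KEY)` after `from deepgram import Deepgram`.
    """
    # The step the generated code skipped: open the file in binary mode.
    with open(file_name, "rb") as audio:
        source = {"buffer": audio, "mimetype": mimetype}
        response = deepgram.transcription.sync_prerecorded(
            source, {"punctuate": True}
        )
    # Pretty-print the JSON instead of dumping the raw dictionary.
    print(json.dumps(response, indent=4))
    return response
```

Passing the client in as an argument keeps the API key out of the function and makes it easy to swap in a stand-in client when testing.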
Overall, StarCoder did pretty well! It didn’t write perfect code every time, but that’s okay. Most humans can’t even write code perfectly every time, anyway. StarCoder works well for writing simple functions. But you can also use it if you’d rather spend your time debugging pre-written code than writing entire methods from scratch.
It takes about five minutes to see the two biggest differences between GitHub Copilot and StarCoder. The first is the price 💰.
Yeah… Copilot is going to ask to see your wallet before helping you with anything. Ten bucks a month or a hundred per year.
But once you get past the price barrier, you’ll see the second difference between Copilot and StarCoder: The latency.
For the vowel-removal question, Copilot seemed about 2x faster than StarCoder when autocompleting. These speed differences may vary for you depending on your machine and setup. But in my experience (and my colleagues’), Copilot is quite consistently faster than its astral counterpart.
And if you’re curious what Copilot comes up with for that question, here ya go! 👇
Copilot’s approach differs from StarCoder’s solution and mine (and is admittedly less efficient, due to its repeated calls to the .replace method). However, it still achieves perfect functionality.
And when it comes to the API transcription question, Copilot runs into an issue similar to StarCoder’s:
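Copilot's exact output isn't shown in this text, but a .replace-based approach generally looks like the sketch below (the function name is made up for illustration). It makes one full pass over the string per vowel, ten passes in total, whereas the filtering approach needs a single pass. That is where the extra cost comes from.

```python
def remove_vowels_with_replace(text):
    """Remove vowels by calling str.replace once per vowel."""
    for vowel in "aeiouAEIOU":
        # Each call scans the whole string and builds a new one.
        text = text.replace(vowel, "")
    return text
```

For strings capped at 2000 characters, the difference is unlikely to be noticeable in practice.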
It doesn’t use open() and it gets the name of the transcription function wrong. It should be using deepgram.transcription.sync_prerecorded() rather than the non-existent deepgram.transcribe().
That being said, Copilot follows sound logic in the main() function. It first initializes an instance of Deepgram using the API_KEY constant. Then it attempts to use some API call to transcribe the file specified by the FILENAME constant. And finally, it prints the response to the console.
While this logic is sound, it’s clear that Copilot didn’t read the documentation or the SDK mentioned in the comment. However, much like StarCoder, Copilot provides a good starting point for developers, especially those new to the Deepgram API.
Fun fact: Copilot is built on top of a GPT-3 model… meaning its “brain” isn’t that far off from ChatGPT’s.
But how does ChatGPT itself compare?
To test ChatGPT, I simply used the classic OpenAI interface. The prompt I engineered reads, “You are an expert software engineer with expertise in Python. You also know how to use various APIs such as Deepgram. Your job is to help me write code and explain what the code we write means. I will ask you a coding problem, and you will respond with Python code that accomplishes the task I’m trying to achieve.”
For the vowel-removal question, ChatGPT’s output is unique in that it actually decomposes the task into two separate functions. The first is a main function that takes care of (1) asking the user for input and (2) printing the result. The second is a helper function called remove_vowels(string). All together, ChatGPT’s results look like this:
And, as expected, it’s correct!
As for the (clearly harder) API question, ChatGPT falls into the same traps as its cousin models above. But to its credit, ChatGPT explicitly mentions pip installing dependencies, which none of the other models did. Not to mention, ChatGPT also parsed the output JSON correctly. ChatGPT can’t run the code to discover the output JSON’s shape for itself, so we must conclude that it at least picked that shape up from the SDK or Deepgram documentation. Impressive!
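For context, "parsing the output JSON" here means digging the transcript out of a nested response. The sketch below assumes the response shape used by Deepgram's pre-recorded transcription (results, then channels, then alternatives). The helper name and the sample dictionary are hand-written stand-ins, heavily abbreviated, and not real API output.

```python
def extract_transcript(response):
    """Pull the first transcript out of a pre-recorded transcription response."""
    return response["results"]["channels"][0]["alternatives"][0]["transcript"]


# Hand-written, heavily abbreviated stand-in for a real response.
sample_response = {
    "results": {
        "channels": [
            {"alternatives": [{"transcript": "hello world", "confidence": 0.99}]}
        ]
    }
}
print(extract_transcript(sample_response))
```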
However, ChatGPT’s usefulness doesn’t merely lie in its ability to code. In fact, I’d argue that ChatGPT is best utilized not for its code-writing skills but rather its debugging skills.
Specifically, whenever I’m too tired (or, let’s be honest, too lazy) to find bugs that I wrote myself, I will head over to ChatGPT and ask it to find my bugs for me. And it works! For example, I wrote some buggy code for the vowel-removal problem, and asked ChatGPT to point out what was wrong.
Here’s the buggy code. Bonus points if you can pinpoint what’s wrong with it:
Spoiler alert: The bug is that I am replacing every vowel with a space rather than the empty string. And when I ask ChatGPT to debug this code for me, not only does it explain what’s wrong but it also rewrites the code to make it work!
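The buggy snippet isn't reproduced in this text, so here is a hypothetical reconstruction of that kind of bug next to its fix. It assumes a .replace-based loop; the function names are invented for illustration.

```python
VOWELS = "aeiouAEIOU"


def remove_vowels_buggy(text):
    for vowel in VOWELS:
        text = text.replace(vowel, " ")  # Bug: inserts a space for each vowel.
    return text


def remove_vowels_fixed(text):
    for vowel in VOWELS:
        text = text.replace(vowel, "")  # Fix: replace with the empty string.
    return text
```

The two versions differ by a single character, which is exactly the kind of slip that is easy to miss when you're tired and easy for a second pair of eyes (human or AI) to catch.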
Last but not least, we have the newest kid on the block, Gorilla. This language model was developed by Berkeley and Microsoft.
What makes this LLM so special is that it specializes in making API calls. As of right now, however, it is limited to APIs from Hugging Face and similar resources, so it won't be able to use the Deepgram API.
Still, it's an incredible resource, and you can try it in under 60 seconds! No API key needed. Just check out this Google Colab. They install all dependencies for you within the Colab environment. There you can see Gorilla in action: it not only writes API calls that can, for example, translate text, but also describes the code and credits the API provider explicitly.
While it is still limited in scope, Gorilla shows a promising future for AI models that code. Unlike Copilot and ChatGPT, Gorilla is a fine-tuned version of LLaMA, not GPT. That is, Gorilla’s brain stems from Meta’s collection of large language models, whose sizes range from 7 billion to 65 billion parameters.
This difference in models may slightly affect performance, but as of right now we don’t have enough information to conclude whether a LLaMA model or a GPT model is better suited to write code.
All we know is that both models are capable of filling in the blanks—aka autocompleting—any block of code that isn’t fully fleshed out.
Conclusion and Recommendations
Alright, that was a lot of information. Let’s take a step back. We’ve got some questions to answer.
What have we learned from these results? What is the most practical use of these models today in 2023? And what are my recommendations?
Well, first things first: Do not use AI to teach yourself how to code. If you do, you'll be as bad a programmer as a recently released AI model. And since we're only at the *beginning* of the AI revolution, understand that these models are baby programmers. They're novices.
Learn from the experts.
That is, AI models that can code are useful to experienced programmers who want to accelerate the pace at which they already develop.
My personal recommendation is to use GitHub Copilot with ChatGPT on the side to help you debug. You can use StarCoder with ChatGPT instead, but I'm not a big fan of the latency. That said, it's the pairing to go with if you don’t want to spend the money on Copilot.
And, of course, I'm still waiting for Gorilla to take itself out of the Hugging Face bubble.
Nevertheless, these tools not only accelerate my work, but they also make my code more readable by virtue of their descriptive variable names and their in-line comments. After all, computer code is meant to be written *by* humans *for* humans so that we programmers can communicate our ideas with each other as simply as possible.
Now, if you'll excuse me, I'm going to take a break from coding and rejuvenate my brain. As much as I love AI, I love caffeine even more.