Introduction

OpenAI’s DevDay last month felt like another watershed moment in the LLM timeline and the current AI zeitgeist. The expansion of the current ChatGPT product, new APIs, and buffed-up versions of older models like Whisper captured the internet’s attention instantaneously. Although OpenAI’s new developments have extinguished many startups and projects by solving the problems those niche projects were tackling, open source projects and models still persevere. The early notion of one model to rule them all is quickly becoming obsolete. OSS models like the Llama 2 family, the Mistral models, Stable Diffusion models, and others all have a seat at the table and the benchmarks to validate their position in the model landscape.

The longstanding debate regarding the merit and impact of open source (OSS) versus closed source software has naturally carried over to recent AI model developments. This discussion isn’t just relegated to academia and transparency about model architecture; it genuinely shapes the way software (and, more broadly, technology) is developed, deployed, and democratized. Even Facebook, a historically closed loop software company, is contributing to the open source economy with its work surrounding Llama models. OpenAI released Whisper as open source. Predating LLMs (for all us millennials), Linux proved that open source software could largely dictate the construction of the internet by beating out Microsoft and Sun Microsystems. To say that either open source or closed source software is the absolute correct path would be a very naive reduction of both schools of thought.

There are several circumstances that can dictate which philosophy is the optimal development strategy. Are the builders technical or non-technical? Are they creating AI-native businesses or building out AI features for an already existing, self-sustaining product? LLMs are, after all, business-critical software. On the one hand, open sourcing AI software offers more freedom and development led by a wide range of perspectives; on the other, a closed loop model may come with more stability and user-validated structure. Clearly this is a very nuanced topic, as we now have a leaked Google memo claiming “we have no moat” and arguing that larger companies are losing the AI race to open source development. These are probably hyperbolic claims, but they go to show that the open source vs closed source debate is a complex, tangled one. Let’s take a deeper look at the juxtaposition between the two.

Contrasting Open Source and Closed Source in the LLM Landscape 

The Economic Edge of OSS Models 

OSS LLMs have compelling economic appeal. These models, which include the likes of Llama 2, Mistral 7B, SDXL, and Falcon, promise not only cost savings from optimized hardware infrastructure but also operational efficiency. As OSS models have proliferated, the hardware they run on has become more cost-effective. This is evident in the reduced latency and minimized cold starts with these OSS LLMs, which are critical in real-time applications. With optimized architectures and interfacing layers, hardware instances are much cheaper to run models on (although a looming GPU bubble may have something else to say).

The open source paradigm allows for extensive validation and exploration, enabling developers to delve under the hood of these models. If developers only have access to an API, validation feedback loops can quickly accumulate and become expensive. With the autonomy open source delivers, developers can inspect a model throughout the inference flow to diagnose issues, install checks and validations, and add tweaks for desired outputs. This transparency fosters a deeper understanding of, and control over, both the development process and the data, ensuring that the models are not just effective but also adaptable to various contexts.

Moreover, the cost-effectiveness of OSS models extends beyond the initial deployment. In the long term, these models facilitate ongoing modifications and improvements without the constraints of licensing fees or proprietary restrictions. With OSS models, many domain-specific problems can be solved by fine-tuning the base model to increase its problem-specific efficacy. Fine-tuning an OSS model is often cheaper and more flexible than relying on a broader model’s closed loop API to solve a niche problem. Some closed source model companies also charge considerably more to run fine-tuning jobs on their models than to simply use their API.

Controlling the entire stack of model instantiation is only getting cheaper as open source AI matures. This is particularly beneficial for startups and smaller organizations, for which the cost of closed source solutions can be much more of a capital constraint.

The Practicality of Closed Loop Models 

In contrast, closed loop models, typically products of proprietary environments, offer a straightforward integration path. They require minimal configuration and are often plug-and-play, significantly reducing the technical barrier for adoption. This ease of use is especially advantageous for non-technical users or businesses looking to incorporate AI capabilities without extensive developmental resources. Furthermore, closed source models often come with dedicated support and continuous updates from their parent companies. This ensures that the models remain reliable and up-to-date with the latest advancements in AI, providing a level of stability and ongoing refinement that might be challenging to achieve in open source projects. 

The Community-Driven Innovation of Open Source 

Open source models benefit enormously from their communities. These communities, composed of developers, researchers, and enthusiasts, contribute not just to the refinement and training of the models, but also greatly expand their applicability. In open source projects, ideas and innovations are shared freely. This collaborative environment enables rapid iteration and experimentation, often resulting in models that are more versatile and capable of addressing a wider gamut of problems. This communal aspect also fosters a sense of ownership and investment among the contributors, driving further innovation. With each contributor bringing their unique perspective and expertise, open source models evolve in ways that might be unforeseen in a more controlled, closed source environment. Model and product development consolidated within a centralized organization may sometimes generalize to the public less well than an open source offering, simply because open source developers have strength in numbers. Being their own customers gives open source developers a deeper understanding of the model’s use cases, driving the offering’s efficacy higher.

The Structured Approach of Closed Source Models 

Closed source models can, however, offer a more structured approach to AI development. These models are typically product-validated and market-ready, designed to meet specific business needs. The backing of a corporate entity often means that these models have access to extensive resources, technical expertise, and a ready market. The corporate backing also means that closed source models often come with comprehensive documentation, training, and support, making them more accessible to businesses that may not have extensive AI expertise. This structured approach can lead to faster deployment and integration into existing business processes, providing a clear pathway to leveraging AI technologies. Even though open source communities can leverage their size, a closed source development culture can still outperform an OSS offering. As you can see, it’s difficult to say which of the two is superior.

Control and Security

OSS models epitomize the democratization of technology. They operate under the principle of collective contribution and oversight, creating a data flywheel powered by a diverse community. This democratic approach affords users greater control over the model's development and usage, allowing for transparency and adaptability. However, the very transparency that empowers users also makes these models more susceptible to exploitation. The question then arises: in an era where AI is becoming increasingly powerful, is it responsible to leave such potent tools unchecked in the open source domain? The risk of misuse or malicious modification of OSS models as they become more capable can grow more significant, even if it’s just currently doomer speculation. Just listen to OpenAI’s Ilya Sutskever’s take on the matter.

Contrastingly, closed source LLMs present a more controlled environment. The flywheel here is carefully managed. This controlled input leads to a more secure development path, minimizing any risks of data corruption or model misuse. However, this control comes at a cost. The closed nature of these models often means less transparency and user control over the model's workings and decision-making processes. Users of closed source LLMs rely on the assurances of the providing companies for security and ethical use, which might not always align with their beliefs. Data security is a huge issue in this day and age, especially with the exploitation of LLMs through jailbreaking methods and data leaks. It could be much more prudent for builders to have full control over their users’ data as opposed to feeding it with no oversight to a closed loop model. 

These points are just a microcosm of the larger debate. Clearly, the juxtaposition between OSS models and closed source models is anything but binary. It’s largely a bunch of gray areas that need much more empirical venturing to decide which path is best suited for the user. In the spirit of the developer experience, let’s try using both open source and closed source models and see what both entail! Since we are hosting a state of the art foundational ASR model, let’s observe the difference between using Deepgram’s closed source Nova-2 model and using OpenAI’s open source Whisper model. We’ll hack out some very rudimentary Python to understand what the onboarding experience and utility of both these models is like.

Basic Transcription with ASR Models

Deepgram - Nova

Deepgram’s API allows for transcription of either pre-recorded audio or a real-time audio stream. For the sake of simplicity and demonstration, we’ll be working with a pre-recorded audio file. First, we need to ensure that we have the necessary Python dependencies installed to perform transcription. We’ll need both the Deepgram Python SDK along with the FFMPEG library to handle any multimedia encoding and decoding in Python.

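pip[3] install requests ffmpeg-python
pip[3] install deepgram-sdk --upgrade
  • Note: Use pip3 if that’s the version of pip you have installed on your machine.
  • Note: The script below uses the Deepgram class from version 2 of the Python SDK. Later major versions expose a different client interface, so if you hit errors related to the Deepgram class you may need to pin an older release (for example, pip[3] install "deepgram-sdk<3").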

Now, let’s create a project directory to hold the transcription scripts for both models (we’ll call it “OSS”). We’ll also place a pre-recorded MP3 file, sample_audio.mp3, at the same level as our transcription scripts.

Now, we can write out our main transcription block. First, if you haven’t already, sign up for a Deepgram API key at https://console.deepgram.com/signup. If you run the code below (formatted to your directory and machine) and still receive errors, you may need to update the Deepgram Python SDK or check how many credits you still have remaining in your account.

We’ll insert the API key where prompted, specify the audio format, and optionally the directory at which the audio files are held (here we just have our audio at the root so no need to explicitly define an audio directory).

from deepgram import Deepgram
import asyncio, json, sys

We’ll import the packages above into our script to ensure the API returns a proper transcription response. Now we’ll declare our API key (remember to obfuscate this and not to commit any secrets to version control), the audio file path, and the mimetype (file type) of our audio file.

# Your Deepgram API Key
DEEPGRAM_API_KEY = 'YOUR_API_KEY_HERE'

# Location of the file you want to transcribe
FILE = 'sample_audio.mp3'

# Mimetype for the file you want to transcribe
# (audio/mpeg is the registered MIME type for MP3 files)
MIMETYPE = 'audio/mpeg'
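
As a hedged alternative to hardcoding these values, here is a minimal sketch that reads the key from an environment variable and guesses the mimetype from the file extension. The helper names load_api_key and guess_audio_mimetype, and the environment variable name, are our own illustrative choices rather than part of the Deepgram SDK; only the Python standard library is used.

```python
import mimetypes
import os


def load_api_key(env_var="DEEPGRAM_API_KEY"):
    """Read the API key from an environment variable instead of hardcoding it.

    The variable name is an assumption for this sketch; use whatever your
    deployment or secrets manager provides.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


def guess_audio_mimetype(path, default="application/octet-stream"):
    """Guess a MIME type from the file extension using the standard library."""
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or default
```

With this in place, the constants above could become DEEPGRAM_API_KEY = load_api_key() and MIMETYPE = guess_audio_mimetype(FILE), which keeps the secret out of version control.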

We can now proceed with calling Deepgram’s transcription API. We’ll initialize the SDK by setting up the client, pass in the file path to our audio file, and then write the response to a JSON file so we can inspect the transcription results. We’ll also be using Deepgram’s state of the art ASR model, Nova-2.

async def main():
    # Initialize the Deepgram SDK client with the API key
    deepgram = Deepgram(DEEPGRAM_API_KEY)

    # Check whether requested file is local or remote, and prepare source
    if FILE.startswith('http'):
        audio = None
        source = {'url': FILE}
    else:
        # Open the audio file and keep it open until the request completes
        audio = open(FILE, 'rb')
        source = {
            'buffer': audio,
            'mimetype': MIMETYPE
        }

    # Send the audio to Deepgram and get the response
    try:
        response = await deepgram.transcription.prerecorded(
            source,
            {
                'smart_format': True,
                'model': 'nova-2',
            }
        )

        # Write the response to the console
        print(json.dumps(response, indent=4))

        # Write the response to JSON file
        with open('transcription_results_deepgram.json', 'w') as file:
            json.dump(response, file, indent=4)
    finally:
        # Close the local file (if one was opened) once the request has finished
        if audio is not None:
            audio.close()

We can call our asynchronous main function as such along with some custom traceback logging:

if __name__ == "__main__":
    try:
        asyncio.run(main())
    except Exception as e:
        # Report the exception type, message, and the line where it surfaced
        exception_type, exception_object, exception_traceback = sys.exc_info()
        line_number = exception_traceback.tb_lineno
        print(f'line {line_number}: {exception_type} - {e}')

If the script executes with no errors, we’ve successfully transcribed the audio file and should see the results of the transcription echoed to the console as well as a JSON output file containing the results of the transcription and other data associated with it.

Further inspecting the transcription JSON file, we can see all the metadata as well as the actual textual results of the transcription. Deepgram’s API returns the entire transcript as a string, confidence scores for each word in the transcript, as well as the transcript split by sentence and paragraph. This is a pretty extensive response that provides a lot of information for many different transcription use cases.

Overall, using the Deepgram API was pretty straightforward, with minimal barriers to entry. The code needed to call the API isn’t long-winded, and the transcription results are quite accurate and rich with plenty of other insights. Obviously, the main consideration to weigh here is the cost of using the transcription API. Deepgram currently bills the Nova model at $0.0044 per minute of pre-recorded audio. At first glance that cost may seem negligible, but as different use cases build and scale up, it can grow significantly.
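
To make the response structure described above a little more concrete, here is a minimal sketch that pulls out the full transcript and flags low-confidence words. It assumes the pre-recorded response nests its results by channel and then by alternative; the function name, the 0.8 threshold, and the sample response are made up for illustration, so check them against your own output file.

```python
def summarize_deepgram_response(response, min_confidence=0.8):
    """Return the full transcript and any words below a confidence threshold.

    min_confidence is an arbitrary illustrative threshold, not a Deepgram default.
    """
    # Assumed layout: results -> channels -> alternatives -> transcript / words
    alternative = response["results"]["channels"][0]["alternatives"][0]
    low_confidence = [
        (word["word"], word["confidence"])
        for word in alternative.get("words", [])
        if word["confidence"] < min_confidence
    ]
    return alternative["transcript"], low_confidence


# Tiny made-up response with the same shape, for demonstration only
sample_response = {
    "results": {
        "channels": [{
            "alternatives": [{
                "transcript": "hello world",
                "confidence": 0.95,
                "words": [
                    {"word": "hello", "start": 0.0, "end": 0.4, "confidence": 0.99},
                    {"word": "world", "start": 0.5, "end": 0.9, "confidence": 0.62},
                ],
            }]
        }]
    }
}

transcript, shaky_words = summarize_deepgram_response(sample_response)
print(transcript)
for word, confidence in shaky_words:
    print(f"low confidence: {word!r} ({confidence:.2f})")
```

To run this against the real output, load transcription_results_deepgram.json with json.load() and pass the resulting dictionary to the function in place of the sample.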

OpenAI - Whisper

Now we’ll try out the same transcription task using OpenAI’s open source model Whisper. There are a few different ways to invoke the model. We could use the OpenAI client, but in the spirit of taking the open source route, we’ll be using the Whisper Python library. Note that the library relies on the ffmpeg command-line tool being installed on your system to decode audio. The package can be installed using:

pip[3] install -U openai-whisper

Even though this is an open source model and might seem to require more supporting code to instantiate a functioning transcriber, the library is just as lightweight as Deepgram’s SDK, if not more so. In fact, it takes less code to kick off transcription with the Whisper library. After installing the Whisper library using pip, import the following packages:

import whisper
import json

Next, let’s write the core transcription function. We’ll be passing in the loaded Whisper model instance (we’ll be using the “base” model) as well as the file path of the audio file.

def transcribe_audio(model, audio_path):
    """
    This function takes a Whisper model instance and an audio file path,
    loads the audio, and performs transcription.
    """
    # Load the audio file as a waveform
    audio = whisper.load_audio(audio_path)
    # Uncomment to pad or trim the audio to exactly 30 seconds
    # audio = whisper.pad_or_trim(audio)

    result = model.transcribe(audio)

    return result

The pad_or_trim() function pads or trims the audio to exactly 30 seconds, so uncommenting it limits transcription to the first 30 seconds of the file. Under the hood, the transcribe() function reads the entire audio file and then uses a sliding window of length 30 seconds to traverse it. Each window is processed with autoregressive sequence-to-sequence predictions, and the results are combined to compose the overall transcription of the file. Whisper’s open source library provides a lot of custom functionality, such as creating the log-mel spectrogram of the audio, translating the audio (Deepgram’s API has this capability as well), providing context with an initial prompt, and more granular control over audio decoding.
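
To illustrate the windowing idea described above, here is a simplified, pure-Python sketch. It is not Whisper’s actual implementation: it operates on plain lists, uses a fixed 30-second stride, and zero-pads the final window, whereas the real library may advance its window differently (for example, based on predicted timestamps). The function names are our own, and the 16 kHz sample rate reflects the rate Whisper resamples audio to.

```python
SAMPLE_RATE = 16_000                       # samples per second (assumed 16 kHz audio)
WINDOW_SECONDS = 30                        # window length described above
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def pad_or_trim(samples, length=WINDOW_SAMPLES):
    """Simplified stand-in for whisper.pad_or_trim, operating on a plain list."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


def fixed_windows(samples, window=WINDOW_SAMPLES):
    """Yield consecutive fixed-size windows, zero-padding the last one."""
    for start in range(0, len(samples), window):
        yield pad_or_trim(samples[start:start + window], window)


# 70 seconds of dummy audio is split into three 30-second windows
dummy_audio = [0.1] * (70 * SAMPLE_RATE)
windows = list(fixed_windows(dummy_audio))
print(len(windows), [len(w) for w in windows])
```

Each window would then be converted to a log-mel spectrogram and decoded; the sketch only shows how a long file can be reduced to fixed-length inputs.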

Now we can invoke and run the transcribe_audio() function in our script as such:

def main():
    # Download and load the Whisper model
    model = whisper.load_model("base")
    # The path to audio file
    audio_path = 'sample_audio.mp3'

    # Transcribing audio
    transcription = transcribe_audio(model, audio_path)

    # Write transcription results to JSON file
    with open('transcription_results_whisper.json', 'w') as json_file:
        json.dump(transcription, json_file, indent=4)

    print("Transcription:")
    print(transcription)

if __name__ == "__main__":
    main()

Here, we specify the file path to the audio sample as well as the model size (which can be one of “tiny”, “base”, “small”, “medium”, or “large”, with larger models generally trading speed for accuracy). Once again, the script should execute with no errors, print the transcription results to the console, and generate the transcription JSON file. Let’s inspect the response in the JSON file.

The entire transcription is returned along with the segmented transcription. Each of the segment objects contains the respective timestamps, token IDs, average log probability (how likely the segment’s sequence of tokens is), etc. Just like the response from Deepgram’s model, Whisper’s response is quite substantial, with a rich amount of information returned. Similarly, both models are quite easy to invoke, with a low technical lift evidenced by the minimal amount of code required to receive a transcription of our audio sample.
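
As a rough sketch of how the segment data described above might be used, the snippet below formats each segment with its timestamps and converts the average log probability into an approximate per-token probability. The function name and the sample result are made up for illustration, and the -1.0 cutoff is only an illustrative threshold; adjust it to taste.

```python
import math


def format_segments(result, min_avg_logprob=-1.0):
    """Build one readable line per segment, flagging low-probability segments.

    min_avg_logprob is an illustrative cutoff chosen for this sketch.
    """
    lines = []
    for segment in result["segments"]:
        # exp(avg_logprob) approximates the geometric-mean token probability
        probability = math.exp(segment["avg_logprob"])
        flag = " (low confidence)" if segment["avg_logprob"] < min_avg_logprob else ""
        lines.append(
            f"[{segment['start']:.2f}s - {segment['end']:.2f}s] "
            f"{segment['text'].strip()} p~{probability:.2f}{flag}"
        )
    return lines


# Made-up result with the same shape as the Whisper output, for demonstration
sample_result = {
    "text": " Hello there. Mumble.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello there.", "avg_logprob": -0.2},
        {"start": 2.5, "end": 5.0, "text": " Mumble.", "avg_logprob": -1.5},
    ],
}

for line in format_segments(sample_result):
    print(line)
```

To apply this to the real output, load transcription_results_whisper.json with json.load() and pass the dictionary to format_segments().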

Comparing the Two

Now that we’ve gone through the steps for basic usage of both of these speech-to-text models, some of the notions regarding the experience of using an open source versus a closed source model come into sharper focus. Both models were relatively easy to onboard. The amount of code, as mentioned above, was quite minimal. Deepgram’s API required slightly more complex code (use of Python’s built-in asynchronous library, an API key, etc.), but because the documentation is explicitly clear, there wasn’t much of an onboarding lag. Whisper’s model was quite easy to invoke as well. The documentation is quite extensive, and the code required to run the model is minimal too. This further reflects the fact that open source models often do have sufficient support behind them.

Whisper is a somewhat unusual case, since it is an open source model released by a company best known for its closed source products. However, many open source models have corporate backing (e.g., the Llama models), so it’s reasonable to assume the onboarding experience for these OSS models will come with the same quality of documentation as invoking a closed source model’s API. With the added support of community contributors to the Whisper project, it’s no surprise that the Python library is a well-encapsulated wrapper around the core ASR model.

If we want to examine these OSS models through a more purist lens, then yes, the onboarding experience is most likely going to be much more complex. Hypothetically, we could’ve pulled the Whisper code and weights from the main repository (both were released under the MIT License), implemented our own model and transcription functions, and hosted the model in a proper model serving environment. This would’ve meant selecting a sufficient hardware instance to run the model (most likely using GPUs), configuring a load balancer to handle inbound transcription requests, and many more downstream infrastructure tasks. This obviously incurs more infrastructure costs than just locally tapping into an existing Python library, but for the sake of simplicity and expediency we went with the latter.

Overall, both the models were quite simple to invoke. Both had pretty solid support from their parent company/the developer community. The costs to leverage either model do stack up in their own way (given how we use the open source model). Whisper and Deepgram’s respective open source and closed source models both have a relatively non-complex onboarding experience.
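
To get a rough feel for how those costs might stack up, here is a back-of-the-envelope sketch comparing a per-minute API price with a self-hosted GPU setup. Only the $0.0044 per minute figure comes from this article; the GPU hourly price, the throughput, and the fixed monthly overhead are purely hypothetical placeholders, and the model ignores idle capacity, engineering time, and other real-world costs.

```python
API_PRICE_PER_MINUTE = 0.0044        # USD, the pre-recorded rate quoted earlier
GPU_PRICE_PER_HOUR = 1.00            # USD, hypothetical placeholder
AUDIO_MINUTES_PER_GPU_HOUR = 600     # hypothetical throughput (10x real time)


def api_cost(audio_minutes):
    """Cost of transcribing the given audio minutes through the paid API."""
    return audio_minutes * API_PRICE_PER_MINUTE


def self_hosted_cost(audio_minutes, fixed_monthly_overhead=0.0):
    """GPU time for the same audio plus a fixed monthly overhead (hypothetical)."""
    gpu_hours = audio_minutes / AUDIO_MINUTES_PER_GPU_HOUR
    return gpu_hours * GPU_PRICE_PER_HOUR + fixed_monthly_overhead


def break_even_minutes(fixed_monthly_overhead):
    """Monthly audio minutes at which both options cost the same, if any."""
    self_hosted_per_minute = GPU_PRICE_PER_HOUR / AUDIO_MINUTES_PER_GPU_HOUR
    saving_per_minute = API_PRICE_PER_MINUTE - self_hosted_per_minute
    if saving_per_minute <= 0:
        return None  # self-hosting never catches up under these assumptions
    return fixed_monthly_overhead / saving_per_minute


for minutes in (1_000, 100_000, 1_000_000):
    print(minutes, round(api_cost(minutes), 2), round(self_hosted_cost(minutes, 500.0), 2))
print("break-even minutes:", break_even_minutes(500.0))
```

The takeaway is about shape rather than exact numbers: API costs scale linearly from zero, while self-hosting carries fixed overhead that only pays off past some volume, and that volume depends entirely on the assumptions plugged in.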

Future of Open Source

Both OSS and closed source models will continue to emerge and evolve as research progresses and breakthroughs occur. If anything, the debate will grow, but after juxtaposing both classes of models and getting some brief hands-on experience with them, it’s clear that neither is an absolute answer to the future of large-scale AI models. The structure and resources that commercial backing gives closed source models will make them a strong offering for those looking to innovate and build in the AI space without getting tangled in the development process. This inevitably leads to further proliferation of these models and creates more incentive to innovate with AI. Open source models achieve this too, while also offering the ability to own the entire model, customize and fine-tune it for a specific use case, and decrease dependency on another organization at scale. As hardware instances and model scaling techniques become cheaper, OSS models become much more feasible.

As these open source models mature, safety will loom as a concern, but with proper guardrails in place it shouldn’t be a large deterrent. Both types of models will continue to develop at a fast pace, and the future will consist of a vast array of models to choose from. Pick the one that suits your needs best, but be cognizant of the tradeoffs unique to your development circumstances.
