How to Speak AI, Revisited: An Intermediate Dictionary
Jose Nicholas Francisco
I see we’ve crossed paths again, fellow internet traveler 😺
In the first chapter of our AI dictionary, we covered terminology like “Neural Networks” and “Vectorization.” Now that we’ve covered the fundamentals, we’re going to delve a little deeper into the world of AI.
Specifically, we’re going to discuss real-world AI models, algorithms, and techniques. That way when you hear someone at a cocktail party say, “PyTorch may have a more efficient backprop implementation than TensorFlow,” you’ll know exactly what they mean. 🍸
Let’s get started!
[In]famous AI Models
ChatGPT: So this term may not need much of an introduction. ChatGPT is an AI model created by OpenAI that has hit the mainstream. It can hold a conversation with you, play Tic-Tac-Toe with you, write you a haiku, and even read code! You can speak with it here.
Bard: This is Google’s version of ChatGPT. While OpenAI’s chatbot is built on the impressive GPT-3.5, Bard is built on LaMDA. That is, both bots serve the same purpose—to generate text for you—but they have differently wired “brains”. Check out the comparisons here.
Bing Chat (aka Sydney): And here we have Microsoft’s chatbot. This one is already up and running for you to try (after you join the waitlist)! The goal of Bing Chat is the same as the goal of Google’s Bard. The bot is meant to be an “ask me anything” resource, potentially robust enough to replace its search engine predecessors. Note: Sydney has apparently been tested secretly for the past six years.
Dall-E (and Dall-E 2): This is OpenAI’s new system that can create realistic images and art based on a description that you type. As a demonstration, the text “An astronaut riding a horse in a photorealistic style” was submitted to the model, and this popped out.
Stable Diffusion: This is another text-to-image model that generates photo-realistic outputs based on user-written inputs. For fun, I gave it the prompt “A purple knight riding a large, fire-breathing, chromatic dragon” and it produced the following image:
Imagen: Keeping with the pattern we’ve established, here’s another text-to-image model. This one, however, specializes in photorealism. Its outputs are meant to look more like photography than drawn art. For example, when given the prompt “A photo of a raccoon wearing an astronaut helmet looking out of the window at night,” we arrive at the following:
Deepgram: The best Automatic Speech Recognition model currently on the market (but we’re kinda biased on this point). It is capable of transcribing over 30 languages, both live-streamed and pre-recorded. Organizations like NASA and Spotify use it for things like subtitle generation, audio sentiment analysis, and writing meeting notes.
Stockfish: An open source chess engine. This is the chess bot that fuels much of chess.com’s game analysis and evaluation functionality. Its moves are so advanced that it stumps even the greatest chess grandmasters and international masters.
A.I. Libraries and Frameworks
An AI Framework is a programming tool that provides developers, researchers, and data scientists the building blocks they need to design, train, validate, and deploy AI models. Here is a list of a few:
PyTorch: This framework was developed by Meta. As the name implies, it works in Python, allowing you to train neural networks or even parse audio data. As such, Deepgram, alongside other companies such as Twitter and Facebook, uses PyTorch to maximize productivity and performance when working with AI models.
It’s also worth noting that PyTorch Lightning is an open-source Python library built on top of PyTorch. It’s a lightweight wrapper that makes your code easier to read, log, reproduce, scale, and debug. It takes tough-to-read boilerplate code and helps you turn it into a more "English, please?" version of itself.
TensorFlow: This is a Google-developed AI framework. It brands itself as an end-to-end machine learning platform that allows programmers of “every skill level” to find ML solutions to various problems. It comes with out-of-the-box pretrained models, but also gives you the flexibility to train your own models. TensorFlow largely shows off its capabilities in the world of recommendation systems: think TikTok’s “For You Page,” YouTube’s “Home Page,” or even e-commerce recommendations!
Keras: This is a high-level deep learning API that ships with TensorFlow (you can import it as tensorflow.keras). It’s open-source and focuses on making deep learning models quick to build and train. Netflix, Yelp, Instacart, and Uber all use Keras. Moreover, NASA uses both Keras and Deepgram 😉
Scikit-learn: This library started life as a SciPy toolkit (a “SciKit,” hence the name) and is built on top of NumPy and SciPy, which are designed to be really good at linear algebra. And since AI/ML models are all essentially just massive volumes of linear algebra underneath the hood, scikit-learn becomes an especially sexy tool from a mathematical and efficiency perspective.
Note that Spotify uses both scikit-learn and Deepgram 🥳
Convolutional Neural Network (CNN)
If you recall the definition of Neural Networks from the first part of the A.I. dictionary, then you’re most of the way to understanding convolutional neural networks. The word "Convolutional" refers to convolution, a mathematical operation in which a small grid of numbers (called a filter or kernel) slides across the input and computes a weighted sum at every position. It all boils down to linear algebra, and you won’t really need to know the math this deep… that is, unless you plan to rediscover neural networks from the ground up. If you’re really curious, check out this guide. But long story short, in a CNN, we take some input vector and pass it through a number of hidden layers until some final, desired output (specifically, a probability distribution) is reached.
CNNs are typically used for image recognition, classification problems, and recommendation systems. Though, they can also be utilized in the same system as Recurrent Neural Networks to take on some heftier tasks, like Deepgram’s speech recognition 😉
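For the curious, here is a minimal sketch of the sliding-filter idea behind a single convolutional layer, in plain Python. The toy image, the `edge_kernel` filter, and the function name `convolve2d` are all made up for illustration. In a real CNN the filter values are learned during training, and frameworks like PyTorch do this far more efficiently. (Strictly speaking, the operation below is cross-correlation, which is what deep learning libraries usually mean by "convolution.")

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the patch of the image currently under the filter.
            total = 0
            for di in range(kernel_h):
                for dj in range(kernel_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# A 4x4 toy "image": dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical 2x2 filter that responds to a dark-to-bright vertical edge.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, edge_kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The filter "lights up" (outputs 2) only in the column where the image switches from dark to bright, which is the sense in which a convolutional layer detects local patterns.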
Check out the illustration below for a CNN visualization. (Image source.)
Recurrent Neural Network (RNN)
Recurrent Neural Networks are similar to CNNs in that they also take vectors as input. However, instead of having multiple hidden layers with multiple unique matrices, RNNs simply use the same matrices over and over again, recursively. Hence, recurrent.
Moreover, RNNs are capable of processing ordered sequences of vectors, as opposed to treating each vector in isolation. Thus, they are especially useful in situations where there are many vectors to be processed, especially if those vectors need to be processed in a particular order. One example of such ordering: sentence processing.
In a sentence, every word can be represented by a vector. And sentences are organized in a very particular arrangement. Much like how our brains process sentences by reading one word at a time in the order that they’re written, recurrent neural networks sequentially process sentences one word at a time as well.
Perhaps it’s best explained by example:
Imagine we have an incomplete English sentence. Something like “Let’s eat ___.”
A recurrent neural network will start to process this sentence by “reading” the first word. That is, it will take the vector for the word “Let’s” and multiply it with some already-created matrix U. Think of U as a “reading” matrix. Multiplying a word’s vector by U, the RNN ends up with a better understanding of what that word means in the context of a sentence. And this understanding manifests itself in the form of a vector. We can call this vector h0. The h stands for “hidden,” like a hidden layer.
Great! We’ve officially processed the first word of the sentence. The matrix U multiplied with the vector for the word “Let’s” equals h0.
Now, for our RNN to continue reading, it must not only process the second word, but it must also remember the fact that it read and processed the first word. Here’s how we do that:
The RNN will take the vector for the second word “eat” and multiply it with the same matrix U. And boom, we’ve “read” the second word.
Buuuut, we haven’t taken the first word into account yet. We need to put these words together to fully grasp the sentence we’re building. The way we do that is by taking the hidden layer h0 and multiplying it with yet another matrix V.
Much like how U was a “reading” matrix, we can think of V as a sentence-forming matrix. That is, V is the matrix that reveals how the words in the different parts of a sentence relate to each other.
Now, the last step is to add V(h0) to U("eat") and boom, we have h1: an embedding of the sentence fragment “Let’s eat.”
The RNN will then use the matrices U and V over and over again until we run out of words to process.
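If it helps to see the walkthrough above as code, here is a minimal sketch in plain Python. The two-dimensional word vectors and the values inside U and V are invented for illustration (a trained model would have learned them), and the helper names are hypothetical. To mirror the text, the sketch only multiplies and adds; practical RNNs typically also add a bias term and pass the result through a nonlinearity such as tanh.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def add(a, b):
    """Add two vectors element by element."""
    return [x + y for x, y in zip(a, b)]


def rnn_read(word_vectors, U, V):
    """Return the hidden vectors h0, h1, ... for a sequence of word vectors."""
    hidden_states = []
    h = None
    for x in word_vectors:
        read = matvec(U, x)              # U "reads" the current word
        if h is None:
            h = read                     # h0 = U x0
        else:
            h = add(matvec(V, h), read)  # h_t = V h_(t-1) + U x_t
        hidden_states.append(h)
    return hidden_states


# Made-up 2-dimensional word vectors and matrices, for illustration only.
embeddings = {"Let's": [1.0, 0.0], "eat": [0.0, 1.0]}
U = [[2.0, 0.0], [0.0, 3.0]]
V = [[0.5, 0.0], [0.0, 0.5]]

h0, h1 = rnn_read([embeddings["Let's"], embeddings["eat"]], U, V)
print(h0)  # [2.0, 0.0]
print(h1)  # [1.0, 3.0]
```

Notice that the same U and V are reused at every step, no matter how long the sentence gets.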
Okay, I know that was a lot of numbers and letters to consume at once, so let’s look at some resources to get a more in-depth understanding of RNNs. Techtarget has an incredible explanation that includes the image below. Meanwhile, RitvikMath discusses the concept on video if you’re more of an auditory learner.
Types of Learning
You may have heard of machine learning from the first part of this dictionary. Or elsewhere. But do note that there are many different forms of learning. We’ll go in-depth into the different types of learning below by using the same example for each.
Let’s say we want to train a model to be able to name species of animals based on photos it’s shown. If we show a picture of a chicken, the model should say “Chicken.” If we show a picture of a hippo, the model should say “Hippo.”
Here’s what the different types of training/learning would look like for such a task.
Supervised Learning
For our animal example, let’s say there are 10,000 total unique species in the world. Supervised learning would basically entail having multiple labeled images of each of the 10,000 species. We’d show these images to the model, and whenever it guesses right, we reward it with “points.” Whenever it guesses wrong, we deduct points.
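One way to picture the “points” idea is a perceptron, one of the simplest supervised learners. The sketch below is an assumption-heavy toy: instead of images it uses two hypothetical features per animal (weight in tonnes and whether it has feathers), and it only tells hippos from chickens. When the model guesses right, nothing changes; when it guesses wrong, its weights get nudged toward the correct label.

```python
def train_perceptron(examples, epochs=10, learning_rate=1.0):
    """Train on (features, label) pairs, where label is 0 or 1."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            guess = 1 if score > 0 else 0
            error = label - guess  # 0 when right; +1 or -1 when wrong
            # A wrong guess "costs points": nudge the weights toward the right answer.
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, features):
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return "Hippo" if score > 0 else "Chicken"


# Hypothetical features: [weight in tonnes, has_feathers]. Label 1 = hippo, 0 = chicken.
labeled_examples = [
    ([1.5, 0.0], 1),
    ([1.8, 0.0], 1),
    ([0.002, 1.0], 0),
    ([0.003, 1.0], 0),
]

weights, bias = train_perceptron(labeled_examples)
print(predict(weights, bias, [1.6, 0.0]))     # Hippo
print(predict(weights, bias, [0.0025, 1.0]))  # Chicken
```

The key ingredient is the label attached to every training example: without it, the model would have no way to know whether a guess was right.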
Unsupervised Learning
Again, let’s say there are 10,000 unique species in the world. And again, let’s say we have multiple images of each of the 10,000 species. In unsupervised learning, none of the images would be labeled. As a result, the model will have to resort to separating the images into groups. Based on some algorithm like k-means clustering, the model should group similar images together. The end result should hopefully look like 10,000 different piles of images, where each pile contains images of the same animal (i.e., all hippo images are in a pile, all fox images are in a pile, all penguin images are in a pile, etc.).
While the model in this case is not able to assign names to the animals, the fact remains that a perfect model would be able to recognize and group all animal species.
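Here is a minimal sketch of k-means clustering in plain Python, using made-up 2D points as stand-ins for images. Real k-means usually picks its starting centroids at random; the sketch fixes them by hand (an assumption made purely so the result is reproducible).

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise average of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def k_means(points, initial_centroids, iterations=10):
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins the pile with the nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the middle of its pile.
        for k in range(len(centroids)):
            pile = [p for p, a in zip(points, assignments) if a == k]
            if pile:
                centroids[k] = mean(pile)
    return assignments, centroids


# Six unlabeled points that visibly form two groups.
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
assignments, centroids = k_means(points, initial_centroids=[points[0], points[3]])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

The output is just pile numbers (0 and 1), not names: the algorithm can tell the two groups apart, but it has no idea which one is “hippo.”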
Semi-Supervised Learning
This type of learning is halfway between supervised and unsupervised learning. Here, we still have a large dataset of animal images. However, in this case, only some of the images are labeled while the rest are not. The goal of identifying animals remains the same; however, what makes semi-supervised learning more convenient is the fact that labeling can be time-consuming or expensive. Researchers often resort to semi-supervised learning when obtaining a fully-labeled dataset is impossible or impractical. Thankfully, semi-supervised learning has proven to be effective in many cases.
To help fill in the gaps, models trained on semi-supervised learning algorithms often make assumptions about the underlying data: assumptions like the way the data is distributed (Gaussian, uniform, bimodal, etc.) or assumptions about the existence of data “clustering.” That is, the assumption is that all puppy photos will look very similar to each other and very different from non-puppy photos.
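As a rough sketch of how the clustering assumption can be put to work, the snippet below “pseudo-labels” unlabeled examples by giving each one the label of the nearest labeled group. This is just one simple semi-supervised strategy, not the only one, and the 2D features, labels, and function names are all hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise average of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def pseudo_label(labeled, unlabeled):
    """Give each unlabeled example the label of the nearest labeled centroid."""
    # Step 1: build one centroid per label from the few labeled examples.
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    centroids = {label: centroid(pts) for label, pts in by_label.items()}

    # Step 2: lean on the clustering assumption, so nearby examples share a label.
    return [
        (features, min(centroids, key=lambda lbl: squared_distance(features, centroids[lbl])))
        for features in unlabeled
    ]


# Only two examples carry labels; the other three do not.
labeled = [([1.0, 1.0], "puppy"), ([9.0, 9.0], "not puppy")]
unlabeled = [[1.5, 0.5], [8.0, 9.5], [2.0, 2.0]]

for features, label in pseudo_label(labeled, unlabeled):
    print(features, label)
```

The pseudo-labeled examples can then be added to the labeled set and used for ordinary supervised training.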
Zero-Shot Learning
Now, let’s assume that out of the 10,000 species of animals, we have a dataset that only contains 9,000 of those animals. All images are labeled, and we have multiple images of each animal. However, there are 1,000 animals not present in our training set. Let’s say that dragons are one of the unrepresented animals.
In zero-shot learning—just like supervised learning—we train the model to correctly identify animals on images it has never seen before. However, the model should also be able to sort dragon images into a category called “I don’t know. I haven’t seen this before.” More info here.
In some cases—depending on the training and testing set—the model may even be able to inference,” and it’s really cool.
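The “I don’t know” bucket described above can be sketched with a simple distance cutoff: if a new example is too far from every animal the model has seen, the model declines to name it. Everything here is an assumption made for illustration (the 2D features, the per-animal centroids, and the `max_distance` threshold), not a description of how any particular zero-shot system works.

```python
import math

UNKNOWN = "I don't know. I haven't seen this before."


def classify_or_abstain(features, centroids, max_distance):
    """Return the nearest known label, or UNKNOWN if nothing is close enough."""
    best_label, best_distance = None, math.inf
    for label, center in centroids.items():
        distance = math.dist(features, center)
        if distance < best_distance:
            best_label, best_distance = label, distance
    if best_distance > max_distance:
        return UNKNOWN
    return best_label


# Hypothetical "average" feature vector for each animal seen during training.
known_animals = {"Chicken": [0.0, 1.0], "Hippo": [5.0, 0.0]}

print(classify_or_abstain([0.2, 0.9], known_animals, max_distance=2.0))    # Chicken
print(classify_or_abstain([20.0, 20.0], known_animals, max_distance=2.0))  # the UNKNOWN message
```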
Yes, machines can learn. But what can they do once they learn? Here are some examples:
Sentiment Analysis
Sentiment analysis is the task of assigning emotions to a block of text or some audio. For example, if I have a movie review, a decent language model should be able to label that review as “Positive” or “Negative” based on the contents of the text, even without an explicit “x out of 10” rating.
Binary Classification
Given a list of inputs, a model should be able to sort every entry into two categories. The canonical example of binary classification is a spam filter. Given a user’s inbox, an artificially intelligent spam filter should be able to categorize every email as either (1) spam, or (2) not spam.
Sentiment analysis can be seen as a type of binary classification. Every review is bucketed into either “positive” or “negative.”
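To show what “bucketing” a review looks like in code, here is a deliberately naive sketch that counts positive and negative words. The word lists are made up, and a real sentiment model would learn from data rather than rely on a hand-written vocabulary, so treat this as a toy stand-in for the classifier, not the real thing.

```python
# Hypothetical, hand-picked vocabularies (a trained model would learn these signals).
POSITIVE_WORDS = {"great", "love", "fun", "brilliant"}
NEGATIVE_WORDS = {"boring", "awful", "hate", "dull"}


def classify_review(review):
    """Bucket a review into exactly one of two classes."""
    words = [word.strip(".,!?").lower() for word in review.split()]
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    # Two buckets only, so a tie (score of 0) arbitrarily falls into "Positive".
    return "Positive" if score >= 0 else "Negative"


print(classify_review("I love this movie, it was great fun!"))  # Positive
print(classify_review("What a boring, awful film."))            # Negative
```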
N-ary Classification
Similar to binary classification, but with more than two options. The above-mentioned animal identification model is an example of n-ary classification. More specifically, it’s a 10,000-ary classification model. Given a set of images, it should be able to bucket every picture into the correct “animal” bucket.
Sentiment analysis can be a 3-ary classification problem if we add a “Neutral” bucket alongside “Positive” and “Negative.”
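In practice, an n-ary classifier often produces one raw score per bucket, converts those scores into a probability distribution with the softmax function, and picks the most probable bucket. Here is a minimal sketch of that last step for the three-bucket sentiment case. The raw scores are invented for illustration; in a real model they would come out of the network’s final layer.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the largest score keeps exp() numerically stable.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["Negative", "Neutral", "Positive"]
raw_scores = [0.5, 2.0, 1.0]  # made-up model outputs, one per bucket

probabilities = softmax(raw_scores)
prediction = labels[probabilities.index(max(probabilities))]
print(prediction)  # Neutral
```

Scaling up to the 10,000-ary animal model changes nothing except the length of the lists.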
Hopefully, this dictionary was as helpful as the first one. A.I. truly is just math in its most beautiful form. By representing everything from animal pictures to human language as lists of numbers, we can teach machines to create art, identify new species, and complete our sentences.
In the same way that our brains run on nothing but a bunch of chemicals and that our DNA is made up of four simple molecules (A, T, G, C), artificially intelligent brains are made up of nothing but matrix-oriented hardware and two simple numbers (0, 1).
Crazy how such simple letters and numbers combine to form such complex beauty.
Anyway, see you next time!