Word Vectorization: How LLMs Learned to Write Like Humans
Jose Nicholas Francisco
Before you ask: no, I did not get a computer to write this article for me. Every word here is human-typed. That said, it won’t be long before artificial intelligence (AI) gives me a run for my money.
That’s right. The smart kids of this generation are outsourcing their book reports, their History class analyses, and sometimes even their college admissions essays to artificial intelligence.
The fancy term for an AI-author is “Generative Model.” That is, it is a piece of software that creates (read: generates) never-before-seen content based on the prompts that you give it. And yes, AI researchers call their software “models”; Python code can be pretty attractive after all:
from time import sleep

def flirt(username, sleepTime):
    username = username.strip()
    messageA = "Hey " + username + ", you're pretty cute."
    print(messageA)
    while True:
        print("Do you think I'm cute too? *bats eyes*")
        sleep(sleepTime)
But how do these machines learn to write? How do they know what good, coherent writing looks like? Humans need years of education to write a proper argumentative essay. How do you train a machine—who knows nothing but numbers—to write an essay worthy of an A+ in English? Or history? Or even philosophy?
Answer: You turn the words into numbers.
Computers love numbers. But they can’t be just any numbers. To teach a computer to write, we have to pick very specific, very calculated numbers to represent our words.
Right now, if I type the word “dog” into the same place I wrote the flirt() code above, all the computer would see are the Unicode numbers that represent each letter. Specifically, it would see the numbers 100, 111, and 103, in that order. Or, if we dug a little deeper, we’d see the series of zeroes and ones that make up the word "dog": 110010011011111100111
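If you want to see those numbers for yourself, here is a minimal sketch using Python's built-in `ord` and `format` functions. The variable names are my own choice for illustration.

```python
word = "dog"

# ord() returns the Unicode code point of a single character.
code_points = [ord(letter) for letter in word]
print(code_points)  # [100, 111, 103]

# format(n, "b") writes an integer in binary, without a "0b" prefix.
bits = "".join(format(n, "b") for n in code_points)
print(bits)  # 110010011011111100111
```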
But the word “dog” is extremely versatile. A dog can take on many forms. Puppies, wolves, coyotes, Snoopy, Snoop Dogg, and so on. An investigative journalist may engage in “dogged” pursuit of a story. One might “dog on” one’s friends as an exercise in jovial criticism. The list goes on.
It’s a bit difficult to encapsulate all that a dog can be with three measly numbers. So what can we do? Well, for starters, we can use more numbers... way more than three. But we have to use numbers. Computers understand nothing else. That little green microchip inside your laptop really is just a fancy calculator, after all.
Okay, so we’re going to use a lot of numbers to represent the word “dog”. But which numbers do we use? And in what order do we use them? Well, here’s what a bunch of smart people figured out a long, long time ago:
“You know a word by the company it keeps.”
But what does that mean?
Well, the intuition is this: When two words appear together often, it’s likely they’re related to each other in some manner. For example, the words “hot” and “cold” frequently appear together. You find them in cookbooks to discuss oven temperature, in medical journals to discuss patients’ vitals, and even in gambling books to describe the state of a deck of cards.
If a document, web page, or book contains the word “hot,” it is decently likely to also contain the word “cold.”
The word “hot” can, however, also describe a person’s attractiveness. So the words “hot” and “sexy” frequently appear together in the same web pages as well. Likewise, “hot” can appear next to the words “guy” or "girl."
And just to hammer the point home, the word “hot” can also describe spiciness. So the word “hot” appears around the words “pepper” and “sauce” pretty frequently too.
So here’s what we know: The word “hot” frequently appears near the words “cold,” “attractive,” and "pepper." But we also know that the word “hot” doesn’t appear too often around the words “raincoat,” “counterclockwise,” or "umlaut."
What information do we gain from this? Well, if we didn’t already know what the word “hot” meant, we could deduce that this little, three-letter word has some consistent relationship to all the other words we mentioned. “Hot” has a strong relationship with “cold” and “pepper,” while it has a weak relationship with “umlaut.”
Thus, if two words appear in multiple documents and web pages and books and speeches together, it’s likely they have something to do with each other. And for every word in the dictionary, if we find its common companion words, we can get pretty close to figuring out its meaning.
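Here is a rough sketch of that companion-counting idea. The four "documents" are made up for illustration, and `companions` is a hypothetical helper name, not a function from any library.

```python
from collections import Counter

# A tiny, made-up "internet": each string stands in for one document.
documents = [
    "serve the soup hot or cold",
    "hot pepper sauce is very spicy",
    "run hot and cold water over the pepper",
    "wear a raincoat in cold rain",
]

def companions(target, docs):
    """Count how many documents containing `target` also contain each other word."""
    counts = Counter()
    for doc in docs:
        words = set(doc.split())  # a set, so each word counts once per document
        if target in words:
            counts.update(words - {target})
    return counts

hot_companions = companions("hot", documents)
print(hot_companions["cold"])      # 2
print(hot_companions["pepper"])    # 2
print(hot_companions["raincoat"])  # 0
```

Even in this toy corpus, “cold” and “pepper” keep company with “hot,” while “raincoat” never does.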
A Word Is Known By Its Surroundings
So how does this principle help us turn words into numbers? Well, for a given word, we can create a list. This list will contain one number for every word in the dictionary. The first number in the list will correspond with the word “aardvark.” The second number will correspond with “ab.” And so on and so forth.
The number we use is the percentage of documents in which both words appear together, written as a decimal. For example, let’s say we’re creating a numbers list for the word “dog.” To create it, we’re first going to gather every document on the internet and find the ones that contain the word “dog.” Then we’ll find how many of those documents also contain the word “aardvark.” Let’s say that 5% of all “dog”-containing documents contain the word “aardvark.” Then the first number in our list would be 0.05. Then let’s say that 12% of the documents containing “dog” also contain the word “ab.” The second number in our list would be 0.12. We’d continue this calculation until our list is complete. (Note: I skipped the word “a” for ease of explanation.)
And boom, we have a very detailed, very specific, very calculated list of numbers to represent “dog.” Notice that each entry in this list represents a specific word's relationship with “dog.” And also notice that every other word in the dictionary will have its own list, and those lists will also contain some number attached to the word “dog”. The fancy term for these lists of numbers is “vectors.”
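The recipe above can be sketched in a few lines. This assumes four invented documents and a four-word "dictionary" instead of the whole internet, and `word_vector` is a hypothetical name I picked for the example.

```python
# Made-up documents; the real recipe would use every document available.
documents = [
    "the dog chased the aardvark",
    "my dog loves ab workouts and long walks",
    "a dog barked at the mail carrier",
    "the aardvark slept all day",
]

def word_vector(target, docs, vocabulary):
    """For each vocabulary word, the fraction of documents containing
    `target` that also contain that word."""
    doc_words = [set(doc.split()) for doc in docs]
    with_target = [words for words in doc_words if target in words]
    if not with_target:
        return [0.0] * len(vocabulary)
    return [
        sum(word in words for words in with_target) / len(with_target)
        for word in vocabulary
    ]

# A miniature dictionary, in alphabetical order.
vocabulary = ["aardvark", "ab", "dog", "walks"]
dog_vector = word_vector("dog", documents, vocabulary)
print(dog_vector)
```

Three documents contain “dog,” and one of them also contains “aardvark,” so the first entry is one third. The entry for “dog” itself is 1.0, since every “dog” document contains “dog.”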
This concept may seem weird at first, but it works extremely well in practice. Let’s see why:
Because our words have been transformed into lists of numbers, we (and our computers) can basically treat them like numbers. As a result, we can add words together, subtract them, and so on.
The canonical example of word-math involves royalty. It turns out that if a computer takes the word “King”, then subtracts the word “man” and adds the word “woman”, the output is “Queen”.
Here are some other fun examples:
If you take the word “ice,” subtract the word “solid” and add the word “liquid,” you get “water.”
If you take the word “Tokyo,” subtract the word “Japan,” and add the word “France,” you get “Paris.”
If you take “Paris,” subtract the word “France” then add “U.S.A.,” you get “Washington D.C.”
If you add the words “beautiful” and “smart” and “talented” together, you get me.
Okay, that last one was a joke, but you get the idea.
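To make word-math concrete, here is a toy sketch. The three-number vectors below are invented by hand so the arithmetic is easy to follow; they are not real, learned vectors, and `analogy` is a hypothetical helper. Candidates are compared by cosine similarity, one common way to measure how closely two vectors point in the same direction.

```python
import math

# Hand-made vectors: [royalty, masculinity, femininity].
# These values are invented for illustration only.
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(start, minus, plus):
    """Compute start - minus + plus, then return the closest remaining word."""
    target = [
        s - m + p
        for s, m, p in zip(vectors[start], vectors[minus], vectors[plus])
    ]
    # The three input words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (start, minus, plus)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```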
The concept of creating lists of numbers is crucial to the world of AI. These vectors (sometimes called “embeddings”) form the basis of tons of large language models out there, including and especially the ones that write essays for us.
In other words, a bunch of people who are really good at math designed computer programs to utilize these word-vectors in extremely creative ways. Indeed, the creation of a language AI is a beautiful example of the intersection between creativity and math. After all, you need to be really creative with numbers to make use of these new word-vectors.
Here are some brief examples of this creativity come to life.
The BERT model really takes the expression “We finish each other’s sentences” to heart. Here’s the intuition behind it:
If I say the first half of a phrase, you should be able to complete it (assuming you have a good amount of experience in the English language). Let’s try it out. Complete the following sentences and phrases with a single word.
“Please and thank ___”
“The weather today is warm and sunny. A perfect seventy ____.”
“Dogs say woof. Ducks go ____.”
The answers are “you,” “degrees,” and “quack,” in that order. Though if you said “Fahrenheit” for the second sentence, that’s acceptable too. But that’s a nuance. The point is this: BERT knows how to finish your sentences. In the same way we turned words into vectors, BERT can turn sentences into vectors as well. And through a bit of clever vector math (read: “linear algebra”), the model can predict the missing word in a sentence.
Note also that the '____' doesn't have to go at the end of the sentence. BERT can fill in the blanks (read: 'masked words') whether they're at the beginning, middle, or end.
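BERT itself is a large neural network, so the sketch below is not BERT. It is a simple counting stand-in, built on a made-up five-sentence corpus, that shows the core idea of guessing a blank from the words on both sides of it. The `fill_blank` function and the `<start>` and `<end>` markers are assumptions made for this example.

```python
from collections import Counter

# Made-up training sentences; a real model reads vastly more text.
corpus = [
    "dogs say woof",
    "ducks go quack",
    "ducks go quack all day",
    "cows go moo all day",
    "please and thank you",
]

def fill_blank(left, right, sentences):
    """Return the word seen most often between `left` and `right`, or None."""
    counts = Counter()
    for sentence in sentences:
        # Markers let the blank sit at the very start or end of a sentence.
        words = ["<start>"] + sentence.split() + ["<end>"]
        for i in range(1, len(words) - 1):
            if words[i - 1] == left and words[i + 1] == right:
                counts[words[i]] += 1
    return counts.most_common(1)[0][0] if counts else None

print(fill_blank("go", "<end>", corpus))    # blank at the end: quack
print(fill_blank("<start>", "go", corpus))  # blank at the beginning: ducks
print(fill_blank("and", "you", corpus))     # blank in the middle: thank
```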
BERT has been used to aid Google’s search query functionality. It is also pretty good at answering SAT reading questions. That is, it can read a literature passage and answer multiple-choice questions about that passage.
BERT has many descendants like RoBERTa and DistilBERT, but you can read about the original model here.
GPT-3 is a generative model. Again, quite a bit of math occurs underneath the hood, but the punchline is as follows:
Because we’ve successfully transformed words (and, by extension, sentences) into numbers, we’ve essentially given large language models the ability to read. All a model has to do is process one word at a time, while also remembering a handful of sentences it recently read. (This memory is called “Attention” and its mechanics require their own article.)
In any case, we’ve shown GPT-3 loads of documents. It’s quite the bookworm. And because it has read a lot of human words, it knows how to write like a human as well. And it’s written everything from essays to this Guardian article.
Making an AI read tons and tons of literature, documents, and articles is called “training.” If we train GPT-3 by exposing it to nothing but Shakespeare and his contemporaries, then GPT-3 will speak like Shakespeare. If we train GPT-3 on nothing but TikTok comments, then it will speak like teenagers on the internet. And if we were somehow able to show GPT-3 everything you’ve said, written, or thought, then it would be able to write exactly like you.
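To see how the training text shapes the output, here is a deliberately tiny stand-in. It is not how GPT-3 works: it only remembers which word most often follows the current one (a so-called bigram model), and both training texts are short examples I made up. Still, the same code "writes" differently depending on what it has read.

```python
from collections import Counter, defaultdict

def train(text):
    """Count, for every word, which words follow it in the training text."""
    followers = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def write(model, first_word, length):
    """Start from `first_word` and keep appending the most common follower."""
    output = [first_word]
    for _ in range(length - 1):
        options = model.get(output[-1])
        if not options:
            break  # the model never saw anything after this word
        output.append(options.most_common(1)[0][0])
    return " ".join(output)

shakespeare = train("to be or not to be")
internet = train("to be fair that slaps to be fair")

print(write(shakespeare, "to", 6))  # to be or not to be
print(write(internet, "to", 5))     # to be fair that slaps
```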
There are numerous models out there trained on a plethora of texts. If you want to find the Shakespeare AI, simply look up "Generative models for Shakespeare." Or, as the title of this article suggests, if you want AI to write your essays for you, simply look up "AI essay generation" and see what the programmers of the world have created for you to use.
… And yes, if you train an AI on data from thousands upon thousands of nonfiction explanatory articles on technology, it can write a blog post like this.
Or am I?