How Do Language Models Handle Obscure Words?
Galumphing.
Any clue what that means? Unless you’ve read a specific poem, read extensive 19th century literature, or googled it (but hold off on that for now), you’re likely grasping at straws.
No worries, though; you’ll have a better shot at deciphering this:
Now you can probably infer that “galumphing” denotes some action—but exactly what kind of action remains murky. Maybe we can glean a bit more meaning with a bit more context; here are a few more lines from Lewis Carroll’s Jabberwocky:
When we encounter a word beyond our repertoire, we take clues from the words preceding and succeeding our mystery word. Linguist John Firth succinctly described this linguistic phenomenon as "You shall know a word by the company it keeps" (and a sentence by its surrounding sentences, a paragraph by its neighboring paragraphs, and so on). If you read Jabberwocky in its entirety (don’t worry, it’s short), you’ll get sufficient background to understand that “galumphing” denotes some movement someone makes after beheading a mythical creature (you might have extracted meaning via another method, which we’ll touch on later, but context is likely your most helpful tool here).
This lexical-semantics version of peer pressure (words influencing their neighboring words’ meanings) inspired word embeddings, a crucial component of many modern language models (LMs). For computers to represent words, we tokenize (i.e., split) sentences into words and then map those words to numbers stored in vectors (i.e., embed the words). With these embeddings, most modern LMs employ deep learning-based statistical methods to guess what word is most likely to come next, given some sequence of prior words (or what a masked word is, given the words that surround it).
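To make that tokenize-then-embed step concrete, here is a minimal sketch using a made-up ten-word vocabulary and randomly initialized vectors, not anything a real model has learned:

```python
import numpy as np

# Whitespace tokenization plus an embedding lookup table. The vocabulary and
# 8-dimensional vectors are made up for illustration; real LMs learn far larger
# vocabularies and higher-dimensional embeddings during training.
sentence = "you shall know a word by the company it keeps"
tokens = sentence.split()

vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(vocab), 8))  # one row (vector) per known word

# Map each token to its vector; the model consumes these numbers, not the raw text.
embedded = np.stack([embedding_table[vocab[token]] for token in tokens])
print(embedded.shape)  # (10, 8): ten tokens, eight dimensions each
```

A trained LM would learn those vectors (and a far larger vocabulary) during training, nudging words that keep similar company toward similar vectors.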
To gain this capability, LMs train on large corpora (collections) of text or speech. If an LM encounters or hears a word enough times within its training corpus, the model learns that word’s likely neighboring words. But what does a trained LM do at inference time when it stumbles upon a word that’s not represented in its word embeddings? How can an LM possibly guess what company an unknown, unembedded word keeps?
Out-of-Vocabulary Words
Words that LMs see during training are part of their “vocabulary”; words they don’t see during training are “out-of-vocabulary” (OOV) words. Figuring out how to make LMs better handle OOV words at inference time is an active and important research area for several reasons:
Language is fluid—we’re constantly tweaking and outright inventing words, from slang to technical jargon and everything between.
Typo variants are endless.
Language is rife with specialized, sparsely used words that might not appear in training data but that we want LMs to grasp nonetheless (e.g., the medical jargon that doctors use or the extinct terms that historians study).
Computing constraints prevent us from designing LMs that represent and remember every word from a language (that’d be too many embeddings). Instead, models only learn and embed words that show up above some threshold number of times in their training data, capping their vocabulary size to match the computing limitations. But even if LMs could learn every single word, frequently retraining models—especially resource-intensive large language models (LLMs)—just to learn new words is too costly (though frequent fine-tuning might be feasible). If frequent retraining isn’t an option, what can LMs do about OOV words?
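Before we answer that, here is a minimal sketch of the frequency cutoff just described; the three-line corpus and the min_count threshold are invented for illustration:

```python
from collections import Counter

# Frequency-based vocabulary cutoff. The tiny corpus and min_count threshold are
# invented for illustration; real models count over billions of tokens.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a frumious bandersnatch sat nearby",
]
min_count = 2

counts = Counter(word for line in corpus for word in line.split())
vocab = {word for word, count in counts.items() if count >= min_count}

print(vocab)  # {'the', 'sat', 'on'}: rarer words like 'frumious' fall out
```

Anything that falls below the threshold never gets an embedding, which is exactly how words end up out of vocabulary in the first place.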
Replacing Unknowns with Likelihoods
A rudimentary method of handling OOV words is to label them as <UNK> tokens (or some similar token) and then replace each <UNK> token with the word that most likely stands in its place (i.e., the nearest neighbor in the word embeddings’ vector space, given the company that specific <UNK> token keeps).
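Here is a toy sketch of that idea, with hand-made three-dimensional vectors standing in for learned embeddings: average the known context words’ vectors, then pick the closest vocabulary word that isn’t already in the sentence. Real systems work with learned, high-dimensional embeddings, but the mechanics are the same nearest-neighbor lookup:

```python
import numpy as np

# Toy sketch of <UNK> replacement: average the known context words' vectors,
# then pick the most similar vocabulary word that isn't already in the sentence.
# These 3-D vectors are hand-made for illustration, not learned embeddings.
embeddings = {
    "he":       np.array([0.1, 0.2, 0.1]),
    "went":     np.array([0.8, 0.1, 0.3]),
    "back":     np.array([0.7, 0.2, 0.4]),
    "walking":  np.array([0.9, 0.1, 0.2]),
    "sleeping": np.array([0.1, 0.9, 0.8]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sentence = ["he", "went", "<UNK>", "back"]
context_mean = np.mean([embeddings[w] for w in sentence if w in embeddings], axis=0)

# Rank candidate replacements (words not already in the sentence) by similarity
# to the averaged context and keep the best one.
candidates = [w for w in embeddings if w not in sentence]
best = max(candidates, key=lambda w: cosine(embeddings[w], context_mean))
print(best)  # 'walking' with these made-up vectors
```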
You can probably imagine how swapping an <UNK> word with its most likely replacement might stray from that unknown word’s meaning, but, to demonstrate this, let’s ask ChatGPT about that sentence from Jabberwocky (in isolation, since it has already memorized the entire poem):
ChatGPT sensibly replies:
These are all fine guesses for a generative LLM—a model largely designed to fill in the blanks with probable answers. But it’s obvious how this method can falter at deriving OOV words’ meanings (perhaps unsurprising, given that guessing OOV words’ likely replacements and guessing OOV words’ semantics are quite different tasks). For some applications, though, capturing OOV words’ meanings is more important than producing realistic-sounding language (e.g., machine translation or screen readers). Thankfully, we have more sophisticated approaches that help out here.
Mincing Words into Morphemes
Perhaps Lewis Carroll’s “galumphing” caught on among other 19th century authors because its blend of “galloping” and “triumphing” is intuitive enough (given sufficient surrounding context) to convey meaning. We can sometimes piece together a previously unencountered word’s meaning because we don’t rely solely on contextual clues; we also find parts of words informative.
These subwords are morphemes—the bare minimum components of words that convey meaning. The morphemes “un,” “book,” and “ed,” for example, each contribute their own portion of meaning to the whole word “unbooked.” Similarly, phonemes are the smallest segments of spoken words. Since “book”, “cook”, and “took” have different meanings based on how their first letters are pronounced, we consider /b/, /k/, and /t/ phonemes (as well as /oo/ and /k/). The important part here is that written and spoken language can be decomposed into parts smaller than the word level.
Popular LMs like BERT and GPT employ methods like WordPiece and Byte Pair Encoding (BPE) that roughly capture (some) words’ decomposability by splitting (i.e., tokenizing) at the subword level. Unigram is another widely used subword tokenization model. Subword tokenizers can end up with tokens only a few characters long or keep an entire word as a single token. Other tokenizers go further—all the way to the character level. CharacterBERT, for example, modifies the traditional BERT architecture by using a convolutional neural network that focuses on each character in a word.
Though their approaches can differ significantly, subword tokenizers share something in common: they all allow LMs to better handle OOV words by representing words at closer to morpheme-level granularity than word-level granularity. If, for example, an LM never encountered “unbooked” during training but did see at least one of “un,” “book,” or “ed,” it’ll have a better chance at guessing what “unbooked” means at inference time than if it had not embedded any morphemes within “unbooked.” While it’s obvious how subword tokenization can help LMs guess OOV words’ meanings, subword tokenization’s limitations are less apparent.
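You can see this splitting in action with a WordPiece tokenizer. The sketch below assumes the Hugging Face transformers package and the pretrained bert-base-uncased vocabulary; the exact pieces you get depend on that learned vocabulary, so treat the commented output as indicative rather than guaranteed:

```python
# Requires: pip install transformers (downloads the bert-base-uncased vocab on first use)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# WordPiece keeps common words whole and splits rarer words into known pieces,
# marking word-internal pieces with '##'. The exact splits depend on the learned
# vocabulary, so your output may differ from the example noted below.
for word in ["book", "unbooked", "galumphing"]:
    print(word, "->", tokenizer.tokenize(word))
# e.g., "unbooked" typically comes out as 'un' plus '##'-prefixed pieces of the
# remainder, roughly tracking its morphemes even though the whole word was never embedded.
```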
No One Tokenizer Fits All Languages
Subword tokenization can work well for languages with morpheme-to-word ratios above one-to-one (e.g., many Indo-European languages), where morphemes can be joined together like Legos to form words (sometimes quite long). “Supercalifragilisticexpialidocious,” for example, which you may recognize from the film Mary Poppins, has enough recognizable morphemes for you to parse out some meaning even if you’d never encountered the word. Though English can sometimes string together many units of meaning into a single word like this, it has a fairly low morpheme-to-word ratio compared to some languages.
At the high end of the morpheme-to-word ratio continuum are polysynthetic languages (e.g., Inuktitut or Kabardian), where the meaning that might be represented within one English sentence is expressed within one long word (by English standards). The single Inuktitut word “tusaatsiarunnanngittualuujunga,” for example, means something like "I can’t hear very well." At the opposite end of this continuum are isolating languages, like Vietnamese, with a low (near one-to-one) morpheme-to-word ratio. Complicating matters further, languages like English, widely considered an analytic language (somewhere between Vietnamese’s low and Inuktitut’s high morpheme-to-word ratios), can sometimes have isolating (approximately a one-to-one morpheme-to-word ratio) sentences like, “Did you see the bat fly over me?” But that’s not all. Languages also differ in how cleanly their morphemes join. Some languages join morphemes along distinct, splittable lines (agglutinating); others join morphemes along unclear, indivisible boundaries (fusional).
Notice how languages with different morphological structures tend to be split into different numbers of tokens (e.g., English averages a few tokens per word, while a single Inuktitut word is divided into 13 pieces). With this small sample size we get only a glimpse of the subword tokenization differences across language types, but others have tested these differences more definitively.
Machine learning engineer Yennie Jun, for example, tested how OpenAI’s BPE-based tiktoken tokenizer handled several languages, finding significant differences in how each language tokenized the same 2033 texts from MASSIVE, a multilingual dataset. English required the fewest tokens (a median of 7 per text), while Burmese required the most (a median of 72 per text).
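You can run a small-scale version of that comparison yourself. The sketch below assumes the tiktoken package and its cl100k_base BPE encoding, and reuses the English phrase and Inuktitut word from above; exact counts depend on the encoding you pick:

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's BPE encodings

samples = {
    "English": "I can't hear very well.",
    "Inuktitut": "tusaatsiarunnanngittualuujunga",
}

# Roughly the same meaning, very different token counts: the English sentence
# splits near word boundaries, while the single Inuktitut word is chopped into
# many short byte-pair pieces. Exact counts depend on the encoding used.
for language, text in samples.items():
    print(f"{language}: {len(enc.encode(text))} tokens")
```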
Studying this variability in more depth, Park et al. showed that languages’ morphological complexity positively correlates with language modeling difficulty. They also found that linguistically-informed segmentation aimed at capturing a language’s morphology often outperformed pure statistics-based segmentation approaches (like BPE) at dealing with OOV words. From this finding, Park et al. recommend fusing linguistic aspects beyond morphology into LMs to improve their performance in other realms.
Modeling Languages’ Nuances
Preventing LMs from “galumphing back” when they encounter OOV words likely requires heeding Park et al.’s advice. We probably need to model languages in ways that better capture the complexity and diversity of human languages. University of Utah linguist Lyle Campbell estimates there are around 350 independent language families (a set of languages related to one another, including sets containing a lone language with no known relatives—a language isolate). If we want to chip away at NLP’s nagging performance divide between high-resource and low-resource languages, we can’t avoid reworking our existing one-tokenizer-fits-all approaches. Independent language families have structures varied enough to warrant tokenizers tailored to each family’s morphological structure. This approach should help LMs better grasp OOV words and, in turn, help low-resource LMs catch up.