The Complete Guide to Punctuation & Capitalization in Speech-to-Text
Do you ever get frustrated when you're trying to dictate a text message or email and your phone keeps capitalizing the wrong words? Or adding extra periods at the end of your sentences? You're not alone! Punctuation and capitalization can be tricky for automatic speech recognition (ASR). In this blog post, we'll explore what punctuation and capitalization mean, how they're used, and some of the problems they present for speech-to-text solutions. We'll also explain what your best option for a speech recognition solution is if you need a transcript that's punctuated and capitalized correctly. To get started, let's take a look at what punctuation and capitalization are, how they're used, and how they differ cross-linguistically.
What is Punctuation?
Punctuation characters are symbols that are used to indicate the structure and organization of a text. In the West, the tradition of punctuation dates from the 3rd century BCE. Before that, texts in languages like Greek and Latin were written without any punctuation or capitalization at all, and even without spaces between words! Today, punctuation marks are used, among other things, to separate words and phrases and indicate when a sentence is ending. In English, the most common punctuation characters are the period ( . ), comma ( , ), question mark ( ? ), and exclamation point ( ! ), but there are many others, including the semicolon ( ; ), colon ( : ), dash ( - ), parentheses ( (...) ), and quotation marks ( "..." ). But other languages have different punctuation standards. In German, for example, quotation marks often appear as ( „...“ ) or ( »...« ). Japanese punctuation, although somewhat inspired by how Western languages punctuate, uses its own symbols: ( 。) instead of a period/full stop, and ( 、) for commas. Other languages have entirely separate traditions of punctuation, marking things that we wouldn't in English or other Western languages. If we look at Tibetan, for example, we find characters that mark the break between syllables ( ་ ), symbols for the end of a section of text ( ། ) and a larger topic ( ༎ ), and a character that marks the start of a text ( ༄ ).
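For anyone processing transcripts in code, this variety matters: a check that only knows English punctuation will miss marks from other writing systems. Here is a minimal sketch in Python, assuming you only need to ask "is this character punctuation?". It uses the standard library's unicodedata module, where every punctuation category code starts with "P". The helper name is_punctuation and the sample list are ours, for illustration only.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    """Return True if Unicode files this character under a punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


# English marks, guillemets, Japanese marks, and the Tibetan marks mentioned above.
samples = [".", ",", "?", "!", ";", ":", "«", "»", "。", "、", "་", "།", "༎", "༄"]

for ch in samples:
    # Print the character, its Unicode category code, and our verdict.
    print(repr(ch), unicodedata.category(ch), is_punctuation(ch))
```

Python's string.punctuation constant, by contrast, only covers ASCII characters, so it would not recognize the Japanese or Tibetan marks.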
What is the Purpose of Punctuation?
The main function of punctuation is to make a text more understandable. As mentioned above, prior to punctuation, words were written in a singlestreamofcharacterslikethis, which made texts challenging to read. Language is, first and foremost, a spoken or signed system of communication, and not a written one. That means we often need some help to make sure that we can understand what's being communicated in writing, and punctuation is one of the tools that we use to do that (along with other things like spelling according to the relevant standard). For example, if someone's speaking out loud, you're unlikely to confuse "Let's eat, Grandma!" and "Let's eat Grandma!"
But in writing, it's the comma that makes the difference. Punctuation helps by providing some sense of the intonation and pacing that would occur if a sentence were spoken out loud. For example, commas are used to mark brief pauses between words or phrases, while periods are used to mark the end of a sentence, what linguists might call an intonational unit. In both cases, these pauses have specific acoustic features that signal what the speaker is doing. Likewise, question marks signal a question, and exclamation points show excitement or emphasize a point; both reflect how a sentence would be pronounced and influence the way that it is read. By including punctuation marks like this in writing, the written word is brought closer to the spoken word.
Organization of a text is another purpose of punctuation. For example, colons often introduce lists and semicolons can separate the items in them, while dashes can be used to separate parts of a sentence and quotation marks are used to set off dialogue or direct quotes from other sources. These features might not match specific pauses or intonation in the same way that a question mark does, but they still help make a text more understandable. For example, English has a particular intonation for lists. If I say "I need three things: milk, bread, and eggs" there's a pause at the colon, then rising intonation on "milk", then a pause, then rising intonation on "bread", then a pause, and then falling intonation on "eggs". This pattern helps our listeners understand that we're listing things off, but it's spread across several words, and doesn't occur only where the colon does.
What is Capitalization?
Capitalization is the process of making a letter capital, or upper case. In English, we typically use capital letters to begin sentences and proper nouns, or for emphasis in casual writing. Proper nouns are the specific names of people, places, things, or organizations. For example, "Susan," "New York City," and "Nintendo" are all proper nouns. Other languages have different standards for capitalization. In German, you capitalize every noun, not just proper nouns. And many languages don't have capital letters at all. The writing systems used for Arabic, Hebrew, Japanese, Chinese, and Hindi, for example, have no "capital" option.
Although capitalization isn't found in all languages, it's still important to consider it along with punctuation when thinking about ASR. To some extent, this is because in many Western languages, the two go together: a period marks the end of one sentence, and a capitalized word marks the start of the next. Additionally, punctuation and capitalization have often been thought of as the same kind of problem and have historically been treated together, so it makes sense to think about them together.
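To make that link between periods and capital letters concrete, here is a minimal sketch in Python. It assumes English-style text that already contains sentence-final punctuation; the function name capitalize_sentences is ours, and this is a toy rule, not how any particular ASR product works.

```python
import re


def capitalize_sentences(text: str) -> str:
    """Uppercase the first letter of the text and of any word that follows . ? or !"""
    return re.sub(
        r"(^|[.?!]\s+)([a-z])",
        # Keep the punctuation and spacing, uppercase only the letter after it.
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


print(capitalize_sentences("drew, thank you so much. have a great eva."))
# prints: Drew, thank you so much. Have a great eva.
```

Notice that "eva" stays lower case. Sentence-initial capitals follow from punctuation, but proper nouns and acronyms need knowledge that a rule like this doesn't have.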
Why Punctuation and Capitalization Matter for Speech Recognition
Typically, ASR systems don't output punctuation or capitalization: the transcripts that you get just consist of lower-case words without any punctuation at all. If you're planning to use your transcripts as the input for machine learning, you might not need to worry about punctuation, since these systems are often happiest working with unformatted text. However, there are a couple of reasons why you might want your text to be formatted. The first is that, without this formatting, the texts can be hard for humans to read. Just take a look at the snippet below, from Deepgram's transcription of NASA's first all-female spacewalk, and you can see how hard it is to figure out what's happening without capitalization and punctuation.
and jessica and christina we are so proud of you i'm gonna do great today we'll be waiting for you here in a couple of hours when you get home i'm gonna hand you over to stephanie now have a great great eva drew thank you so much and our pleasure working with you this morning and i'm working on getting my ev hat open and i can report is open and stowed
If you want something that's readable by humans, capitalization and punctuation are necessary to make things clearer. You can see the same NASA text below, but with capitalization and punctuation (as well as diarization, which breaks up the transcript to isolate different speakers).
[SPEAKER 1:] ...and Jessica and Christina, we are so proud of you. I'm gonna do great today. We'll be waiting for you here in a couple of hours when you get home. I'm gonna hand you over to Stephanie now. Have a great great EVA.
[SPEAKER 2:] Drew, thank you so much. And our pleasure working with you this morning, and I'm working on getting my EV hat open and I can report. Is open and stowed.
It's obvious, even at a glance, how much easier it is to read a text like this than it is to read the kind of stream-of-consciousness we see without punctuation and capitalization. Another use for punctuation in speech-to-text relates to sentiment analysis. Punctuation marks can be used to divide a text into sections so that you can ask "what's the sentiment in this particular chunk of the transcript?" If you have a transcript from an hour-long call, looking at sentiment in small chunks gives you a more granular picture than looking across the whole call, and punctuation makes that chunking easier to do. So how do we get ASR transcripts that have punctuation and capitalization? Let's take a look.
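Before we do, here is a quick aside on the chunking idea for sentiment analysis. This is a minimal sketch in Python, assuming a transcript that already has sentence-final punctuation; the function name split_into_sentences is ours. The sentiment scoring itself is left out, since it depends on whichever tool you use.

```python
import re


def split_into_sentences(transcript: str) -> list[str]:
    """Split a punctuated transcript after each . ? or ! that is followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", transcript.strip())
    # Drop empty strings, which appear if the transcript itself was empty.
    return [p for p in parts if p]


transcript = "I'm gonna hand you over to Stephanie now. Have a great great EVA."

for chunk in split_into_sentences(transcript):
    # Each chunk could now be passed to a sentiment model of your choice.
    print(chunk)
```

Without punctuation in the transcript, there would be nothing for a rule like this to split on, and you'd have to fall back on fixed-length windows of words.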
How Punctuation in Automatic Speech Recognition Works
If you need punctuation in your transcripts, what are your options? ASR providers typically have two ways of getting punctuation into a transcript. The first is a separate punctuation and capitalization model that runs after the text has already been generated by the speech-to-text model. Because punctuation is often a reflection of how something would be said out loud, this method, based only on the text, can create some less-than-useful outputs. This can be shown with a simple example. What is the correct punctuation for the sentence below?
sam ate dinner
You might have said "a period", which is probably what one of these post hoc models would have said. But is this a question or a statement? From the text alone, without any context, it's impossible to tell. If I played the audio for you, though, you'd immediately understand whether the sentence is a question or a statement based on the speaker's intonation. Determining the correct punctuation here is guesswork with the words alone; it's much easier if you have access to the audio.
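The difference can be illustrated with a toy sketch in Python. Everything here is hypothetical: the function names, the short list of question-starting words, and the final_pitch_slope parameter (imagine it as the change in pitch over the last word, positive for rising) are assumptions made up for this illustration. Real systems learn these patterns from data and don't use hand-written rules like these.

```python
def punctuate_text_only(words: str) -> str:
    """Toy text-only rule: a question word at the start means '?', anything else gets '.'."""
    question_starters = {"who", "what", "when", "where", "why", "how", "did", "do", "does", "is", "are"}
    tokens = words.split()
    first = tokens[0] if tokens else ""
    mark = "?" if first in question_starters else "."
    return words.capitalize() + mark


def punctuate_with_pitch(words: str, final_pitch_slope: float) -> str:
    """Toy audio-aware rule: rising pitch at the end of the utterance suggests a question."""
    mark = "?" if final_pitch_slope > 0 else "."
    return words.capitalize() + mark


# The text-only rule has a single answer for these words, however they were spoken.
print(punctuate_text_only("sam ate dinner"))          # Sam ate dinner.

# With a (hypothetical) pitch measurement, the same words can come out two ways.
print(punctuate_with_pitch("sam ate dinner", 12.0))   # Sam ate dinner?
print(punctuate_with_pitch("sam ate dinner", -8.0))   # Sam ate dinner.
```

The point isn't the rules themselves. A function that only sees the words must always return the same punctuation for the same words, while one that also sees something about the audio can tell the two readings apart.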
This leads us to the second way to add punctuation to a transcript. If you're using an end-to-end deep learning system as part of the process to generate your transcript, it's possible to have it output punctuation and capitalization at the same time as the words. A model that creates both text and punctuation can make decisions based on acoustic information, which is often the difference between good and bad punctuation. Let's consider another example.