
Reviving Lost Tongues: How AI Battles Language Extinction

By Tife Sanusi
Published Aug 17, 2023
Updated Jun 13, 2024

Every two weeks, a language dies. More than 7,000 languages are spoken around the world, and many linguists agree that at the current rate, at least half of them are at risk of extinction within the next century. Because language is the bedrock of most cultures, and the knowledge of those cultures is passed down through generations in it, the histories and traditions of hundreds of cultures are slowly disappearing along with the languages that preserve and transmit them.

The state of the world’s languages today is largely a result of globalization and colonialism. Over the past few centuries, as culturally dominant languages like English became more influential, more people migrated to regions where their languages were not spoken, and those who stayed slowly adopted the international languages of trade, contributing to the endangerment of minority languages. It doesn’t help that as technology evolved, it brought strong incentives to communicate in one of these dominant languages. Even now, more than 50% of all websites use English as their content language. Partly as a result, about half of the world’s spoken languages now have 10,000 or fewer speakers. But that same technology is offering a way to save endangered languages from extinction.

Nigeria As a Language Paradise

Nigeria currently has one of the largest concentrations of language diversity in the world, with over 500 languages spoken in the country. Africa, in general, is home to around 1,000 to 2,000 languages, or about one-third of the world’s languages. The diversity of languages in Nigeria reflects the diversity of the country's cultures, art, histories, and traditions. However, as is the case in many other countries, a lot of these languages are at risk of extinction.

Like many indigenous languages, most languages in Nigeria have no written form, and as the number of speakers decreases, each language becomes more likely to disappear. The issues affecting indigenous languages all over the world (globalization, migration, and the dominance of major convergent languages) are also increasing the rate of language death in Nigeria. Between the 1940s and today, ten Nigerian languages have gone extinct, and about one hundred more are either in trouble or dying. According to Dr. Kola Adekola from the University of Ibadan’s Department of Anthropology and Archeology, this is because of a mix of globalization and a general inclination among the younger, more technology-oriented generation toward more dominant cultures.

“It isn’t just languages, entire cultures are being lost because they are considered archaic by new generations who communicate entirely in English. To solve this problem, we would need to use a mix of local and technological solutions,” he says.

AI in Language Preservation

As language technologies advance, there has been a lot of success in using artificial intelligence to preserve endangered indigenous languages worldwide. In 2018, Te Hiku Media, a Māori-owned nonprofit radio station, built language technology, including automatic speech recognition (ASR) and speech-to-text, in an effort to prevent their language from shrinking further, becoming the first to build ASR tools for an indigenous language. Since then, attempts have been made to preserve other endangered languages with AI. AI Pirinka is being used to preserve Ainu, a language isolate spoken by the Ainu people, the indigenous inhabitants of Hokkaido in northeastern Japan. Woolaroo, a project by Google, is also using machine learning to teach and preserve languages like Yiddish and Louisiana Creole.

This doesn’t come without its own challenges, however. Many indigenous languages are under-resourced and not supported by NLP tools, especially since most NLP work is Indo-Eurocentric in its preprocessing, training, and evaluation algorithms. African languages, in particular, are at risk of being left behind because of a lack of resources, including datasets that can be used for training ML models. Many datasets involving Nigerian languages are either incorrect or mislabeled, which, in turn, results in inaccurate models.

Ethics of AI in Language Preservation

Perhaps the biggest obstacle in preserving endangered languages is the potential for exploitation of indigenous people. Many endangered languages are at risk of extinction due to cultural replacement and expansionism, so the people who speak them are understandably wary of outside interventions. In the case of Te Hiku Media, it was important that the only people who profit from their language are Māori people themselves. For them, protecting their data means protecting thousands of years of traditional knowledge. For Dr. Adekola, the rewards outweigh the risks.

“There is a crisis, and the truth is, if nothing is done, we will lose so much history and knowledge. If AI is a way to prevent that, we need to embrace it while making sure that our cultures are being respected,” said Adekola.

Language research, especially with endangered languages, can be exploitative if ethical standards are not firmly established and upheld. When working to preserve languages, it is imperative that the agency of the people who speak them is respected and extractive practices are discouraged. This means adopting a more conscious approach to language preservation and working hand-in-hand with collaborators from the community. 

There are also some reservations about the capacity of AI to fully understand the depth of indigenous languages. This is part of a larger conversation on the ability of NLP to actually comprehend language as used by humans. Many indigenous languages rely on tone, tone marking, vowel harmony, and context, features that are missing in most dominant languages. This is especially difficult since most of these languages are purely oral, without any written form, making it challenging to preserve them without sacrificing the non-written context many of them have. Some communities, like the Shoshone community in the U.S. Southwest, are rejecting efforts to standardize their language in written form.
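To make the point about tone marking concrete, here is a minimal Python sketch of an accent-stripping step of the kind often applied when normalizing English or other European text. It is not taken from any particular toolkit: `strip_diacritics` is a hypothetical helper name, and the Yoruba words and their approximate English glosses are assumptions chosen purely for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD), then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Illustrative Yoruba words that differ only in their tone marks.
# Escapes are used so the exact code points are unambiguous; the
# glosses are approximate and included only as an assumption.
words = {
    "\u1ecdk\u1ecd": "husband",        # o-dot k o-dot, no tone mark
    "\u1ecdk\u1ecd\u0301": "hoe",      # final vowel carries an acute accent
    "\u1ecdk\u1ecd\u0300": "vehicle",  # final vowel carries a grave accent
}

folded = {word: strip_diacritics(word) for word in words}
for word, gloss in words.items():
    print(f"{word} ({gloss}) -> {folded[word]}")

# Count how many distinct strings survive the folding step.
distinct_after_folding = set(folded.values())
print(len(words), "words before,", len(distinct_after_folding), "after")
```

In this sketch, words that a speaker would treat as different all fold to the same string, so a pipeline built around such a step could lose exactly the distinctions that tone marking is meant to carry.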

Conclusion 

For African languages, there has been an increase in resources created and curated by people who speak the languages. Masakhane, which means “we build together” in isiZulu, is a grassroots organization whose mission is strengthening NLP research in Africa. By offering tools to train baseline models for a wide range of African languages, they have helped build models for more than 35 African languages. Other organizations like Deep Learning Indaba and Black in AI are attempting to build a sustainable community of AI experts both in Africa and the diaspora. The African Language Dataset Challenge was created to incentivize the creation of datasets for African languages and address the shortage of usable data. That way, African people's rich cultures and languages are represented and protected.
