Artificial Intelligence and the Threat of Minority Language Extinction

The artificial intelligence language model ChatGPT officially supports 95 of the roughly 7,000 languages spoken around the world, meaning that nearly 99% of languages have not yet found a place in the artificial intelligence environment.

Nguyễn Phong Anh

February 14, 2024 | 12:40

OpenAI, the owner of ChatGPT, announced on its homepage that up to 180 million people around the world use ChatGPT every month. Every day, billions of conversations between humans and machines are created. Minority languages are largely missing from those conversations. Language is not only a means of communication but also a cultural treasure, containing the unique wisdom, knowledge, and emotions of a community. In the game of generative artificial intelligence, however, many languages are at risk of being forgotten. Machines learn to "speak" and "listen" through widely used languages. Meanwhile, the voices of small communities with unique languages fall silent and face the threat of extinction.


OpenAI, the owner of ChatGPT, announced on its homepage that up to 180 million people around the world are using ChatGPT every month (Photo: Getty Images).

Large Language Models (LLMs)

Generative AI is one of many research fields within artificial intelligence, and one of its branches has drawn great market interest: large language models (LLMs). In short, a large language model is a machine built to predict what comes next after each word. It is similar to the game show "Wheel of Fortune": the host gives a riddle and the number of letters in the answer, and your job is to guess the answer.

The machines are very skillful at this language game. They use statistical probability combined with context to guess which letter is most likely to fit the answer. At an advanced level, they can come up with the sentences, paragraphs, and ideas most appropriate to answer a question. Like humans, they need vocabulary and knowledge to answer questions. In the language of computer science, these are called data.
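The guessing game described above can be illustrated with a toy sketch. It is not how ChatGPT works internally: real LLMs use neural networks trained on enormous datasets, while this example simply counts which word most often follows another. The corpus and the function names are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical toy corpus, invented for this illustration only.
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept"
model = train_bigram_counts(corpus)

print(predict_next(model, "the"))    # "cat" follows "the" most often in this corpus
print(predict_next(model, "naati"))  # a word absent from the data yields no guess (None)
```

The second call hints at the problem this article is about: a model has nothing to say about words it never encountered in its data.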

According to BBC Science Focus, a model called GPT-3 was trained on 570 GB of filtered text data. This text contains about 300 billion words, equivalent to about 850 million A4 pages printed in 12-point Arial.
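As a rough sanity check on those figures, dividing the quoted word count by the quoted page count gives the number of words assumed per page. The variable names below are illustrative, and the inputs are the rounded figures cited above.

```python
# Rounded figures as quoted in the article (attributed to BBC Science Focus).
WORDS_IN_TRAINING_TEXT = 300e9  # about 300 billion words
PAGES_EQUIVALENT = 850e6        # about 850 million A4 pages

words_per_page = WORDS_IN_TRAINING_TEXT / PAGES_EQUIVALENT
print(round(words_per_page))  # roughly 353 words per page
```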

The numbers are impressive, but this data source is still only a very small part of humanity's information warehouse. Its quality cannot be compared to that of the world's major libraries, and it still contains countless miscellaneous odds and ends. Notably, more than nine-tenths of the data comes from English documents. Other languages such as French, German, Spanish and Italian make up most of the remaining pie. All other languages in the world share a slice as thin as a rice leaf, if they appear at all.

The main feeds for ChatGPT

This imbalance shows up directly in the quality of ChatGPT. The GPT-3.5 and GPT-4 models answer very fluently in English but make many basic errors in Vietnamese, often producing clichéd paragraphs and confusing grammatical constructions.

In short, the higher the quality of the data, the better an LLM's guesses will be. Conversely, the scarcer and poorer the data, the more likely the result is a low-quality language model. As the saying in IT circles goes: if the input is garbage, the output is also garbage.

Even with the most advanced technologies, large language models have yet to capture the richness of human language and culture.


There are currently about 7,000 active languages in the world (Illustration photo).

The "barely alive" languages

According to the International Decade of Indigenous Languages, a program proclaimed by the United Nations General Assembly and led by UNESCO, there are currently about 7,000 active languages in the world. But every two weeks, a language is lost, either because its last fluent speaker dies or because that speaker becomes unable to communicate. By the end of the 21st century, humanity is predicted to witness the disappearance of about 3,000 languages.

The barely alive languages that survive mostly belong to indigenous ethnic minorities. In pursuit of development opportunities, many communities gradually abandon their traditional languages and adopt others. The common languages of wealthier communities have taken on a sizable role in economics, politics, education and technology.

If you were a native of an island in the South Pacific, you might spend the whole day speaking Chinese with tourists, reading newspapers in English, filling in documents such as marriage certificates in French, and speaking with colleagues in Bislama. When would you have time to speak Naati? Perhaps only in your dreams, because you are the last Naati speaker.

In more extreme cases, like that of Native Americans in the late 19th century, the government used violence to force people to give up their languages.

People who lose their languages will lose the opportunity to learn from their ancestors. They drift in a psychological state where they lack identity and origin. As they are not able to recognize who they are, they fail to connect to their community. Alone in a bustling world, they will suffer feelings of helplessness, sadness, bereavement and the risk of losing their roots.

Each lost language leaves a piece missing from humanity's cultural, intellectual, and creative diversity. When one language is lost, human perspectives become more monotonous. When many are lost, human perspectives become biased and distorted. A few mainstream lines of thought in dominant languages will prevail without facing valid and necessary criticism.

Data that is already popular in these dominant languages will become even more popular. Meanwhile, ideas expressed in less common languages will gradually disappear, even though they are also very valuable.

AI: the extension of bias

In 2017, an internal investigation by National Geographic magazine showed that its coverage before 1970 was filled with discrimination against people of color. Established at the height of colonialism, the magazine was heavily influenced by racist ideology.

People of color were always shown in skimpy clothing, especially women. They were portrayed as bizarre, wild, backward-minded and often overwhelmed by modern Western machinery.

Photography historian John Edwin Mason, who took part in the investigation, said that Americans drew their ideas about the world from popular movies like Tarzan and from crude racist caricatures. Similar biases can be seen in French photographs of Vietnam from the early 20th century.

Those times are supposed to be behind us, yet in 2015 a National Geographic photo called "Come up for air" sparked controversy. The photo shows an aerial view of a rooftop in the Indian city of Varanasi. On the roof is a family of more than a dozen people, including some sleeping women and children. There are naked babies as well.

National Geographic was called out for double standards. Critics claimed that if the photo had shown a Caucasian family in the West, the magazine would not have published it, because it would have been sued for invasion of privacy. Since the photo was taken in India, however, the possibility of lawsuits was much lower and there was nothing to worry about.

In the "wild ocean of the Internet," such biased data is as plentiful as plastic waste. "Taught" with millions of texts collected from the Internet, LLMs not only learn language but also absorb biases and inaccurate information. Drawing on what they have learned, they can generate biased and discriminatory answers, especially on sensitive issues such as race, religion, gender and politics.


OpenAI claims that it pays attention to disadvantaged groups and will do everything it can to prevent toxic ideologies, in an effort to create unbiased and ethical artificial intelligence. If this is true, it will be an honorable and meaningful effort for humanity in the current period.

But major online services like YouTube, TikTok, Instagram, and Facebook made the same claims. These firms said they were trying to create a healthy environment, and their users have seen how the results measure up to the words. "It's safer in the forest than on the Internet," as Vietnamese singer Den Vau put it in his song "It's extremely cloudy today".

A survey posted on arxiv.org in January 2024, titled "Thousands of AI Authors on the Future of AI", shared many interesting predictions. For example, there is at least a 50% chance that by 2028 AI will be able to create songs indistinguishable from those of major artists, or automatically build a payment website from scratch. The poll gathered responses from 2,778 leading artificial intelligence researchers around the world. The survey is conducted annually, and each year researchers predict earlier dates for AI's future milestones. This suggests that AI is developing faster than the experts predicted.

If the predictions are correct, anyone who knows how to use AI can be a musician or a writer. It also means that if biased or erroneous information gets through the AI filters, it can be replicated many times over. In an environment full of such biases and deviations, discrimination and disagreement between communities and cultures will grow stronger. Ultimately, vulnerable people and their rights will be seriously harmed.

Challenges for the lesser-known languages

As mentioned, nearly 99% of languages have not yet found a place in the artificial intelligence environment, and creating LLMs for them is difficult. First, many minority languages do not have enough of the text or speech data needed to train a language model, meaning data that is high quality, diverse, and representative of the language.

Second, even where linguistic resources are abundant, collecting them is not simple. It takes a strong force of linguists, ethnologists, and historical and cultural researchers to collect, evaluate, and verify data carefully and professionally. Where can we find enough social scientists and humanities scholars to digitize the nearly 7,000 remaining languages?

Third, many minority communities also do not have access to the Internet, and the amount of data they create about their own people is insignificant.

Fourth, minority languages often have very different linguistic structures and vocabulary compared to major languages. New language models will be needed to accommodate unique languages.

And finally, funding. Who will pay for such a difficult task? Today's large language models are built by private companies. These companies care about making a profit, while the profitability of languages that few people speak is very uncertain.


However, there is still hope. LLMs like ChatGPT are getting smarter, requiring less data while still giving accurate results. In addition, AI can help linguists restore languages that are on the verge of disappearing.

According to the website Statista, the global AI market was worth about US$207.9 billion in 2023. By 2030, this figure is predicted to grow three- to sevenfold. With so much funding pouring into the market, there will hopefully be enough resources to create LLMs for lesser-known languages. In the meantime, dedicated individuals are finding ways to connect with each other through social networks to build their own LLMs. Although these are just small efforts, they bring hope for a future where communities can build their own models.

Hopefully those who create AI, especially in the LLM field, will make room for minority languages. These languages are important vessels of culture, which forms the core of human civilization. And unlike algorithms and unfeeling machines, AI creators have human hearts at their core.
