When artificial intelligence opens up to African languages

LETTER FROM AFRICA

Ask ChatGPT to list the names of African countries. So far, so good. Now complicate things a little by asking it in Tigrinya, a language spoken in Eritrea and northern Ethiopia. “The result is nonsense, a mixture of Amharic (another Ethiopian language), Tigrinya and invented words that make no sense in either language,” notes Ethiopian computer scientist Asmelash Teka Hadgu, after putting the conversational robot designed by OpenAI to the test.

The same experiment could just as well have been carried out with Ewe (Ghana, Togo), Yoruba (Nigeria, Benin) or Tsonga (South Africa, Mozambique). The overwhelming majority of the approximately 2,000 languages spoken on the continent are almost non-existent on the Internet and are therefore poorly recognized, if at all, by artificial intelligence (AI) systems such as ChatGPT, Google Translate or Siri. They are called “low-resource” languages, in contrast to a handful of “high-resource” languages, led by English, that now dominate the global Web.

Like Asmelash Teka Hadgu, a growing number of African entrepreneurs and researchers are now working to fill these gaps. Based in Berlin, the Ethiopian co-founded the start-up Lesan in 2019, dedicated to the languages of his native country. The company has developed a machine translation tool between Tigrinya, Amharic and English, and plans to add Oromo and Somali soon. Unable to rely on a large pool of online resources (there are, for example, only 15,000 Wikipedia articles in Amharic, a language spoken by 30 to 50 million people), the team has to be creative in how it collects data.

Much of the data is gathered from books, magazines and documents with the help of local contributors. They identify the most relevant content, then scan and translate it, with the help of an optical character recognition system. “It requires a lot of work, especially manual work,” the businessman acknowledges. “But we found that it is possible to build a high-quality model based on small, carefully selected data sets.”

The relevance of the methodology called into question

Technology giants also say they want to help promote these underrepresented languages at a time when, according to experts, almost 7,000 languages in the world are threatened with invisibility or even digital death. Version 4 of ChatGPT includes some of them, such as Icelandic. Google Translate, for its part, added around fifteen African languages in updates in 2020 and 2022. But the quality of the translations offered is often inadequate, and African researchers question the relevance of a methodology that does not take into account the specific features of African languages.
