



These days, when Hindi speakers need to search for content on the internet, they can simply type a query in Devanagari script on their mobile phone or speak the same query as a voice command. But what about those who communicate in a language spoken by only a few hundred thousand people, or one with a limited online presence?

“We use technology to tackle low-resource languages, but we believe that these communities are marginalized in other ways, so they have some idea of their needs and wants,” Kalika Bali of Microsoft Research told indianexpress.com. Bali is an expert in natural language processing, where she combines linguistics and artificial intelligence to train computers to understand spoken and written language.

Kalika Bali, a researcher at MSR India, is an expert in natural language processing and leads Project ELLORA. (Photo by Praveen Pillai of Microsoft)

Bali said the main purpose of ELLORA is to ensure that these languages, which have few written resources, let alone a digital presence, are not left behind by the advances that language technology has recently seen through artificial intelligence (AI) and advanced natural language models. More importantly, a digital presence may help some of these languages survive the threat of extinction.

Microsoft Research (MSR) has decided to focus on three of these languages for now: Gondi, with nearly 3 million speakers across Madhya Pradesh, Maharashtra, Chhattisgarh, Andhra Pradesh and Telangana; Mundari, spoken in Jharkhand, Orissa and West Bengal; and Idu Mishmi, spoken in Arunachal Pradesh.

Bali says Gondi is the language on which the company has done some of its longest-running work, with CGNet Swara as a partner in Chhattisgarh. CGNet Swara is an online portal where Gondi speakers can report local news in their own language over the phone.

“We helped with projects such as Adivasi Radio, which was a hub for accessing information over the phone in Gondi. We are working with them to create a machine translation system, because one of the biggest problems is accessing information in one’s native language,” Bali says.

MSR plans to test this machine translation system in the field soon. If it works well, Gondi speakers will be able to access information available in Hindi in their own language. In Arunachal Pradesh, MSR is working on a digital dictionary of Idu Mishmi in partnership with Pratham Books.

Dr. Meenakshi Munda recorded audio samples of Karya’s text to help build a text-to-speech model for Mundari. (Photo by Sunil Bisoyi of Microsoft)

For Mundari, MSR partners with IIT-Kharagpur and GIZ, the German Development Fund. “For Mundari, the tasks are specific. We create teaching materials for children because there are very few resources available. The idea is to create an entire pipeline. We are working on a text-to-speech model that allows systems to speak in Mundari. We are also working on a machine translation model. In fact, a small machine translation model is ready,” Bali said, adding that she is currently testing the model and plans to work on speech recognition as well.

Ultimately, the idea is to deploy the entire system in Mundari, allowing speakers to speak, listen, and type on the phone to access information and use technology in their own language. Bali also emphasized that models for languages like Mundari do not rely on word-for-word translation. Rather, the team asks native speakers to translate Hindi sentences into their own language, creating the resources and datasets needed to train the computer models.

One of the tools developed as part of this effort is called Interactive Neural Machine Translation (INMT). When someone is translating between these languages, say from Hindi to Mundari, it helps predict words by offering suggestions in Mundari itself. “It’s similar to the predictive text you get on your smartphone keyboard, except that it works across two languages,” Bali explained, adding that such tools also improve the efficiency of human translators.
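To make the idea of predictive text across two languages concrete, here is a minimal sketch. It is not how INMT works internally, which the article does not describe; it replaces the neural model with a tiny hand-written lexicon, and it uses romanized Hindi to English only because no Mundari vocabulary is given in the article. The lexicon entries, the scores, and the function name `suggest` are all illustrative assumptions.

```python
# Hypothetical toy lexicon: romanized Hindi word -> candidate English words.
# The scores are made up for illustration; a real tool would get them from a model.
LEXICON = {
    "paani": [("water", 0.9)],
    "ghar": [("house", 0.6), ("home", 0.4)],
    "bada": [("big", 0.7), ("large", 0.3)],
}


def suggest(source_sentence, typed_so_far="", partial_word="", k=3):
    """Suggest up to k target words for the word the translator is typing.

    source_sentence: the sentence being translated (source language).
    typed_so_far:    target-language words the translator has already accepted.
    partial_word:    the first letters of the word currently being typed.
    """
    used = set(typed_so_far.lower().split())
    scores = {}
    for src in source_sentence.lower().split():
        candidates = LEXICON.get(src, [])
        # Skip source words that already have a translation in the typed text.
        if any(tgt in used for tgt, _ in candidates):
            continue
        for tgt, score in candidates:
            if tgt.startswith(partial_word.lower()):
                scores[tgt] = max(scores.get(tgt, 0.0), score)
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


if __name__ == "__main__":
    print(suggest("bada ghar"))              # suggestions before typing anything
    print(suggest("bada ghar", "big", "h"))  # suggestions after accepting "big"
```

The point of the sketch is the interaction pattern: suggestions depend on both the source sentence and what the translator has already typed, which is what distinguishes this from a single-language keyboard.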

Of course, there’s also the challenge of making sure the models work on low-end phones. Given that people in marginalized communities mostly have access to low-end phones, models need to be optimized with this important factor in mind. “We’ve spent a lot of time figuring out how to create, distill, and quantize these models into smaller models that actually work on phones,” Bali explained.
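As a rough illustration of what quantizing a model means, the sketch below stores a list of floating-point weights as 8-bit integers plus one shared scale factor, which cuts storage to roughly a quarter of 32-bit floats at the cost of a small rounding error. This is a generic symmetric int8 scheme chosen for illustration, not the method the MSR team uses; the function names and sample weights are assumptions.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Avoid dividing by zero when every weight is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.27]  # made-up sample weights
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, scale, worst)
```

Each restored weight differs from the original by at most half the scale factor, which is why a quantized model can stay close to the original while fitting on a much smaller device.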

On the current topic of large language models (LLMs) and their role in translation tools, Bali said that she has also tested public LLMs for some research. But getting these models to work in languages with limited or no datasets requires more work. “How to pivot these LLMs to work in some of the smaller languages is an open research question. The answer may lie in creating another layer on top of this technology. Or maybe it’s in actually having enough data to feed into the base model. I don’t think we are too sure. It’s open research to see how we do this,” she said.

For now, the ultimate purpose of Project ELLORA is clear: not to widen the gap between the linguistic haves and have-nots.

