India is a country of linguistic diversity, with 22 official languages and hundreds of regional dialects. This poses a challenge for developing artificial intelligence (AI) models that can understand and communicate in different Indian languages. Experts Dr Manish Gupta, Director-Research, Google, and Kalika Bali, Principal Researcher, Microsoft Research India, shared their insights on how to fine-tune AI models for understanding languages in India at a recent webinar.
According to Dr Gupta, one of the key aspects of fine-tuning AI models for understanding languages is to have high-quality, large-scale data sets that capture the linguistic variations and nuances of different languages and regions. He said that Google has been investing in creating such data sets for Indian languages, and pointed to benchmarks such as IndicGLUE, which evaluates natural language understanding (NLU) models on a range of tasks across 11 Indian languages.
Dr Bali agreed that data sets are crucial for building robust and accurate AI models for Indian languages. She added that the data sets need to be extremely nuanced, capturing dialectal variation down to the level of individual districts of India. She gave an example of how Microsoft Research India has been working on creating a multilingual speech corpus for Indian languages, called M-AILABS, which covers 14 languages and 175 accents.
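One concrete reason such nuance matters is that Indic scripts do not map one code point to one user-perceived character: a single akshara (syllable) like स्ते spans several code points, so naive length or token counts mislead. The sketch below (a rough, illustrative heuristic using only Python's standard library, not any method described by the speakers) contrasts code-point counts with an approximate akshara count for Devanagari text.

```python
import unicodedata

def akshara_count(text: str) -> int:
    """Rough count of user-perceived syllables (aksharas) in Devanagari text.

    Heuristic: dependent vowel signs and other combining marks (categories
    Mn/Mc) attach to the previous base letter, and a base letter preceded
    by a virama (U+094D) joins the previous akshara as a conjunct.
    """
    count = 0
    prev_virama = False
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Mc"):
            # Combining mark: remember whether it was a virama.
            prev_virama = (ch == "\u094D")
            continue
        if not prev_virama:
            count += 1  # new akshara starts at this base letter
        prev_virama = False
    return count

word = "नमस्ते"
print(len(word))            # 6 code points
print(akshara_count(word))  # 3 aksharas: न, म, स्ते
```

A production system would use full Unicode grapheme-cluster segmentation (UAX #29) rather than this simplified rule, but the gap between the two counts already shows why data sets for Indic scripts need script-aware processing.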
Both experts also emphasized the importance of leveraging the latest advancements in AI, such as deep neural networks (DNNs) and large language models (LLMs), to fine-tune AI models for understanding languages in India. They said these techniques can help improve the performance and generalization of AI models across different domains and tasks, such as speech recognition, machine translation, sentiment analysis, and question-answering.
Dr Gupta and Dr Bali concluded that fine-tuning AI models for understanding languages in India is not only a technical challenge but also a social and cultural one. They said that AI models need to be sensitive to, and respectful of, the linguistic diversity and preferences of their users, and must not introduce or amplify biases or discrimination. They urged the AI community to collaborate and innovate to create AI models that can empower and enrich the lives of millions of Indians.
One of the challenges of fine-tuning AI models for understanding languages in India is the lack of standardization and uniformity in the scripts and orthographies used for different languages. For instance, Hindi is typically written in Devanagari, Urdu in Nastaliq, and Punjabi in Gurmukhi, yet each of these languages can also appear in other scripts, including romanized text on the web. This can create confusion and inconsistency for the AI models, as well as for the users who interact with them.
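As a small illustration of how script variation surfaces in text processing, the snippet below identifies which Unicode scripts a string uses by inspecting character names in Python's standard library. The function name and the name-prefix heuristic are illustrative, not a production language-identification method.

```python
import unicodedata

def detect_scripts(text: str) -> set:
    """Return the set of Unicode script names used by the letters in `text`."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Unicode character names begin with the script name,
            # e.g. "DEVANAGARI LETTER PA", "GURMUKHI LETTER PA".
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0])
    return scripts

print(detect_scripts("पंजाबी"))   # {'DEVANAGARI'}
print(detect_scripts("ਪੰਜਾਬੀ"))   # {'GURMUKHI'}
```

The same word for "Punjabi" yields different script labels depending on how it was typed, which is exactly the kind of inconsistency a model or data pipeline has to reconcile.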
To address this issue, some experts have suggested using Unicode, a universal standard for encoding and representing text in different languages, as a common platform for developing and deploying AI models for Indian languages. Unicode can help ensure the compatibility and interoperability of the AI models across different devices and platforms and preserve the linguistic and cultural diversity of the languages.
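Even within a single script, Unicode allows the same visible character to be encoded in more than one way, which is why normalization is part of making models interoperable. The sketch below shows a standard-library example for Devanagari: the letter क़ exists both as one precomposed code point and as क plus a combining nukta, and the two compare unequal until normalized.

```python
import unicodedata

# Two visually identical spellings of the Devanagari letter QA:
precomposed = "\u0958"       # क़ as a single code point
decomposed = "\u0915\u093C"  # क (KA) followed by a combining nukta

print(precomposed == decomposed)  # False: different code-point sequences

# Normalizing both strings to the same form makes them comparable.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)             # True
```

Running every input through one normalization form before training or inference is a common way to keep data sets and models consistent across the devices and platforms the article mentions.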
Another challenge of fine-tuning AI models for understanding languages in India is the lack of availability and accessibility of the data sets and models for the users and developers who need them. While there are some initiatives and efforts to create and share open-source data sets and models for Indian languages, such as IndicNLP, IndicGLUE, and M-AILABS, there is still a gap between the demand and supply of the resources and tools for Indian languages.
To bridge this gap, some experts have advocated for a collaborative and inclusive AI ecosystem for Indian languages, in which stakeholders such as the government, academia, industry, and civil society work together to create and disseminate high-quality, diverse data sets and models. Such an ecosystem, they argue, would also foster innovation and entrepreneurship in the field of AI for Indian languages.