IBM engaged in research to arm computers with Indian languages

IBM is engaged in research on Indian languages in a bid to develop new artificial intelligence applications in key industries

IBM’s natural language processing road map in India includes Bengali, Tamil, Telugu, Kannada, Marathi, Gujarati and Punjabi

Last Updated: Apr 19 2021 | 6:10 AM IST

In March last year, IBM announced the first commercialisation of key natural language processing (NLP) capabilities from IBM Research’s Project Debater, its artificial intelligence system capable of debating humans on complex topics.

While the technology giant builds on its language capability, which feeds into its cognitive intelligence platform Watson, a partnership with IIT Bombay is also advancing its NLP work in India. There, a team of researchers is studying the myriad complexities of India’s diverse languages.

“IBM’s AI portfolio is built around automation, NLP and trust,” says Gargi Dasgupta, director, IBM Research India and chief technology officer, IBM India/South Asia.

IBM has expanded its AI-powered automation capabilities with IBM Watson AIOps, which builds on underlying technology developed by IBM Research. Following the first commercialisation of key NLP capabilities from IBM Research’s Project Debater, Watson is even more advanced in the nuances of human language, identifying and analysing idioms and colloquialisms for the first time.

“There are a couple of things that differentiate enterprise NLP from consumer NLP. Explainability and trust for the business to adopt and to make use of NLP is very important,” says Dasgupta.

This differentiates the work IBM does from, say, Google Translate, whose translation systems are more consumer-facing.

As part of the project with IBM, IIT teams are studying techniques for knowledge representation across documents, graphs, charts, and other forms of multi-media content. This area of research will be critical in helping to develop new AI applications in key industries such as financial services, retail and health care, which rely heavily on rich, multi-modal content.

IIT Bombay has a lab for natural language processing, the Centre for Indian Language Technology (CFILT), set up in 2000 with funding from the ministry of electronics and information technology (MeitY). The lab has worked with many ministries, government organisations and leading companies, including IBM.

“Natural language processing requires a huge amount of data to apply machine learning,” says Pushpak Bhattacharyya, professor in the department of computer science and engineering at IIT Bombay.

One way the lab gets data is through web scraping for languages like Bengali and Marathi. There are annotators and translators in the team who work from different parts of the country, and create datasets of parallel sentences which are important for machine translation.

Other sources include the Constitution of India, which is multilingual. Many parliamentary proceedings are also available in multiple languages and add to the project’s database.

Once the data is there, “the key idea is to augment the resource by sub-wording, or breaking the word into its parts. We use help from other languages. For example, a Marathi to Bengali machine translation can be helped by Hindi coming in as an intermediate assisted language, and using linguistic properties which are inherent in the sentence, to embellish the training data,” says Bhattacharyya.
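The article does not say which sub-wording method the lab uses. As a rough illustration only, the sketch below assumes byte-pair encoding (BPE), one widely used technique: it repeatedly merges the most frequent adjacent pair of symbols so that common word parts become single units. The function names and the toy English word list are hypothetical and chosen purely for readability.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (illustrative only)."""
    # Each word starts as a sequence of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into sub-word units by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy corpus (hypothetical); real systems learn merges from large text collections.
toy_words = ["low", "low", "lower", "lowest"]
merges = learn_bpe(toy_words, num_merges=2)
pieces = segment("lowest", merges)
```

Because the units are learned from frequency, a word never seen in training can still be broken into known pieces, which is one reason sub-wording helps when parallel data is scarce.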

“Recently, an important government of India project called ‘Bahubashak’ was launched. It’s a speech-to-speech machine translation. One of the explicit goals of this project is to engage startups in creating data. Under this project we have engaged with startups, which will create parallel data for us,” he adds.

The Bahubashak project, launched last year, is part of the National Language Translation Mission (NLTM), which is managed by the principal scientific advisor and MeitY. The main objective of Bahubashak is to have Indian language technology systems and products deployed in the field with the help of startups. The coordinating institutes provide technical and research support for the deployment of these technologies through startups.

IBM’s NLP roadmap in India includes Bengali, Tamil, Telugu, Kannada, Marathi, Gujarati and Punjabi. The partnership with IIT Bombay is one of the projects working on the language piece.

In the end, as part of its pillar of trust in AI systems, Dasgupta says, “the overarching goal should be to minimise and reduce biases.”