With the success of ChatGPT and other generative artificial intelligence platforms, the push to build Indian language large language models (LLMs) is gaining momentum. Though there have been several successes in creating Indian language LLMs, the sheer variety of languages spoken in the country means other techniques need to be explored.
Industry players and AI practitioners believe that techniques like ‘transfer learning’ could be one way of solving the language conundrum.
Transfer learning means a model pre-trained on one task or language is reused as the starting point for a model on a second task.
Citing the example of the human brain, Abhishek Upperwal, founder of Soket Labs, a homegrown AI research firm, said at a recent event, “The way our brain works is that we have a linguistic engine which is separate from our knowledge engine. So if tomorrow I learn French, I’ll be able to very easily translate my knowledge and converse with my friend in French using all the facts that I have already learned.”
He said that in the case of LLMs too, one architectural component can handle the linguistic part while another handles the knowledge part.
“So it would be like an adapter. You keep on adding new languages to the model and the model would just retain the knowledge, and you can even separately train the model to enhance its knowledge capability. So that is where I see GenAI research going forward,” Upperwal said during a panel discussion at the Global India AI Summit early this month.
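One possible way to realise the adapter idea Upperwal describes is to train a small bottleneck module per language while the pre-trained weights stay frozen. The sketch below is a minimal, hypothetical PyTorch illustration, not Soket Labs’ actual architecture; the layer sizes and the stand-in base block are assumptions.

import torch
import torch.nn as nn

class LanguageAdapter(nn.Module):
    # A small bottleneck module trained for one language; the base model's
    # weights, which hold the general "knowledge", are left untouched.
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter only adds a small language-specific correction.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Stand-in for one block of a pre-trained model; freeze its "knowledge" weights.
base_block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in base_block.parameters():
    p.requires_grad = False

# Only the adapter's parameters are updated when a new language, say Hindi, is added.
hindi_adapter = LanguageAdapter(hidden_size=768)
x = torch.randn(1, 16, 768)           # a batch with 16 token embeddings
out = hindi_adapter(base_block(x))    # frozen block followed by the trainable adapter

Parameter-efficient methods such as LoRA and adapter layers follow the same principle of updating only a small set of language- or task-specific weights on top of a frozen model.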
These techniques, he believes, can to some extent address the scarcity of non-English datasets, the high compute requirements and the high costs.
How does transfer learning help?
Founders of Indian multilingual LLMs are stressing the importance of transfer learning, coupled with refinements in model algorithms and architecture, to deal with the scarcity of data in languages other than English and the high cost of training.
Transfer learning allows a model trained on one task or dataset, such as an English corpus, to be fine-tuned on a new task or language with far less data, leveraging its pre-existing knowledge for more efficient learning.
It is more like learning the fundamental concepts and contexts of languages and using that knowledge to express the same information in other languages, rather than translating word by word or building a model from scratch – both of which require far more tokens and money.
Tokenisation, in the context of LLM training, is the process of converting text into smaller units, such as words or sub-words, that the model can process and analyse.
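As a rough, hypothetical sketch of what such fine-tuning can look like in practice – using the open-source Hugging Face transformers library, a placeholder English base model (gpt2) and an assumed local file hindi_corpus.txt – most of the pre-trained layers are frozen and only a small part of the model adapts to the new language:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Reuse a model already pre-trained (largely on English) instead of starting from scratch.
model = AutoModelForCausalLM.from_pretrained("gpt2")        # placeholder base model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Hypothetical Hindi corpus; far smaller than what training from scratch would require.
dataset = load_dataset("text", data_files={"train": "hindi_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# Freeze everything except the last transformer block: the pre-existing "knowledge"
# is retained, and only a small number of parameters adapt to the new language.
for name, param in model.named_parameters():
    param.requires_grad = "transformer.h.11" in name        # gpt2 has blocks h.0 to h.11

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hindi-finetune",
                           per_device_train_batch_size=4, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()

Because the pre-trained weights are reused, the new language needs only a fraction of the data and compute that training from scratch would demand.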
Innovative approaches for multilingual LLMs
Vishnu Vardhan, founder of SML India, the parent company of Hanooman, said existing GPT architectures, trained with billions of parameters and trillions of tokens, consume vast amounts of data to achieve human-like capabilities. “However, many medium-resource and low-resource languages lack the large volumes of digitised text needed for similar performance. By designing data-efficient architectures that require fewer tokens yet achieve English-like performance, we can rely on sufficient amounts of clean, curated text. This text should be domain-specific and diverse, reducing the need for large amounts of noisy data.”
Vardhan believes that transfer learning, combined with architectural and algorithmic changes, can to a large extent replace the need to train models on local datasets.
“However, we need to modify existing architectures to make them easily and quickly adaptable to several unseen domains,” he added.
Challenges of using LLMs for Indic languages
Soket Labs’ Upperwal believes that using LLMs like ChatGPT for Indic languages is more expensive because the tokenisation algorithm is biased towards English, making the processing of these languages less efficient and more computationally demanding.
“Because we have a large amount of data primarily in English on the internet, the algorithm or the system becomes biased towards English,” said Upperwal. “So, at the core, whenever we are tokenising – converting a sentence like ‘hi, how are you’ into mathematical form – the compression is highest for English, but for Indic languages the compression is very, very bad. Because of that, the compute that has to go into generating any sort of response from these models goes up and, because of that, obviously the cost will go up.”
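The imbalance is easy to verify with an off-the-shelf tokeniser. The snippet below is a small illustrative check using the open-source tiktoken library and its cl100k_base encoding; the Hindi sentence is an assumed rough equivalent of the English one, and the exact counts depend on the tokeniser chosen.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokeniser used by several recent OpenAI models

english = "hi, how are you"
hindi = "नमस्ते, आप कैसे हैं"   # roughly the same greeting in Hindi

print(len(enc.encode(english)))   # a handful of tokens for the English sentence
print(len(enc.encode(hindi)))     # typically several times as many tokens for the Hindi one

# More tokens per sentence means more compute per response, and therefore higher cost
# when serving Indic languages through a tokeniser optimised for English.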
“So, if we have to be more efficient, be more context-aware, we definitely have to go and build our own sovereign products,” he said.