‘Accuracy in AI is a function of availability of quality data … building NLP tools for low-resource Indian languages is hard’

Natural Language Processing (NLP) is a fascinating part of artificial intelligence. Gargi Dasgupta, director of IBM Research India, one of IBM’s 12 research labs worldwide, talks to Chandrima Banerjee about NLP in the context of the country’s complex language environment:

How does NLP work?

Natural Language Processing is the ability of a computer or machine to understand human language as it is written, along with its different characteristics. We break NLP into four stages: understand, classify, retrieve and generate.

So, when I say that ‘apple is really good for health,’ the grammatical rules form the first level of understanding. The second is semantic. When I say apple, you think about the fruit. About 80% of the time, I might be talking about the fruit but 20% of the time, I might be talking about the company. But the rest of the sentence, ‘is really good for health’, implies it is a consumable product and not the company. That is the beauty of semantic understanding – often, one cannot truly understand a word independent of its context.
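The apple example can be made concrete with a toy calculation. The sketch below is only an illustration, not how IBM's models work: it starts from the rough 80/20 split mentioned above and lets context words shift the balance. The cue-word lists, the `cue_boost` parameter and the function name are assumptions made up for this example.

```python
# Prior belief about what "apple" means, using the rough 80/20 split from the interview.
PRIOR = {"fruit": 0.8, "company": 0.2}

# Hypothetical cue words for each sense (invented for this illustration).
CONTEXT_CUES = {
    "fruit": {"health", "eat", "juice", "sweet", "tree"},
    "company": {"iphone", "shares", "stock", "launch", "ceo"},
}


def disambiguate_apple(sentence, cue_boost=5.0):
    """Return a probability for each sense of 'apple' given the sentence.

    Each cue word found in the sentence multiplies that sense's score by
    cue_boost (an assumed value); the scores are then normalised to sum to 1.
    """
    tokens = {tok.strip(".,!?'\"").lower() for tok in sentence.split()}
    scores = {}
    for sense, prior in PRIOR.items():
        hits = len(tokens & CONTEXT_CUES[sense])
        scores[sense] = prior * (cue_boost ** hits)
    total = sum(scores.values())
    return {sense: score / total for sense, score in scores.items()}


if __name__ == "__main__":
    print(disambiguate_apple("Apple is really good for health"))
    print(disambiguate_apple("Apple shares rose after the iPhone launch"))
```

With no cue words in the sentence, the function simply returns the prior. A word like "health" pushes the estimate firmly towards the fruit, which is the point of the example: the rest of the sentence decides the meaning.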

After this, the second stage is to classify the text into higher level constructs – sentiments, paragraphs, tables, graphs, and so on. The third stage is retrieval of documents based on questions a user asks. And the final stage is to generate text summaries from available information.

Does it work beyond English?

NLP tools for languages like English, French and German benefit from a lot of data in news articles, web pages etc. Data is a big challenge in creating language models for other languages, such as those from Asia and Africa. To learn by example, a model requires that you give it lots of sentences to understand. But in Indian languages, the biggest data sets might run to only a few thousand examples. So, building NLP tools for low-resource languages, where large data sets are not available, is a hard research problem.

How do you work around that?

This is where techniques like transfer learning come in: what is learned from high-resource languages is transferred to a target low-resource language. What we do is try to find groups of languages that are similar and have sentence structures in common. We put them together and try to learn their overall behaviour. And then we use a little bit of data from the target language to release a model that tries to understand that Indian language.

Is it always accurate?

Accuracy in AI is a function of the availability of quality data. The expectation that AI will work 100% of the time is absolutely unrealistic. We celebrate when we get models to be up to 70% right. That means seven out of ten times, I understand what you are saying. The three times I don’t understand, I will give you a bad answer and you will correct me, saying ‘that is not what I meant, this is what I meant.’ That is called feedback learning. It creates a continuous learning loop. Over time, we close the gap. So, we have made progress. If we didn’t have transfer learning, we couldn’t have made progress in understanding languages natively.
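The correction loop described here can be sketched in a few lines. This is a minimal illustration that swaps in a simple perceptron-style update over words; the interview does not say which learning method is actually used. The class name, the labels and the example sentences are all hypothetical.

```python
from collections import defaultdict


class OnlineIntentModel:
    """Toy classifier that learns only from user corrections (perceptron-style)."""

    def __init__(self, labels):
        self.labels = list(labels)
        # One weight per (label, word); every weight starts at zero.
        self.weights = {label: defaultdict(float) for label in self.labels}

    def _score(self, label, tokens):
        return sum(self.weights[label][tok] for tok in tokens)

    def predict(self, text):
        tokens = text.lower().split()
        # On a tie, the label listed first wins.
        return max(self.labels, key=lambda label: self._score(label, tokens))

    def feedback(self, text, correct_label):
        """The user says what was meant; adjust weights only if the guess was wrong."""
        tokens = text.lower().split()
        predicted = self.predict(text)
        if predicted != correct_label:
            for tok in tokens:
                self.weights[correct_label][tok] += 1.0
                self.weights[predicted][tok] -= 1.0
        return predicted


if __name__ == "__main__":
    model = OnlineIntentModel(["billing", "network"])  # hypothetical labels
    print(model.predict("my internet is down"))        # untrained guess
    model.feedback("my internet is down", "network")   # user corrects the model
    print(model.predict("internet down again"))        # guess after one correction
```

Each wrong answer followed by a correction nudges the weights, so similar sentences are more likely to be understood the next time. That is the sense in which the gap closes over time.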

How is language processed natively?

Processing language natively means understanding language constructs, entities and their relations, synonyms, antonyms, phrasal verbs, the overall sentiment etc. An alternative approach is translation. You take a Hindi sentence, translate it to English, get it answered in English and then translate the answer back to Hindi. That’s called addressing a native language through translation. But at IBM, that’s not what we’re talking about. We are talking about really understanding the language, which means its sentence structure, grammar and other nuances.

What are the challenges for multi-language processing in a country like India?

Most speakers mix languages when they speak. This is called code mixing, and it creates additional challenges for understanding. One possible way to address this is through speech signals. If I have good models for Hindi and for English and I am parsing a sentence which has an English word, my first approach is to check whether either model says, yes, it recognises the word. If one of them does so with high confidence, I go with that. If neither does, I look at the words before and after it.
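The two-step decision described above (trust a confident model, otherwise look at the neighbouring words) can be sketched as follows. This is one possible reading, working on text tokens rather than speech signals. The tiny word lists, the 0.8 threshold and the rule of inheriting the nearest confident neighbour's language are assumptions for illustration.

```python
# Tiny hypothetical vocabularies standing in for real Hindi and English models.
HINDI_WORDS = {"mera", "nahi", "hai", "kaam", "kar", "raha"}
ENGLISH_WORDS = {"my", "phone", "is", "not", "working", "recharge"}


def lexicon_model(vocab):
    """Return a stand-in 'model': high confidence for known words, low otherwise."""
    return lambda word: 1.0 if word.lower() in vocab else 0.1


def tag_languages(tokens, models, threshold=0.8):
    """Tag each token with a language code.

    Step 1: if the most confident model clears the threshold, use its language.
    Step 2: otherwise borrow the language of the nearest tagged word,
            checking before the word first and then after it.
    """
    tags = []
    for word in tokens:
        lang, conf = max(
            ((name, model(word)) for name, model in models.items()),
            key=lambda pair: pair[1],
        )
        tags.append(lang if conf >= threshold else None)

    for i, tag in enumerate(tags):
        if tag is None:
            before = next((t for t in reversed(tags[:i]) if t is not None), None)
            after = next((t for t in tags[i + 1:] if t is not None), None)
            tags[i] = before or after  # stays None if no word was recognised
    return tags


if __name__ == "__main__":
    models = {"hi": lexicon_model(HINDI_WORDS), "en": lexicon_model(ENGLISH_WORDS)}
    print(tag_languages("mera recharge fail hai".split(), models))
```

In the sample sentence neither stand-in model recognises "fail", so it takes the language of the confidently tagged word just before it. A real system would weigh the surrounding context more carefully than this.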

There is a whole world of spoken Hindi and spoken English in customer care speech records. The agents usually respond in one language, but customers talk in mixed-language sentences. This automatically creates a mapping between multiple languages, as well as labelled data (data whose labels are correctly defined for machine learning training).

What if there isn’t enough data?

AI needs to be trusted, and that trust rests on knowing everything about the data behind it and how that data is evolving. But sometimes, the data is just not there. So, we understand one little part from that little bit of data, the characteristic needed for this AI model, and then generate more data with that characteristic. Because we do this using AI, we simulate data that looks like the real world.



Views expressed above are the author’s own.