Podcast episode: Kalika Bali

May 23, 2022

With Chris Potts and Aasavari Kakne

Giving a TED talk, linguistic diversity, code switching and large language models, the Indian NLP scene, empowering women with language consultation work, Wordle, and "once a linguist, always a linguist".

Show notes

Transcript

Chris Potts:All right! Welcome, everyone! I'm delighted to have Kalika Bali here with us today. Kalika is a Principal Researcher at Microsoft Research India, and she has done incredibly vibrant research on a really wide range of topics, including crowdsourcing with diverse crowds, and also numerous topics in multilingual NLP that span from part-of-speech tagging to sentiment, on up through full chatbots. And all of this research is with the goal of empowering people with technology and also making sure that no one gets left behind by the progress that we're all feeling.

So, we have lots of exciting topics and questions for you, Kalika. I thought I would just dive right in. I really love the TED talk that you gave, which we'll link to, called 'Can language technology help language communities?' And I was just struck and charmed by the way you described yourself as "A linguist by training and a technologist by profession." It's a lovely formulation. Can you say a little bit more about what led you to that formulation?

Kalika Bali:Yes. So, to your question, I am a linguist by training. I am also a lapsed chemist, because my undergrad degree is in chemistry. But, thereafter, all my professional training is as a linguist. I got really interested in technology because when I left York, where I'd gone for my Ph.D. studies, to go teach at the University of the South Pacific in the Fiji Islands, that was the time speech technology was kind of making some waves. People were talking about text-to-speech in an articulatory synthesis paradigm. And speech recognition was still a difficult thing to solve. And I got into text-to-speech synthesis – very, very rule-based, very, very articulatory-synthesis kind of thing. And it was exciting for me. And it helped the technologist in me come forward.

I know this is a cliche, but HAL, Space Odyssey – I'm a science fiction fan, and having watched that as a very young person, as a child in fact, that was what fascinated me. Here was a machine that could talk, and made perfect sense. And the fact that I was a phonetician at the time and speech technology was kind of taking off in a way, led me to take technology more seriously. And also at some point, I realized that I actually wanted to go build things and not just do academic research as a phonetician. So, I think that's how the transition happened.

But once a linguist, always a linguist. Like my supervisor told me: even if you don't understand a language, the minute you are trained as a phonetician, when you're sitting in a group where everybody's talking in a language you don't understand, what you start doing is listening to the pitch patterns, listening to the intonation, and you go, "Oh wow, that's an interesting sound that they made."

So I don't think I've lost touch with my roots as a linguist, but what has definitely happened is that I've become very theory agnostic. As a technologist, I have had to become theory agnostic. I have also become much more practical in the ways I choose what flavor of linguistics and what parts of linguistics can inform my work. So yes, a linguist by training and a technologist by profession. That's how I am, which also means that I'm kind of an animal that doesn't know where it belongs. I'm like a very mixed breed animal.

Chris Potts:But so, once a linguist, always a linguist. But not once a chemist, always a chemist? Or are you still a chemist as well?

Kalika Bali:No, I'm not a chemist anymore!

Chris Potts:When did you lapse?

Kalika Bali:Immediately after my undergrad. At that time, I loved chemistry. It was one of the first things that ignited my interest, because, you know, the whole thing of how an electron is composed and the uncertainty principle and electron clouds, that's what got me really interested in the whole thing. But then, the way it was taught was very dry – it shouldn't be, but it was a bit like that. So I kind of moved away from chemistry. I actually went into journalism for a very short time and then looked at this thing called linguistics. And I was like, "Oh, this sounds interesting!"

Chris Potts:Wait, so, at what point did you discover linguistics? As an undergrad or after?

Kalika Bali:No. For my Masters.

Chris Potts:Masters, ah.

Kalika Bali:It's a very weird story. There's an entrance exam in the University that I ended up doing my Masters in. And I went to pick up the application forms for my sister, who's a biochemist. And I said, "Oh, this is a nice place. I wouldn't mind studying here." So I picked up the prospectus for myself and I looked through it and I thought, "Oh right. This linguistics, I know nothing about, but it sounds interesting, the way they've described it." And ancient history, just because I like to read about ancient history, also sounds interesting. So the entrance exam is held in May, in peak summer in Delhi. It's really hot right now. It's like 47 degrees Celsius, 49. I don't know what it is in Fahrenheit, but it's really, really hot!

The linguistics entrance exam was in the morning, and the ancient history one was in the afternoon. And it was really hot and there were no fans, there was no electricity in the place where we were. So I took the linguistics exam and skipped the ancient history because I just couldn't sit there in the heat anymore. And, voila, I actually made it through. And then, I just went on to do linguistics. It was love at first sight. It really was like, I found what I wanted to do.

Chris Potts:It's always so weird to think about how our lives are basically made of these strange little moments where, on the spur of the moment, or for what seems like a random event, our whole fate is set for us. That's incredible.

So, that's how you ended up being a linguist? And then you hopped from phonetics to speech technology and the rest is history?

Kalika Bali:Yes, absolutely.

Chris Potts:And I was also just very curious for myself. What was it like to give a TED talk? It seems absolutely nerve-wracking to me.

Kalika Bali:It was nerve-wracking. Actually, there was a team that really helped me prepare. They asked me to write down everything and then they went through it. They gave me ideas on how it would be for a layperson. They were there all through the way. There was one person who was just completely assigned to me who was just shadowing me in all possible ways.

But I wasn't prepared for the post-TED thing. So the TED was fine. I went on the stage, I spoke, which was okay. I had a bit of nerves, because I saw all these other people in the session before me. And I was like, "Oh my God, they're such celebrities! And what do I know about anything?" But it went well.

What I did not expect was post-TED. When TED posted the talk, the amount of mail and the number of people reaching out to me was overwhelming. In the beginning, I would actually sit and respond to every mail I got, but within a month or so, I was getting so much mail that I really did not have the time to go through and respond to everyone. So if anyone sent me an email, I'm so sorry if I didn't respond!

And a lot of very interesting connections – people from very different parts of the world, reaching out and telling me about their experience with their languages.

Also, one other thing about TED: they're meticulous about consent as well as references. So I actually had to spend a lot of time digging out actual paper references for everything, and documenting that the people who participated in some of the studies we did had given consent for the photographs we used, or just to participate at all. They are meticulous about it. I had to create this huge folder; every single thing was referenced in it, and we had consent for everything. So, that was another thing. The legal team is amazing.

Chris Potts:Oh, is that why? Because they're kind of functioning to do fact-checking to make sure you've not slandered anyone or something like that?

Kalika Bali:Yes, or said something that is completely wrong, or used people's pictures and personal information, et cetera, without consent.

Chris Potts:For the mail that you got, I'm fascinated by that. Was it a lot of people telling you that Siri doesn't work for them or something?

Kalika Bali:Well, actually I got very few of those kinds of things. I got a lot of mails from people wanting me to do something about their languages, from very different parts of the world. And I really felt bad about saying, "I'm really sorry, but I don't have the bandwidth to do anything." Lots of people wanted to collaborate or just asked for help creating technology for their language.

That was the bulk. I also got hate mails. Some people said like, "Why is Microsoft doing this?" When you are a Microsoft researcher, you get used to some people trolling you.

Chris Potts:The question from this hate mail was "Why is Microsoft working on language technologies that can be used around the world?"

Kalika Bali:Yes. "Why is a Microsoft researcher talking about technologies for low-resource languages?" Yes.

Chris Potts:I don't want to leave the TED talk behind because I really do think it's lovely. We'll link to it. But could you summarize a little bit – the overall takeaway of the talk?

Kalika Bali:Yes. In the TED talk, what I was trying to convey was how technology can actually have a social impact. It can actually impact people's lives, language technology in particular. And how different communities view how language technology can enable them. I tried to give examples from the work that we've done and how, sometimes, you have to take a more creative approach and you might not be doing state-of-the-art things, but it doesn't matter because the ultimate goal is to have an impact on the people who are using it. Yes.

Chris Potts:That's really wonderful. Aasavari, do you want to dive in? Go for it.

Aasavari Kakne:Thank you, Kalika. Around this topic: we saw that in English, BERT-, RoBERTa-, and GPT-based models basically changed the whole landscape of NLP. But just after English, there are other languages spoken by almost as many people, and we didn't see any BERT-like models for them. So a huge part of the world is left behind by that progress.

Kalika Bali:Yes.

Aasavari Kakne:So what do you think is happening there?

Kalika Bali:A lot is happening. This is a chicken and egg problem, right? The reason you'll have such good models for English is because English is the dominant language on the Internet, right? And all these models are basically trained from data scraped off the Internet or available in the digital world. So, most of the other languages have relatively less presence on the Internet. It's really a data problem, but a lot is happening.

You might know that Google has released some IndicBERT models. There is an organization called AI4Bharat, which is working towards creating data and at least some baseline models for Indian languages. Then mBERT, the multilingual models, do pretty okay for some of the languages. I wouldn't say they work as effectively for all the hundred-plus languages they claim to be trained on, but I'm a little optimistic.

Sometimes this is a chasing-SOTA kind of thing. For someone who's got more of a practitioner's point of view, this irritates me: "Fine, you've got a one-point increase or a two-point increase, but how is that actually having any effect on the real world?"

But I'm still very hopeful because a lot of young people really want to do this kind of large language model work. Right. And a lot of them are getting more and more aware of how this is not working across many languages and are putting in the effort to make it work. A lot of the people I see who come from multilingual places around the world are very interested in multilingual models and are working towards those. So, I'm kind of hopeful that things will be very different in a few years.

Chris Potts:Can I pick up on a phrase you used though? This is one thing you might worry about. So you said that the multilingual models do "pretty okay" at languages that aren't English. On the one hand, that's better than nothing. On the other hand, what is it going to take to get us out of this trap of doing pretty okay on everything but the high-resource languages?

Kalika Bali:Yeah. Data. And some actual intentional focus on making things work beyond English and the top few languages. And also evaluation, because we don't even have test benches or test data sets for a lot of the languages that these models are trained on. So, though we might say, "Okay, like a hundred plus languages," maybe we have data sets to test those models for the top 50 languages. So, we don't even know how badly they're doing until we actually put them out to work. And then it's kind of late because they're not going to really be very effective in the field. So, yes. Data and intentional focus on making them better.

Chris Potts:Intentional focus. I like that phrase too. That actually kind of leads me into a question that I sent you beforehand – in your view, imagine the counterfactual world in which the whole field of NLP is focused on the end goal of building technologies to be used in India, as opposed to whatever we're focused on now. If that was our end goal – technologies to be used in India – how would our priorities and how would the field be different?

Kalika Bali:I don't know that the field would be very different in the end goals that we have. I think, as a field, our end goals remain the same: we want to build technology for languages, right? For natural language, as a top entity. But what would definitely happen is, we would have a lot more data for Indian languages. We would have models that actually work for Indian languages. We would have the technology, which would impact people in India, where actually language technology has a bigger use case than in certain other parts of the world.

We have multi-linguality entrenched in our social fabric, right? We have communities which are so isolated and, you know, do not have access to information, education, healthcare, etc. Language technology can really help with those. So, I think what would've happened is we would've had a GPT-3 for maybe Hindi and other Indian languages before we did for English. And, because of the diversity of languages in India, people would've been much more aware about the need to tackle linguistic diversity. Maybe people would actually call out English rather than accepting it as the default language for all.

Chris Potts:But wait. If we just had made the GPT-3 for Hindi, wouldn't it just recreate in India the problem we have with English now, which is that the dominant language would be one we all focused on?

Kalika Bali:I don't mean a GPT-3 Hindi. I mean a GPT-3 for Indian languages.

Chris Potts:Oh yeah. No. So, that would have a very different character. Maybe that would mean that, even from the get-go, even before GPT-3, people would be worried about topics like code-switching and other things that are pervasive in the data.

Kalika Bali:Code-switching, multi-linguality, how it has to work across different languages with different resources.

Chris Potts:I know that Aasavari wants to nerd out with you a bit about code-switching! Do you want to ask, Aasavari?

Aasavari Kakne:Yeah, I'd love to! Indian languages are kind of like sister languages and share a lot of grammar rules and the way sentences are structured. Suppose we had a GPT-3 model in Hindi. Would it be much easier to just branch out into other languages, for example, just fine-tuning over other languages? How would that work? Can we imagine it?

Kalika Bali:Yes. So, some of it is definitely true. Some of this would be easier because we are working with related languages. But then we do have a lot of language diversity within India as well, right?

Some of these things we have tried. For example, like I talk about this in the TED talk also, we created this Adivasi radio for the Gondi language. Now, Gondi is a Dravidian language, but it exists surrounded by Indo-European languages, Indo-Aryan languages. We didn't have a Gondi TTS [text to speech model], but we just fine-tuned a Hindi one that was available and it worked for Gondi.

We are trying to do similar things with machine translation for some of these really low-resource languages. And definitely in some cases, it's better to start from a Hindi model than from an English model. In some cases, it's better to start with the Spanish model than a Hindi model. So it really does depend on the languages that you're tackling and how related they are. Grouping by language families does help to a certain extent, but not fully. Tweaking afterwards with a little bit of data is something that we are experimenting a lot with, and it kind of works. We need to do more work around trying to understand how to do it better.

Aasavari Kakne:That's so awesome! Thank you so much!

Chris Potts:Is the Spanish thing something that you can explain as a linguist?

Kalika Bali:Yeah. So, I think it's to do with the correspondence. In certain cases it works better, for acoustic models or for speech-related things. Spanish tends to have a one-to-one correspondence between sounds and orthography, which is true for almost all Indian languages.

Chris Potts:Oh, interesting. So a very abstract connection.

Kalika Bali:Yeah, and also in some cases, it's just how words inflect with grammatical categories. So gender, number, person are marked in certain Indian languages. There is long-distance agreement also that happens, and those kind of things actually are found in some European languages, but not in English, so it can tend to work better.

Chris Potts:I think this will be fascinating for some students in the course right now because for their final projects, they're testing in various ways whether typology or lexical overlap matter more, when you're trying to do translation or even natural language inference, where you, say, train on one language and evaluate on another.

Kalika Bali:That would be fascinating. I'd love to hear more about what the students found. I think both help, but these are all experimental things to empirically study.

Chris Potts:If both help in the era of these large language models, you should just pre-train on data from both languages and hope they both help, I guess.

Kalika Bali:Which is what is helping with a lot of code-switching things now. If you have the data from both languages in the pre-training phase, and you then tweak it with some code-mixed data, it seems to work fairly well.

Chris Potts:Is this eroding, then, the need to do language identification per token, or even to attend to the code switching, if you can just have one model do it all?

Kalika Bali:So, the jury's still out on that, but it does seem to be the case that if we have a pretty good representation in the pre-trained model of the two languages, then tweaking it with a little bit of code-switch data at the end seems to be giving very good results. But you need language identification for other things as well. Especially in speech, it's a bigger problem, because you have to identify from speech.

Chris Potts:Right, right.

Aasavari Kakne:Especially in informal settings, sometimes even one word can have two languages mixed. So this is a common occurrence across all of India, and I suppose that has presented a lot of challenges.

Kalika Bali:Yes. That is still much more difficult to solve. I'll give you an example from Hindi. You have a word where the root is Hindi, but the ending might be English. So, you have "chalo", which is "to walk", right? And you add the English ending "-ing". So you say "chaloing", and those kinds of things are still difficult to tackle.

Chris Potts:You mentioned language identification – is that a solved task?

Kalika Bali:To a certain extent, yes. But then, how is the language written? For example, a lot of Indian languages are also represented in Roman transliteration, especially in code-mixing on social media or in chats, et cetera. And there, I don't think it's a completely solved problem. And in speech, it's definitely not a solved problem. People think it's a solved problem because we know what to do; we know the steps in the pipeline. Of course, if everyone gave us all the data we ever needed, then everything would be a solved problem. Data becomes the biggest bottleneck, as we all know.
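[Editor's aside: token-level language identification of romanized code-mixed text can be pictured, in a very simplified form, as lexicon lookup. Everything below is a hypothetical toy, not any real system's method; the seed word lists stand in for what would in practice be trained classifiers or large lexicons. Note how a word-internal mix like "chaloing" falls through to "unk", which is exactly the hard case discussed above.]

```python
# Toy per-token language identification for romanized Hindi-English
# code-mixed text. The seed lexicons are illustrative stand-ins for
# what would, in practice, be trained models or large word lists.

HINDI_SEED = {"chalo", "yaar", "nahi", "kya", "bahut", "hai"}
ENGLISH_SEED = {"the", "is", "meeting", "tomorrow", "walk", "late"}

def tag_tokens(sentence):
    """Label each whitespace-separated token as 'hi', 'en', or 'unk'."""
    tags = []
    for token in sentence.lower().split():
        word = token.strip(".,!?")          # drop trailing punctuation
        if word in HINDI_SEED:
            tags.append((word, "hi"))
        elif word in ENGLISH_SEED:
            tags.append((word, "en"))
        else:
            tags.append((word, "unk"))      # e.g. "chaloing" lands here
    return tags

print(tag_tokens("Chalo yaar, the meeting is tomorrow"))
# → [('chalo', 'hi'), ('yaar', 'hi'), ('the', 'en'), ('meeting', 'en'), ('is', 'en'), ('tomorrow', 'en')]
```

A real system would back off to character n-gram models or a classifier for out-of-lexicon tokens; the lookup alone cannot split a mixed word like "chaloing" into a Hindi root and an English suffix.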

Chris Potts:So I don't want to put words in your mouth, but reflecting on the conversation we've just had, it sounds like, in general, you view these large language models as something that could be empowering. There is another story about how they might be centralizing power and even centralizing work on specific languages. But it sounds like you're more optimistic about this. Is that correct?

Kalika Bali:I'm optimistic from a technological point of view that they can help solve the technology problem. But I do think about the conversations that are happening around the centralization of power, of whose language, how are these used, how are they trained? It's almost impossible in computing terms.

I'll talk from a community perspective, right? Before the LLMs became the de facto way of doing NLP, a small community in Finland, for example, or a small community in the Russian Arctic, could decide they wanted to create machine translation for themselves, for their community, for their purposes. And they could just go ahead: there were enough open-source toolkits available, and with a little bit of help from people who could walk them through the entire process – collect this data, put it here, do this, do that – they could actually do it.

With the large models, that's just not possible, right? So either you have to wait for some bigger organization – a bigger university, a big tech company, or somebody who has enough resources – to focus on what you require as a tiny community, which may or may not happen in a hundred years, or you fall back on the older tools, which people may or may not maintain. And do those tools then fit into the entire ecosystem, which has now pivoted towards the larger LLM-based technologies?

Does your stuff talk to others? There's a whole host of problems around that. So I do think that these conversations are really important to have. But I go with the belief that we all want to attack the same problem, and it's not the case that the people who are building technology and building LLMs are interested in doing so in an inequitable manner. Maybe they just need to be more aware of what they're doing and how it impacts a lot of things around it. So I'm all for the conversations. I'm just hoping they create awareness and get people to do something about these things, right?

One of the things that we are very interested in – and I hope a lot more people focus on it, and they are – is quantization of these models. We want these to work on really tiny devices in the field, right? Even if you've got a model that works for, say, Hindi or Bangla, the major languages, I can't use it unless it works in resource-constrained environments, on edge devices or in lower-resource infrastructure, et cetera. There are a lot of people focusing on that, and that's a good thing, because they realize this is the gap we need to bridge.

Chris Potts:I love that perspective. That was part of what I was thinking about when I asked about how the field might shift if we were developing technologies for India, which should just be like, "What if you are much more resource constrained?" And actually that might be more of a focus if we were all more focused on developing technologies as opposed to just pushing ahead on state of the art on benchmarks. And I'm feeling like that's starting to shift in a healthy way as well. People are more aware that a small model could be important. And even if there's a loss in performance, maybe there's an overall gain. And I find that very encouraging overall.

Kalika Bali:Yeah. I agree.

Chris Potts:Aasavari, are there any student questions before we move on to a slightly different topic?

Aasavari Kakne:Not yet.

Chris Potts:So Kalika, I'm sort of curious myself – and Aasavari can pitch as well – to just hear your perspective on the Indian NLP scene. What's going on that's cool? And what are especially junior researchers really focused on? I've noticed that you work with a large range of people, and I'm curious to know what they're thinking and what the new things are and stuff like that.

Kalika Bali:So, I asked people on my team about this. We have a research fellows program, which you might be aware of, where we have pre-doc students – people who have just completed their undergrad. They come in and spend a year or two at the lab doing research. So, we have a lot of research fellows and interns. At that level, people are just as interested in doing whatever anyone else in the world is doing, right? They do want to work on these large models. They want to work on whatever the flavor of the day is, from word2vec to Transformers, et cetera. So, they want to work on the latest tech and learn. A lot of them have entered NLP from ML backgrounds, so they're much more interested in these kinds of things.

But we do also have a couple of linguists who are very focused on learning how NLP can be used in the Indian context. A lot of people are very focused on multilingual NLP because they see in front of them the need for this. So, in the Indian NLP scene, I think for the past two years, there's a lot of focus on Indian language NLP. There are government initiatives, big government initiatives. There are lots of non-profits coming out of academia, the academic world, the big techs, they're all kind of currently focusing on how to create technology that's usable for Indian languages.

But I don't think that, at the student or their junior researcher level, there's much difference in what people want to do from anywhere else in the world, except that there's this multilinguality lens to it. A few of the junior researchers are also quite focused on the fairness aspect of language technology – responsible AI kind of things, especially related to technology built on large models. So, that's also something that people are focusing on a lot.

Chris Potts:That made me reflect a little bit on where students are coming from here in California. I feel like, 10 years ago, the story was almost always that they started in machine learning and then kind of drifted into NLP as an applied area. And a minority would be kind of coming from linguistics or something like that and end up doing NLP. But now in 2022, it feels like they come from all over. Some of them just know about NLP, from high school, which is already a change, but many of them seem to have come from computational social science, psychology. It feels like it's diversified. Is it still really very ML focused as a path in India?

Kalika Bali:It is. And I think it has to do with how linguistics and, say, any of the social sciences interact with computer science in India. They're very siloed even now. There are very few places in India where the social sciences interact with CS. I think with the exception of one or two places, almost no computer science is taught as part of the linguistics coursework, and vice versa. A lot of our very, very bright students who join our fellowship program come from the CS stream with very little idea of how language works; the linguists who know how language works come with very little understanding of how the computational stuff works.

Chris Potts:Wherever they do those exams that you took, you should hand out portable fans to the people for the linguistics exam and get them all to do that one, instead of whatever other thing they were going to do, physics or computer science.

Kalika Bali:Yeah. Well, it's much better now!

Chris Potts:I see.

Aasavari Kakne:Actually that was the case when I was a student four years back also. So, it goes on, but I definitely agree with the point that the mixing of humanities and STEM is very little. The college I attended, it was a complete engineering university. So there's no school of education. There's no school of management, like very little things other than STEM going around. So, that could be one factor definitely.

Kalika Bali:Yeah, yes – very little. There's very little mixing. There are a few premier institutions where this is done, but they can only cater to so many; it's not happening at scale.

Aasavari Kakne:Yeah. We have one student question: when junior researchers are coming into the NLP field, what are certain skills that they should definitely focus on?

Kalika Bali:When they're coming into the field? I think, these days we make a joke that ML is something everybody knows from the minute they're born, because everybody seems to have done a lot of ML stuff. So I think basics there would be necessary.

When I am looking at somebody to work with me, I very strongly look for what I call the language intuition. You don't necessarily have to be trained as a linguist, but you should have some intuition about how language works. What are the peculiarities of this form of human communication? And so those are the things that I basically think about, and it would be a great plus. This is just me, right?

It's always good to know what's happening in the field and related fields. So not just NLP, but, say, NLP plus HCI, NLP plus computational sciences, NLP plus domain-specific things like healthcare. You don't have to know everything about everything, but some idea of how NLP is interacting with other fields, because that's ultimately where the real impact is going to be. That's my view. Others might have very different views.

Aasavari Kakne:Thank you. I think you are the second guest speaker who reaffirmed that diversity in backgrounds is a great thing to have.

Kalika Bali:Yeah.

Chris Potts:And you can see that also in the topics that Kalika listed. And it was interesting to me, the list of topics you gave, because it corresponds closely to what the students in this class want to do – which is like ML things, of course, some standard natural language understanding things, but also low-resource MT and low-resource natural language understanding, and also issues of fairness and bias and so forth. But I'm wondering, for those topics – say, take the case of multilingual, because we've touched on it already – how do you reckon that's different for the students that you interact with versus the ones that are at Stanford, in terms of the kind of focus that it gets or the perspective they bring?

Kalika Bali:Yeah, so I think we have a very big advantage in India, which not everyone at Stanford might have, which is that almost everybody speaks at least two languages. The usual is three to four. So everyone is very aware of at least the differences in languages – that languages do have this variation, languages do differ. We use languages differently, and we live in it, and we live it, right? Everywhere around us, people use different languages in different ways. So I think that's the biggest advantage that we have in India. Even people who have not really thought about it, the minute you ask them a question – "So, would you say this in the same way if you were talking to your close friends?" – you get, "Of course not! I will be talking to them in a mix of English and whatever their native language is." It's a lived experience for them, so it's very hard for them to get away from multilinguality. Very few people don't have that perspective.

Chris Potts:Well, that's fascinating already. Take chatbots or something. If they're supposed to be Californian chatbots, I guess they can just use English with everyone, first name basis with everyone, right out of the gate, all of that stuff. What about in the Indian context? Are there now dozens of new things you need to worry about, in terms of politeness marking and code switching and language background and so on?

Kalika Bali:Yes! And that's why we are sticking to monolingual chatbots, because it's really otherwise difficult! The study that we did on chatbots, where we tried to see code switching and the people's reaction to it, it was really amazing to see the variation in people's response to a code switching chatbot. There was a very clear set of people who were like, "Wow, this is fantastic! It's talking so naturally! It's code switching! This is what I want! A bot that can do that? That's amazing!" But then there was a whole lot of other people who thought, "I don't want the bot to do that."

Why is it so? There were two viewpoints there. One was that, "I don't want the bot to do that because it's trying to be human when it's not." That's a thing that you get everywhere, right? A lot of people have that reaction to any human-mimicking technology: "Why is it trying to be human?"

There's another set of people who thought that a bot should talk properly, whatever that "properly" is. And it should not try to lower the standards somehow by mixing or using first names, et cetera. It has to be more professional and more formal and therefore not do all these crazy things.

So there, right at the start, we have this divide. As a bot maker, how are you going to know what the person's point of view is? And that's why we thought about nudging as one way. You start monolingual, put in some code-mix things, see how the user responds, and then adjust accordingly. But it's a difficult thing to implement.

Chris Potts:I just want to say also that, in general, it is so refreshing that you're doing this work, because throughout NLP, we just don't do nearly enough study of how people apprehend our technology. When we do it, we do it with very narrow, satisfaction-style surveys. We don't ask them questions that get at their ideologies about technology or about language or about what we're doing to them, none of that stuff. You can see in the description you just gave, how much we are missing about the impact we're going to have by not knowing these things, or knowing them too late, once you've deployed something.

Kalika Bali:Yeah. You know, there are lots of stories, especially around data collection, which was something that I feel so strongly about – how we collect data. As researchers, we do research, publish, build on top of it. Technology companies make money out of it. The data providers are invisible in the whole process, right? What do they really understand about our technology? It's hard to explain, even to a technologist, how these models and these technologies work, right? Then how can you even explain that to a simple layperson, especially someone who might be digitally naïve, and not that exposed to technology and the rhetoric around it, and living in rural or semi-urban areas, lower down the pyramid in socioeconomic class?

When we were collecting speech data in one of the Indian languages, Oriya, the way it was explained was that, "You know, on your phone you have these voice assistants. There's things like Google Voice or Siri." (Nobody uses Siri at that level.) "What you're going to give is going to help us do that for Oriya," which is a fairly simple and not incorrect description. What the women mostly understood was that it's going to be their voice that's going to be the voice of the assistant on the telephone. They thought they will be explaining everything about their language, and their state, and their region, and their culture on the phone to them. There itself you see what a difficult problem it is.

We talk about explainable AI in a very different sense, but this portion of explainable AI – helping the people who are providing the building blocks for us understand what exactly is happening – I think that's fairly important.

Chris Potts:This whole topic of crowdsourcing is part of the reason why I was so fascinated to talk to you. That's a nice transition point. I've done lots of crowdsourcing, but I've always done it on standard English using Mechanical Turk. I feel like I'm good at that, but your work has made me aware that, actually, that means that I have almost no expertise in crowdsourcing, in general, because I don't have to confront really any of the issues that you just described, and all the others that you've described in your research. Do you want to say a little bit about how you think about crowdsourcing?

Kalika Bali:Yeah. It's been a learning curve for me as well. I've learned so much in the last couple of years that we've worked with Karya, which is the platform that we have for doing all kinds of crowdsourcing tasks. The thing was that Karya was built to create income avenues for a certain population, a certain demography, which is not exposed to things like Mechanical Turk, et cetera. Language data became the most popular – surprise, surprise! – source of income for that population. Now we've done lots and lots and lots of data collection studies.

We found out how to incentivize people. What makes them want to do this varies so much. Money is definitely a factor. Fair payment for it is definitely a factor, but there are also other things. Pride in their language. Wanting to do something for the language – which, the smaller the community, the larger that pride is. If you go to, say, an engineering college, a top engineering college, and ask a student to give data, they're not particularly interested, you know? If you go to a small village and ask them to give data, they're very interested, overall.

Chris Potts:That's so cool, and it's because they want their language and culture represented in these larger technologies?

Kalika Bali:Yeah. Yeah. Yeah. The other thing is that, we found that gender plays a big role. Intersection of gender in the space is amazing. We thought that, like most narratives around crowdwork, women in certain societies don't have access to going out and working, for whatever social reasons. They might not be allowed or they might not feel comfortable going out and working, and this would be a great place for women to work on. You're on a phone, everyone has a phone. You can do whatever you want. You'll actually get paid for it, et cetera.

Empowering women was one of the big things for us in Karya. But, when you actually dig deeper, which we did with a very extensive study that we did with women workers on Karya, we found out women really want to do this. This is like, in some sense, it gives them agency, whatever it is.

But this whole concept that women can do it anywhere – no. There's household work, there's family stuff, and only then you have the time to do this, right? You finish all your work and only then you make time for this. There's no space. There's no concept of private space where women can sit and do this work, right? We interviewed women who sat in their bathrooms and washrooms and did this. They sat on the roof of the house at night when everybody was sleeping, and did this. A lot of them even did not tell their families that they were doing this. The ones who told their families also had some kind of a reaction from the families, ranging from anger to just people making fun of them. Mocking them – "Ah, you are not going to get paid, right? You've become an earner," and making fun of them in those ways.

But they still wanted to do this, so this must be so important. Even after having all these problems, they did want to do this. Money was only a part of it. Money was definitely a part of it – that they actually got money in hand that was theirs – but they still wanted to do this for all the other reasons, to show the world that they're useful.

That's an insight we would have never got. Then you think about, "Okay, so you found this. So what? What are you going to do about it? This is a social problem." That's a very easy way to brush it aside – that these are problems that are not something that we as technologists have to think about.

If you think in very simple terms, we could provide privacy functionalities. We could provide other functionalities for women who want to keep this, not in the open, not in front of their families. We could provide support groups for women, so that they could discuss the task. We could have forums where women could discuss this. They could help each other out with certain things. There are lots of things that we can technologically do to make it a better thing for the people who are providing all this data. We would be unable to do anything if they didn't do the data giving!

Chris Potts:Right. That's fascinating. It reminds me of one of my teachers in graduate school, Sandy Chung, who's a linguist at UC Santa Cruz. She's done a lot of work on how you might take experimental methods out into the field in a global sense. She has all these things that we should be aware of – like, the Western ideal for an experiment is that everyone is alone, making independent judgements in a private room. That's what we regard as independent judgment. Whereas, in many communities in the world, the normal thing is to do those things communally, in an open space and see how everyone else is feeling. If we just impose the Western ideal on to those communities, even if you could get it to work, you've created a very strange social dynamic that is going to impact the questions you asked and the answers you get. Maybe the smart thing to do is, adapt them to the local environment and culture, and think about the questions from that perspective. I just find that so refreshing, and eye-opening, too, about the things we take for granted, of truisms about science, and how those might be things we should reflect on.

I have to ask one more specific question, just because this is so much on my mind and the minds of my students. How do you handle the issue of payment? What is fair pay for the kind of crowdwork that you're doing?

Kalika Bali:So, obviously the customers of the data want as much data for as little money, free if possible, but we're fighting against that! Till now, what we've done is, in India, minimum wage is a state subject. Each state has its own minimum wage set, and we have worked out, with the minimum wage as the baseline, how much each task is worth. We've got a formula for that, which is above the minimum wage, so we do pay them fairly.

We are also thinking about this concept of crowdworkers having a stake in the data itself. This can be done. In some scenarios, it can be done. It can't be done if you're collecting a little bit of data on your own for a research project, honestly, but crowdworking platforms could do that – the idea of data commons, where actual crowdworkers continue to get some kind of a royalty every time the data's purchased by somebody or the other. There are questions of data sovereignty there that become important. Giving them stake also becomes important.

We're still working around those things. The payment is pegged to whatever the minimum wage is, and above that, based on how much work is done. Even for the simplest tasks, we maintain the minimum wage, which is a lot more than most people pay.

Aasavari Kakne:We have a question about the bias we were talking about. We discussed gender bias in detail, but many times Indian languages have at least two dialects, like an urban dialect and a rural dialect. Sometimes, urban people get emotional about it – that using rural language is, kind of, "bad." How do you deal with such issues?

Kalika Bali:Yeah. That is the same as for code-mixing. People saying that the bot should do the "proper" thing. I think a lot of people are becoming more aware of the need – at least from the technology point of view – the need for representing a lot of dialects.

In the past few years, whatever major projects there were, like a BMGF-sponsored [Bill and Melinda Gates Foundation] project at one of the institutes, Indian Institute of Science in Bangalore, on collecting health, agriculture, finance, some domain-specific speech data for building recognition systems in various Indian languages. They are going after all the dialects of Hindi and some of the dialects of Bangla. They are going after the major dialects. I see that being replicated in a lot of places, because what people are realizing is that these dialects are big enough to be languages. It's just the power dynamics. They're not tiny groups or a minor variation on whatever a standard might be. There are big populations that speak these dialects.

Now, the other way round – I'll tell you another interesting story that we got from the gender study that we did with the women. One of the women, actually, she said that this was great that she was doing this speech recording, because she's always been told, at school and by other people around her, that her dialect is lower class. She doesn't speak the proper one. She speaks a lower prestige version, a dialect. And this work gave her the confidence that this is fine.

So those are biases. I don't know how we deal with them at scale. I just think that by just making sure that they're represented, we go a little bit in trying to address the biases. But everybody wants to learn English and there's a reason for it. It's an aspirational language. There are economic reasons, and you can't say that, "I will use my ability to read, write, and understand a prestigious language or dialect, but you should continue to use your dialect. You have the burden to preserve your dialect, whereas I can do whatever I want."

Aasavari Kakne:That's awesome.

Kalika Bali:You have to be sensitive to that as well.

Aasavari Kakne:Yeah. I think your work is great in the sense that it gives equal platforms to different dialects. So it helps people to see that, "We can't comment on people's dialect being lesser or higher than us."

Kalika Bali:That's the bigger social problem. We are taking very tiny steps towards that.

Aasavari Kakne:That's awesome. I think we have handled all students' questions. Just one more: apart from Hindi and English, what could be other interesting Indian languages that are interesting in code-mixing?

Kalika Bali:Oh, almost all Indian languages code-mix. Every single one of them code-mixes. Bangla, Tamil, Telugu. Telugu has such a lot of code-mixing! Every single Indian language code-mixes. In a lot of places, we think of it as code-mixing with English, but we also have code-mixing between whatever is the dominant language and a non-dominant variant or a non-dominant language of that area. For example, Hindi is code-mixed a lot with many languages. Tamil might be code-mixed a lot with some languages in that area, and some of the minor languages in that area. Standard versus non-standard varieties code-mix. Everyone code-mixes. If you take any of the constitutionally-recognized languages, almost all of them code-mix with English.

Chris Potts:So, Kalika, overall, on balance, do you think that the rise of these smartphone assistants is going to be better or worse or neutral when it comes to the fantastic linguistic diversity of India?

Kalika Bali:I am hoping that it's going to be better! Obviously, there are areas where the languages are endangered, but in most cases, these are very vibrant and robust linguistic communities. I don't see any reason why the Tamil language community is going to disappear or merge into, say, a Hindi-speaking one or an English-speaking one, right? We have very, very robust language communities. Of course, there are smaller languages which have endangered status, but in most cases, we have very vibrant language communities.

Chris Potts:I have to remind myself throughout this entire conversation, since I live in a world of text, but for a lot of the research you're talking about, it's actually on speech data, which seems like adding a whole new layer of difficulty onto an already difficult problem. How does that shape your research?

Kalika Bali:One of the things, of course, is that a lot of the languages that we work with are not written languages, or, if they use a script, there's very little written work available, or they use borrowed scripts. Or there are controversies about which script to use and which not to use. We're taking it step by step. I'll give you an example of this tiny language community that we're working with, called Idu Mishmi, which is very tiny – less than 15,000 speakers in Arunachal Pradesh, which is in the north east of the country.

Now, that community, they actually came to us and wanted to see if we could collaborate with them. Most of the speakers of the language use English or Hindi very efficiently. They have phones, smartphones, et cetera. They're not technologically naive, but for them – if you go there and say, "Oh, I'm going to build technology that's going to change your lives," their response is, "I don't want my life changed. What I want is, I want to be represented in the digital world. I want people to know that we exist. That our language exists, our culture exists. We as a people exist." Right?

The first thing they wanted from us was a dictionary application on the phone, because now they have the state government's mandate that they can actually use the language to teach at primary levels, but they have no content. They don't have any written things. They've been using a Roman transliteration thing, but now they have tones, et cetera. Some scholars had done some work and come up with some IPA-based script for them, which some people are for, and some people are against. That script is not so widespread, and they want content. Here, even working on speech is not going to help, because they actually want written content. We are being very community-led here. If you think about it from an NLP perspective, creating a dictionary application on mobile phones is zero research, right? Then you think of how important it is, and how it will build an ecosystem around it.

Sometimes you have to look at what you can achieve in the short term to keep the long term going, because if we don't build a dictionary app, there's absolutely nothing. If we suddenly say, "Oh, we want to create a machine translation system for you. We want to create a speech recognition system for you because you can speak into the phone and you don't have to worry about these texts," et cetera, but how are we going to do that? What are they going to speak? That's not what they need right now! They need a dictionary so that the teachers have access to it while they're teaching the students.

Chris Potts:It's encouraging in a way, right? Because you could have an impact with some pretty low hanging fruit that you could pluck, as they say.

Kalika Bali:Which is something that I keep stressing to all the people, especially the young people. You often need very low-tech solutions to have a larger impact, which can then set the ball rolling for you to do bigger things, but with the community.

Chris Potts:Right, right. That actually leads nicely into a question I had for you, because you're also kind of an HCI researcher. In your view – and it doesn't have to be in the Indian context, just in general as you look across the field – which areas of NLP really desperately need an infusion of ideas from HCI?

Kalika Bali:I think NLP in general could gain a lot by interacting with HCI! In general, HCI is a field that has a lot of scholarly work on how people interact with technology. That, right now, is the big missing piece. What's happened is that we've forgotten somehow that language is a human thing. It's a human interaction. We've forgotten the human part of it. We now treat language in NLP as if it's just data. It's something abstract. It's something outside the human world. And the minute we do that, there are lots of things that can become a hurdle as we go ahead with this. Humans are messy. We're trying to create systems of silos where we are trying to do away with the mess, right? But then, those things have to go out and work in the messy world.

So, you cannot take the mess out. You have to learn how to deal with the mess. There's not going to be a perfect world where all humans will speak one kind of thing in all contexts, et cetera, et cetera, et cetera. We do messy stuff, and HCI as a field is much more qualified in some sense. They've built this over the years to understand the mess, to kind of have strategies of how to deal with this messiness, and I feel that part is really missing from NLP. We tend to view language in isolation. Language does not exist in isolation. There's no language without humans.

Chris Potts:Right.

Kalika Bali:It does not exist in isolation. And the other thing that people should understand: most of NLP leads to technology, which interfaces with users, right? Not only for what goes into building the technology, how language works, but even how your technology is going to interface with the humans – you need to understand what that interaction entails and how it impacts your technology.

I'll tell you a very funny story. We actually wrote up a paper which is going to be presented at COMPASS. It's at COMPASS in Seattle, end of June, about the six conundrums of doing NLP for social good. We talk about the various things that, as an NLP researcher, you have to deal with and think about, and what are the pros and the cons, what is solvable, what is not solvable, et cetera, et cetera. And the reviewers' response... Almost all the reviewers responded to that paper... The first thing was – because this is an HCI-dominant conference, right? – "I mean, do you not think about these things with language? It seems like a fairly obvious thing. Is it the case that the NLP world doesn't think about it?" I thought that said a lot about how we do things in NLP. They accepted our paper, so they must have seen some value in it!

Chris Potts:Well, that's wonderful that you're being an ambassador like this and letting everyone know how the different fields are thinking about these problems. I think that's so important, yeah.

Aasavari, I have a few questions that I'm hoping to sneak in before we wrap up. Are there any student questions before I do that though?

Aasavari Kakne:We answered all.

Chris Potts:Okay. Kalika, I have a few more personal questions if that's okay. Do you have time? Is that all right?

Kalika Bali:Absolutely, absolutely, yeah.

Chris Potts:One is kind of just for me, since I've never been really in industry myself, I was curious, what is your day-to-day like at MSR India? What's the day-to-day for the research, or is it product development, or something in between?

Kalika Bali:No. So I would say that most of my days are probably not very different from yours, except that I wouldn't be teaching. We are much more like a computer science department in a university than a product development thing. Having said that, we don't build products. In MSR, we don't have the mandate to build products. We build prototypes, we build technology, and some of us work very closely with product groups on specific product-related research.

In the India lab, there are three axes that we always have to think about. One is company impact. So obviously, how can we influence or positively impact Microsoft as a company? But it doesn't mean the here and now, because we have very, very talented people in the product groups thinking of the here and now. The job that MSR worldwide has is more futuristic: to think about what the technology landscape would look like, say, three years, four years, five years, six years, ten years from now, right?

The other is scientific impact. So as you know, we continue to push the envelope on the scientific side of things. And the third – this is very specific for our lab – is social impact. How do we impact society? How can we measure how we impact society?

The ideal, the very vanilla thing, would be that we do some research. We come up with some ideas and some product group within Microsoft takes it and builds on it or it's part of a product. Now that's one pipeline and it does happen, but a lot of times we build things which maybe Microsoft cannot use currently in their products, or it's something that, the view is, would impact the research community more, overall. And that's like much more valuable than putting it in a product or keeping it inside Microsoft, so we open source stuff, make things available, right? We put things out on GitHub for others to use and build on top of.

And sometimes we have cases where we have something really mature, something that we know can actually work very well, but not within the company. It might not have a life. There might not be a product group within the company that will kind of keep it, so we spin it out. We've had cases where we've actually spun out startups from MSR, which then have their own life. So there's something called Digital Green that was spun out of Microsoft Research India, which focuses on agricultural support. There's something called Everwell, which was on tuberculosis monitoring, which got spun out of MSR India. Then Karya, which I was talking about, the crowdsourcing platform, that's been spun out as Karya Inc., as a start-up, as a separate entity.

And sometimes we scale through other partners and collaborators, so we have a center on societal impact through cloud and AI technology. It's called SCAI, and the role of that particular group in the lab is to look for collaborations between MSR research and other organizations outside, to take that technology or that research to scale.

We do things on that end, but, as a researcher, I would probably go in the morning, have our first meeting with one project, which may include full-time researchers plus research fellows, interns, think about where the project is going, what we are supposed to do next, sit and read something, have a meeting with my research fellows and interns, and find out how they're progressing on whatever they're doing. If there's a paper deadline, we all kind of go crazy into it, burning the midnight oil and beyond, trying to finish up. It's a fun time. So on a day-to-day basis, it might not look very different from what your day looks like, Chris.

Chris Potts:It sounds like you have researchers at all levels. Are there pre-doctoral and post-doctoral as well, or what's the range of junior researchers?

Kalika Bali:Yeah. So first we have pre-doc, like I said, the research fellows. We have post-docs, and post-doc again is a two-year term, after which they either go away and join something else or stay with us. There's an interview process. We have people who are one to five years post Ph.D. We have very, very senior people in the lab who've had more than 20, 30 years of research experience. So, yeah. We have a good mix of very young to very experienced.

Chris Potts:For the students in this course, let's see – so, they are entering the final phase of their projects. At this point we're going to try to talk them out of changing their topics, those should be set. They have a lit review under their belts, and they're kind of going to now be working toward executing the experiments and writing up the final paper. For students at that stage, do you have any advice for them? They're probably going to start to experience the highs and the lows of research and then the pains of the write-up. Any thoughts for them?

Kalika Bali:I think a lot of research is about sticking through it, seeing it through, but then you also need to... It's a good balance between knowing when to stop and knowing when to go on, right? So you have to kind of go on to a certain level, but at some point, you have to know how to exit whatever it is that you're doing at that particular time.

The other thing is, if you're not having fun, at least some fun, at some point in the research that you're doing, you're really... I know people think this is a real cliche, and what do you mean by fun, and it varies from person to person, but I really do think that you know you're cut out for research if you're having fun. It's perfectly okay not to be cut out for research, right? It's perfectly fine not to aspire to be a researcher, but you know you're cut out for research if it truly excites you – something about it makes you want to go on and want to do more.

Chris Potts:I love that perspective. I kind of feel like I envy the students in a way because it seems like a magical moment to have entered the field. On the one hand, there are all these chances to be delighted. Delighted that Multilingual BERT works at all, which would've felt like science fiction 15 years ago. Or be delighted that GPT-3 can give you plausible answers to questions and things like that even with no task-specific training.

On the other hand, no meaningful problems are even close to being solved. And so it's kind of like this mix of: you can stare and wonder, and then you can think of all the challenges that you might participate in. And it really is amazing – because things are changing so fast that, the students can participate. They could discover something about prompt tuning or multilingual NLI or something that would be publishable, in addition to feeling enlightened.

Kalika Bali:No, I completely agree. I think it's a very, very exciting time to be in NLP because, on one hand, like you said, there's actually feasible, deployable technology that you can do stuff with. I think when I started – and it might be the same for you, Chris – we had to spend so much time just to get to a level where we could even show something meaningful coming out of it! Yes! So I think the base now is so strong to build on top of, and there's so many exciting things that you can do. I have to say this – this might not be the thing you want to hear – but if I was a student now and I was starting at this time, I would just do so much stuff related to having real-world impact. I would probably become more of a social impact person than a core NLP person.

Chris Potts:That's a wonderful answer and I think really inspiring, now that it's within reach that a course project could lead directly to something that you could have an impact with. And you've pointed out lots of areas where there's the potential for that, so I hope that students are even thinking, as they close out their projects, how they could at least suggest that path into something with real-world impact. Yeah, that's wonderful.

Aasavari Kakne:We have one more question from the students. So there is a lot of merging going on, which also can create real-world impact like you said. What advice would you give to a student that is asking how to merge their existing career with NLP?

Chris Potts:Oh yeah. That's great.

Kalika Bali:So existing career with NLP, is it like... I don't properly understand, so their career would be...

Aasavari Kakne:For example, finance, I suppose.

Kalika Bali:Yeah. I think right now NLP is kind of an engine that powers a lot of different technologies. The possibilities are kind of amazing, right? Whether you want to do it in finance, or even for programming – GPT-3, for example, and how far we are going to go with software engineering. I know people who are working on things like trust and privacy-related things with NLP, powered by NLP technologies. There are things related to DevOps, which people are doing along with NLP. I think a lot of it also is the time to experiment to see what works and what doesn't work. And maybe we'll see a little bit of shedding of weight around these things in the next 10 years, but I do think that right now, the kind of models that have become the core of the technology that is being created are just so applicable across domains and across other CS areas. So, there's lots that can be done.

Chris Potts:And we've touched on so many areas where, if you're fascinated with NLP, you could move into areas that are going to have high impact. Multilingual work – thinking about linguistic diversity, of course, and code-switching, speech technologies – but also thinking about interfaces, the HCI aspect, is a way to have immediate practical impact. And on and on like that. I think it's all wonderful.

Kalika Bali:Yeah.

Chris Potts:Wait, I have one more funny question. By way of closing it out, Kalika, what's your current Wordle streak? I follow you on Twitter and I see your game displays. I thought I might tweet my game this morning at you, which I've never done before, just to connect with you today, but then I think I was a little embarrassed. I don't think I lost, but it was not, like, good. You've got a lot of twos and threes, I've noticed, which is very intimidating.

Kalika Bali:I think twos are purely luck.

Chris Potts:But three might be optimal, so three is like you're playing like an information-theoretic machine.

Kalika Bali:So fours and threes are where I have the biggest counts. I have done 120-something.

Chris Potts:That's your streak? You haven't missed it yet?

Kalika Bali:No, my streak is 70. But the total is 120-something, and I have luckily not had an X yet. I'm just lucky.

Chris Potts:Do you have a story you tell yourself about why this is actually interesting work to be doing, or is it purely a diversion for you?

Kalika Bali:I love words. You know how people say they understand visual representations better? I actually have to read the text under the visual representation to make sense of it. So I don't know if my brain is wired differently, but I love words and I love guessing words. It's a game I play with myself, really. I'm not very competitive, so I like the idea that this is just between me and me.

Chris Potts:That's a wonderful and cheering way to end. Thank you so much, Kalika, for doing this! I thoroughly enjoyed this conversation. Thank you, Aasavari, for all the questions, and also to the students who passed along lots of questions to Aasavari. This was just so great!

Kalika Bali:Thank you so much! Really fun morning for me too, thanks!

Chris Potts:Great!

Aasavari Kakne:Thank you, bye-bye!