Podcast episode: Yulia Tsvetkov

May 16, 2022

With Chris Potts and David Lim

Coast-to-coast professional journeys, multilingual NLP, teaching in a fast-changing field, the history of hate speech detection in NLP, ethics review of NLP research, research on sensitive topics, mentoring researchers, and optimizing for your own passions.

Show notes

Transcript

Chris Potts:All right! Welcome everyone! I am delighted to have Yulia Tsvetkov here to discuss the field of NLP and all the issues surrounding it today. Yulia is one of the most vibrant and exciting researchers on the scene today, in particular, because of the remarkable range of her research, from very technical machine learning concepts, around especially natural language generation, all the way up through the most intricate social problems that we tackle as NLPers, and that the field itself is having to tackle as a result of recent progress and things like that. So this is a wonderful chance for us to get to know Yulia a little bit, and also get her perspective on the field and the kind of meta-issues surrounding it, especially as it relates to issues of, what you might call, ethical NLP.

So, welcome, Yulia! I'm delighted to have you here. I thought I would start with a bit of your biography. I was feeling a little competitive with you. If I have your biography right, your East-to-West-Coast path is CMU to Stanford, where you were a postdoc, proud to say, with Dan Jurafsky, then back to CMU as a faculty member, and then out to the University of Washington as a faculty member again. So that's four cross-country trips. Do I have that right?

Yulia Tsvetkov:Yeah. Two on the East Coast and two on the West Coast.

Chris Potts:Okay. Same for me. I've got, say, the East Coast to California to Massachusetts to Stanford. All right. I was wondering if you might say – you did your Masters and undergrad at the Technion in Israel?

Yulia Tsvetkov:Yeah, I did my undergrad at the Technion and I did my Masters at the University of Haifa with Shuly Wintner, which is in the same city.

Chris Potts:But is that considered West Coast?

Yulia Tsvetkov:In Israel?

Chris Potts:I guess it only has one coast, right?

Yulia Tsvetkov:Yeah. There is only one coast.

Chris Potts:But that's a chance for me to ask – I don't know much about this. Of course, Israel seems to me to be full of lots of NLP research. What's the NLP scene like there? And how is it different for you as a student versus now, and how has your research changed as you've moved from Israel to the US?

Yulia Tsvetkov:Yeah, that's a nice question. To give a spoiler to students, I received some of the questions already by email, so I thought about this. So Israel has a really vibrant NLP community. There are seven major universities and every university has a very strong NLP team. That's actually not very common in other places, in other countries. And there is a big range of topics for which there is a top NLP researcher on a team in Israel. For me, it was, first, the exposure to all of this. So Israel has a yearly conference. Even before I published papers, before I learned what NLP is, having the opportunity to attend the conference and to just listen to research and to meet all those people was very inspiring to me.

And another thing. I grew up in Ukraine, speaking Russian and Ukrainian, and then I spent my teen years growing up in Israel. So I spent part of my life speaking Russian and other Slavic languages, and a third of my life speaking Semitic languages, and what that gave me is an appreciation of how rich languages are.

When we learn NLP as a very English-centric thing, it's actually very wrong. By doing my NLP introduction in Israel, I learned how to work with Hebrew, and then Russian, and then other languages. It gave me an appreciation of the difficulties of processing different languages.

Chris Potts:Oh, wait. So I have to ask, not to do that thing, because I'm a linguist, so I'm not supposed to ask this because I'm not supposed to be fascinated, but I have to ask – so how many languages do you speak? It must be at least four: Russian, Ukrainian, Hebrew, English.

Yulia Tsvetkov:It's Russian, Ukrainian, Hebrew, and English. That's it. Yeah.

Chris Potts:Well, it's at least three more than I speak fluently, so I'm impressed. I have to be impressed.

Did you know about NLP when you were in Ukraine and Russia?

Yulia Tsvetkov:No, not at all. No.

Chris Potts:Do you remember your path into it when you were in Haifa?

Yulia Tsvetkov:The path is not very exemplary for students who are thinking about research. I graduated from the Technion and I never took an undergrad course in NLP. And then I went to work in industry. After a couple of years, I found it just boring. It wasn't fulfilling. So I went back for the Masters and then I met my Masters advisor, Shuly, who I still collaborate with, and he's so passionate about languages and about the field, and I learned so much. I really couldn't not become passionate about it after meeting him. So it was only after my Masters – which I came to not thinking about an academic job in the future or even doing a PhD, just because I wasn't happy in my industry positions – that I learned about NLP and got hooked.

Chris Potts:So you were hooked and then you did the Masters and then you did go directly to the PhD at CMU?

Yulia Tsvetkov:Yeah. I did my PhD at CMU. I got accepted to the Masters at CMU, and at LTI there is a Masters program that transitions into the PhD. So it's basically a PhD program with an exit point after two years with a Masters. And I did my PhD at CMU.

Chris Potts:I love it. So just when you thought you were out, they pulled you back in for the PhD.

Yulia Tsvetkov:And the same happened with faculty jobs. Just when I thought I was out of there with my PhD, I got an offer to join the faculty at CMU.

Chris Potts:Oh! So when you were at Stanford, did you already have the offer to go back?

Yulia Tsvetkov:Yeah.

Chris Potts:Oh, wow.

Yulia Tsvetkov:Basically, I interviewed and got the offer from CMU before actually graduating and before having a Stanford offer from Dan. Getting an offer from Dan was harder than getting a job offer at CMU. Dan is hard to reach!

Chris Potts:I don't know! I think we should be suitably impressed by the fact that they hired you right away. And I'm glad you were able to leave for a little bit!

Was coming to Stanford a noteworthy change in your trajectory as a researcher – the topics you picked and so forth?

Yulia Tsvetkov:Yes, it was amazing for me. It really shaped my interests very nicely. My PhD topic was on multilingual NLP, which I am still very, very interested in. But moving to Stanford, I thought I wanted to learn something completely new. And I started learning about computational social science. And then, through learning about computational social science, I learned specifically about gender-related issues and then issues of gender bias, and then got interested in the field of computational ethics. It gave me a new and interesting direction of research, which is now one of my major directions, but it also helped me shape my whole research agenda – why I'm interested in multilingual NLP, why I'm interested in ethics and NLP. So just building NLP approaches to make them more accessible to disadvantaged communities. It all kind of fits well together. It was all thanks to, I think, this change of moving to Stanford, working with the Stanford group, working with Dan, learning so much from him. It was very fruitful. The postdoc was short, just 11 months, but I really, really appreciated it.

David Lim:I want to jump in here because I think there's a student question that's in line with this. I'm personally really interested in your bias work. I also used to work with Dan Jurafsky and I definitely read your papers maybe right after you were a postdoc there, but there's a student – before moving on to that – who was curious about, like, what do you think the differences are between languages for NLP tasks? And do you think some languages consistently perform better for current language models? So I thought it would be a good time to talk a little bit about your multilingual work here.

Yulia Tsvetkov:I didn't get exactly the question. Could you repeat?

David Lim:So what are the differences between languages for NLP tasks? And do you think some languages are consistently performing better than others for current language models?

Yulia Tsvetkov:Oh, I see.

David Lim:And I assume that also is in line with your work on under-resourced languages, etc. Yeah.

Yulia Tsvetkov:Yeah. This question already leads to an answer: with the growth of very large language models that are really data hungry, they will perform better on languages for which we have huge amounts of data, and these are languages of already more privileged communities in many ways.

With the growth of language models, we see that many, many existing benchmarks really focus on English. Of course, there are many benchmarks that are massively multilingual benchmarks, but when we report the results on average, we discard the fact that we have very, very high accuracy for languages for which we have huge amounts of data, like English and maybe also French. And then, there is a very long tail of languages for which we don't have good tools and don't have good performance for machine translation. And they don't benefit often from the major developments that we see in NLP research, especially in the last couple years.

Chris Potts:Can I ask a related question, Yulia? How might the field of NLP look different if we were all obsessively working on, say, Russian and we were building benchmarks for Russian? It has a lot of speakers and in a counterfactual world, it seems like it could have been the language of focus. How would we prioritize tasks differently and things like that?

Yulia Tsvetkov:Yeah. I don't know – if we had prioritized any one specific language, we would have gotten to the same result, just for that language. I feel like the major change that we might think about making is not to focus only on accuracy-centered metrics, but also on fairness-centered metrics across languages. If we focused on fairness as well as accuracy metrics, then we would see that there are a couple thousand languages just in Africa for which we don't have any tools. I think our research would be more balanced if we just changed the evaluation approaches, rather than focusing only on benchmarks that are easy to construct and easy to beat.
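As a concrete illustration of the difference between accuracy-centered and fairness-centered evaluation, here is a minimal sketch (the languages and per-language scores are invented purely for illustration): reporting only the average hides the gap between high- and low-resource languages, while worst-case and gap metrics surface it.

```python
# Hypothetical per-language accuracies for one multilingual benchmark.
# The numbers are made up purely for illustration.
accuracy_by_language = {
    "English": 0.92,
    "French": 0.89,
    "Swahili": 0.61,
    "Quechua": 0.48,
}

scores = list(accuracy_by_language.values())

average = sum(scores) / len(scores)   # the usual headline number
worst_case = min(scores)              # fairness-oriented: how bad is the worst-served language?
gap = max(scores) - min(scores)       # disparity between best- and worst-served languages

print(f"average accuracy:    {average:.2f}")
print(f"worst-case accuracy: {worst_case:.2f}")
print(f"best-worst gap:      {gap:.2f}")
```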

Chris Potts:I was thinking also that, just for an example, if Russian were our language of focus, maybe the field of morphological parsing would be huge.

Yulia Tsvetkov:Oh yeah.

Chris Potts:And if we focused on so-called Arabic, maybe robustness to dialect variation would be the thing we were obsessed with, or if we focused on languages that don't have such a standardized way of writing the language, then like robustness to spelling errors would be the thing that we all worked on. Whereas since it's English, we kind of split on the white space and assume standard spelling and so on.

Yulia Tsvetkov:Yeah, that's an amazing answer. So basically, in many ways, English is a typological outlier among the languages that are out there in the world – languages with rich morphology, languages with just very different syntax. So I think NLP would be richer if we just had more languages to focus on, not necessarily Russian, but at least languages from different language families.

Chris Potts:Agreed.

David Lim:Related to that, a student asked if languages with non-Latin orthographies are at a disadvantage, which I assume would probably make sense.

Yulia Tsvetkov:I think there are resource rich languages that are not Romanized. Chinese, for example, Mandarin Chinese, is considered a resource-rich language for many tasks, not for all of them – maybe not for dialogue – but for machine translation, for example. So I think orthography is less of an issue. More of an issue probably is where the funding comes from. There is a critical mass of researchers in the United States that just get the funding from, I don't know, DARPA and NSF and companies, which do prioritize projects on very specific socio-politically important languages.

Chris Potts:Do you feel, Yulia, like things are on a good trajectory for all this though? It seems like multilingual work is more valued than it was 10 years ago, in part, I think because people can do things with multilingual embeddings, but whatever the reason it seems like there's more interest in having systems that are robust to more languages. Do you agree with that or is there more to it?

Yulia Tsvetkov:I agree with it. Yeah. In part, it is because many tasks are getting saturated for English, but are still unsolved in other languages. And also, I think the existing multilingual models, where we have a single computational analyzer for multiple languages, actually help advance performance on lower-resource languages.

Overall, for every future question, I am overall optimistic, and I think that today we are better off than we were previously.

Chris Potts:We just need to declare English "solved" and tell people they should work elsewhere. Is that right?

Yulia Tsvetkov:Well, I mean... [laughing] Probably not, but...

Chris Potts:On this topic, but I think shifting more into the realm of the ethical stuff, I was poking around at the websites for your courses, which look incredibly varied. And in particular, I was just kind of wondering how this "Algorithms for NLP" course, which looks like the core nuts and bolts of the field, might be changing in practice or in your planning as we move into this era of more awareness of ethical issues.

Yulia Tsvetkov:Yeah. That's a great question. I mean, this is a difficult course to teach, I think. Since I started my faculty job in 2017 – basically, even in my interview in 2016, there was the question: we are teaching NLP, but now we are seeing that neural networks are here to stay, and it was not clear whether we needed to change anything in how we teach NLP. Since I joined, there has been a constant question of what to remove and what to add, because many things in the field are just getting outdated. So I wonder: should I still teach HMMs to my students? Should I still teach Kneser–Ney smoothing to my students? Removing and removing, and adding more balanced content, trying to keep the algorithms that I think are more generally important. It's going through a constant change. And ethics is also now added to the course. This is something that never existed, I think, before 2017.

Chris Potts:Here's an example that might resonate with you that's been on my mind. So the first unit for this course is on static word vectors. It's a good foundational issue that sets us up for neural networks and everything that comes after. And the material hasn't changed in a long time, because it's kind of like, here are the properties of cosine, here's how GloVe works, here's what word2vec is, here's how to visualize these things, here's what they're useful for. I'm kind of feeling like one of the nuts and bolts now should be: how would you diagnose the biases in these word vectors, as a foundational question that you should ask? Because if you proceed into the world without ever having thought about that, it's almost as bad as if you never thought about the difference between cosine and Euclidean distance for the comparisons.
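A minimal sketch of the kind of diagnostic Chris has in mind here – a WEAT-style association test that compares how strongly two target word sets associate with two attribute word sets under cosine similarity. The `vectors` dict and the tiny word lists are illustrative stand-ins, not the actual course material or a validated test set; `vectors` is assumed to map words to numpy arrays loaded from, say, GloVe or word2vec.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_association(word, attribute_words, vectors):
    """Average cosine similarity of `word` to a set of attribute words."""
    return np.mean([cosine(vectors[word], vectors[a]) for a in attribute_words])

def bias_score(targets_a, targets_b, attrs_x, attrs_y, vectors):
    """
    WEAT-style differential association: positive values mean targets_a
    lean toward attrs_x and targets_b toward attrs_y; values near zero
    mean little measurable difference in these vectors.
    """
    def s(w):
        return mean_association(w, attrs_x, vectors) - mean_association(w, attrs_y, vectors)
    return np.mean([s(w) for w in targets_a]) - np.mean([s(w) for w in targets_b])

# Hypothetical usage, assuming `vectors` maps words to numpy arrays:
# score = bias_score(
#     targets_a=["he", "man"], targets_b=["she", "woman"],
#     attrs_x=["career", "salary"], attrs_y=["home", "family"],
#     vectors=vectors,
# )
```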

Yulia Tsvetkov:Yeah. That's a great point. Yeah. There is a question of whether to incorporate ethics throughout introductory NLP courses or to have it as a separate course, because there are so many issues. NLP itself is very, very broad. So it is very difficult to figure out how to balance what to keep and what to remove and still give a good overview of the whole field. And then, if we add ethical issues to each topic, it's not clear how to focus. So I'm teaching several lectures at the end, focusing on ethics, rather than putting it in between. And I don't know if it's the right approach. Overall, maybe a whole course on ethics is better than just a couple of lectures in an NLP course.

Chris Potts:Interesting.

Yulia Tsvetkov:It's hard. I know that my colleague, Noah [Smith], he has a homework assignment focusing on, for example, anonymization and obfuscation of gender. So there are ways to incorporate ethics into NLP courses, but I'm not really sure whether to keep it as an integral part throughout the course or just consider it as a separate course.

Chris Potts:It's tricky because I'm sure it deserves its own course. On the other hand, I would worry that people would self-select into taking that course, and that's probably the same set of people who worry about those things anyway.

Yulia Tsvetkov:That's what I'm seeing. Yeah. The people who take the ethics course, those are only people who are interested in ethics. And many people don't get exposed to it at all. Yeah.

Chris Potts:My thinking about the word vector thing and bias would be that I would introduce that as part of that unit, in part, to just convey that there is technical innovation that has to happen here, that's going to address an issue that's just as fundamental as the kind of core semantic things we address, and kind of get everyone on board thinking about these questions as all of a piece, as opposed to separate things that you ask in a separate phase.

Yulia Tsvetkov:Makes sense.

Chris Potts:But I haven't figured out how to do it yet!

Yulia Tsvetkov:It makes sense. Everything that is related to sentiment – not only bias in embeddings, but sentiment analysis topics in general – is very, very closely aligned with ethics-related topics. I can see what you're saying – that it's just an integral, important, fundamental part that we were not aware of a few years ago.

Chris Potts:Oh yeah. This is a chance for me to ask. So you mentioned sentiment. Kind of adjacent to that is the task – that's kind of old by now – of hate speech detection. And this seems like a really interesting kind of cautionary tale for the field. What's your story of the story of hate speech detection in NLP?

Yulia Tsvetkov:Yeah, I guess there are several directions in relationship to hate speech detection that I'm thinking about. One is that hate speech detection is an example of a task that is really focusing on an extreme of the phenomenon of toxic language.

For a long time, our field didn't realize that the decision is not binary. There is a continuum of types of toxic or abusive language. It doesn't have to be extreme hate speech with derogatory terms; there are also other types of biased language – microaggressions, condescension, gender bias, objectification, all kinds of negative or toxic interactions – which the topic of hate speech doesn't cover.

Another aspect is thinking about the task of hate speech as a people-centric task. And I think this is something that recent work identified very well, with the paper on racial bias in detecting hate speech. I don't know if the class read it or is familiar with it, Chris?

Chris Potts:We didn't read it as part of this class, but I'll have really good show notes for this episode and I'll link to all the papers you mention.

Yulia Tsvetkov:Oh, cool! Hate speech is an example of a task for which looking at language alone is not enough. This is a task which is really people-centered. It's about who says what to whom. It's about people. And then, this is a task in which we really need to understand who our annotators of the examples of hate speech are. What is the social context? What is it that you see as toxic or not in this specific context? And do our tools incorporate all these considerations, to avoid the kinds of misuses of hate speech classifiers?

Chris Potts:Yeah. That really resonates with me because my own understanding is: the field recognizes that hate speech is ruining online life for people.

Yulia Tsvetkov:Yeah.

Chris Potts:NLPers think, "Hey, we can help. This is what we do, right? First step, we need some data sets that have labels. So let's do what we always do and crowdsource some labels for individual tweets, say."

Yulia Tsvetkov:And this is problematic.

Chris Potts:Yeah. Well, take over from there. So why is that already problematic?

Yulia Tsvetkov:It's problematic because... Where do I start? There are so many ways it's problematic! So first of all, the annotation – I don't even know where to start.

The annotation happens on the sentence level. But the same sentence can be toxic or nontoxic, hateful or non-hateful, depending on the relationships between the speakers.

Second, the annotators: how do we know what kinds of biases the annotators on Mechanical Turk have? The annotations themselves incorporate the biases of the annotators.

Then, also, the researchers who decide what is toxic and what is not, they incorporate their own biases for how to define the task.

So the whole idea of having hate speech as a binary task, whether something is hateful or not, is problematic.

Third, the source where the data is coming from. There is the issue that, actually, the real examples of hate speech are removed by companies. So it's not easy to access true examples of hate speech online, because this is something that can be very heavily moderated. There are some data sets of hate speech that researchers are working with, but we don't actually know how representative they are of the actual toxic content online.

Another thing – in yet another paper that we wrote recently, we found that what is toxic and what is not can really be a function of the community in which people interact. For example, we had this nice example in the paper: "Linux time." It is actually toxic and violates community norms if the community is focused on Microsoft.

Chris Potts:Okay! [Laughing]

Yulia Tsvetkov:The whole issue of context in hate speech is really, really important. I think it's very true for any sentiment analysis related task, right? Chris, you're a world expert in sentiment analysis. Probably, it's the same. We are trying to work with a deeply pragmatic complex task in a very shallow way of labeling by Mechanical Turk annotators and then building surface-level classifiers.

Chris Potts:Yeah. For sentiment, my answer is always, let's bring in more context. And it sounds like that could be part of a solution in the area of hate speech, where it sounds like what you're saying is that we need annotators who know what's happening in the context and are aware of who's involved and what their goals and relationships are, and at that point they might be able to annotate successfully. And then, our systems need to also bring in all of the context that the annotators had access to.

Yulia Tsvetkov:Yeah. But there are other issues now too. First of all, the context that annotators need is probably context not only of the verbal interaction, but also who the people are. And this leads to issues of privacy that it's not clear how to address – how to annotate it properly while protecting the privacy of the speakers.

And also, the annotators themselves need to be diverse, because what is toxic in one culture might not be toxic in another culture, so we need also to understand the background of the annotators. It's the issue of privacy again. But also we don't often have access to diverse annotators.

Chris Potts:Right.

David Lim:A student question here: do you think there would ever be an objective metric for classifying hate speech? And, similarly, what are linguistic tasks that have more precise metrics in comparison to this?

Yulia Tsvetkov:Yeah, this is a very clever question – the student has put a finger on the key problem, which is that many of these things are very subjective. I don't know if there will ever be objective metrics. It can only be objective for a very narrowly defined problem. If we manage to define the problem of hate speech narrowly enough – for a specific community, for specific types of speech, for a specific phenomenon that we model – we might be able to detect it more objectively. But then, the usability of the tool might be very, very limited.

There is a reason why the problem of hate speech is not properly solved yet, and probably will not be solved perfectly in the near future. It's a very, very complex problem of deep meaning understanding that we don't yet know how to solve for any other kind of problem in NLP either – even for question answering. We don't really model deep pragmatic meaning well.

Chris Potts:But so, Yulia, is there something that an NLPer can do to contribute positively to a topic like hate speech in 2022?

Yulia Tsvetkov:Yeah. I think, overall, understanding the diversity of problems and actually introducing new problems, maybe new benchmarks, is probably a good idea.

I think that's actually true for any NLP problem direction today – that what we are focusing on is more subtle variations of the problems that we were focusing on before. We are interested in more subtle types of meaning. In machine translation, we are interested in more subtle problems of translation. And also in hate speech, just figuring out that, like now – it's not simply hate speech – there are many types of hate speech, many problems associated with hate speech, and trying to work on a variety of problems rather than on some major kind of most popular flavors of the problem. I think it's a practical direction right now. I'm curious. What do you think also, Chris?

Chris Potts:I'm happy to share! But, David, is there another student question before I do that?

David Lim:I was going to ask the question for myself, whether you thought there were things that are hate speech in every context, and then there's things that aren't. Is there a bifurcation there? And then the second thing is, I saw recently a news article and also some kids on TikTok talking about how they're using slang or particularly new linguistic constructs to avoid censorship filters. Some of which are for hate speech, some of which are not. I was just kind of curious about your thoughts on those two things.

Yulia Tsvetkov:Yeah. Can you repeat the first question? Binary?

David Lim:The first one is if you think that there are things that are hate speech in almost every single context versus things that are much more contextual. What percentage of the problem is pretty clear-cut versus all these more subtle nuances?

Yulia Tsvetkov:I think, like in every issue with language, there is a large variety. Even a sentence that is clearly hateful can actually be benign if it's said in a specific context between friends and with a specific type of communication. I don't know if that answers the first question.

Chris Potts:Yeah, I really have to agree with Yulia that, I think the problem is that there are no such phrases where you can just absolutely as a law of the universe say that is hate speech, and the assumption that you could do that is part of the reason why these early approaches kind of went off the rails. In an instructive way – I think we learned, so it was very productive that it happened, in a way, because it reflected back to all of us that simple assumptions were just not going to cut it, even for things that, if I just showed you some examples, you would say, "Okay, that's unambiguous."

The other part is just that even if, for the individual texts, you feel it's unambiguous, we're talking about deploying a model that's going to have some error, and the errors could be biased in some direction. And so, even if you get all your cases right, you might be systematically disadvantaging another group.

David Lim:Yeah. Right.

Chris Potts:And then, the other part of David's question is kind of interesting, which is: what about trying to get past certain filters, say, in the interest of freedom of expression?

Yulia Tsvetkov:Yeah. It is a very, very common thing. When we learn about hate speech, actually the intentional – the more malicious type and more planned type of hate speech – is always using code words and neologisms and all kinds of creative spellings to avoid the filters.

David Lim:Yeah. People are using like Cyrillic letters a lot.

Yulia Tsvetkov:Cyrillic letters, numbers, and even code words that are known within the community, but less understood by classifiers or even by annotators. When we look at the core and most malicious types of hate speech, they have the most of this kind of language, and that's why it's actually very hard.

David Lim:Yeah. It sounds like it's a constantly evolving problem versus one you can just solve with a fixed data set.

Yulia Tsvetkov:Yeah. Definitely.

David Lim:Yeah.

Yulia Tsvetkov:Yeah. Actually, in one of my lectures on hate speech, I show that there is a code book specifically for that community. They have instructions on how to troll properly. There was a Huffington Post article that found this code book and published it. And it shows that you don't want to be too extreme, because then you will be banned very quickly. So you need to be kind of slow in the type of aggressions that you use, and use specific code words. Everything you mentioned is actually in the instructions on how to do trolling properly.

David Lim:Yeah.

Chris Potts:Wow. Hey, can I ask, Yulia, a related question, since this is really centered around data resources as a key ingredient. I was really inspired by your paper from a few years ago on microaggressions for lots of reasons, but in part, because it takes a very interesting approach to the annotators themselves and the biases that they're inevitably going to bring to the problem. Can you say a bit about that idea and what it led to?

Yulia Tsvetkov:Yeah. I'm really grateful for your question, Chris, because that paper had a lot of different types of content, and a typology. Now, I realize this was the central idea, and the typology distracted from the actual interesting idea.

Yeah. So, we came to this idea just through looking at data and trying to understand what dimension we were trying to capture. We'd been seeing a lot of disagreement among annotators on the comments from microaggressions.com. This is a database where people self-reported experiences of microaggressions. By looking at a lot of examples, we just realized that, if it's explicit hate speech – really unambiguously toxic, hateful messages – or if it's a very, very friendly message, and we have multiple annotators, there will be high agreement between the annotators on whether it's toxic or not.

But when there is a comment that is slightly biased or contains a subtle microaggression, then people will incorporate their own biases while annotating the comment. So we will see high disagreement – their annotations will be all over the place. So we decided to leverage the annotation disagreement as an indicator for comments that deserve further attention.

In the follow-up work, actually, in the paper on detecting gender bias, we again saw that we cannot address the problem of detecting biased comments by creating a labeled data set. We just cannot train a supervised classifier on labels of whether something is biased or not, but we can try to leverage annotator biases – again, human biases – to try to detect whether a sentence has some sort of gender-based microaggression.
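A minimal sketch of the disagreement signal Yulia describes, assuming each comment has labels from several annotators: items where the label distribution has high entropy (annotators split) are flagged for further attention, rather than being resolved by majority vote. The comment IDs, labels, and threshold below are invented for illustration.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the annotators' label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical annotations: 1 = "contains a microaggression", 0 = "benign".
annotations = {
    "comment_1": [0, 0, 0, 0, 0],   # clearly benign: everyone agrees
    "comment_2": [1, 1, 1, 1, 1],   # clearly toxic: everyone agrees
    "comment_3": [1, 0, 1, 0, 1],   # subtle case: annotators split
}

# Flag high-disagreement comments instead of trusting a majority vote.
threshold = 0.8  # bits; chosen arbitrarily for illustration
flagged = [cid for cid, labels in annotations.items()
           if label_entropy(labels) > threshold]

print(flagged)  # ['comment_3']
```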

Chris Potts:Because the kind of core insight there is that, for say, a gender-based microaggression, the female annotators were more likely to detect it than the male ones, by self-identification of the annotator.

Yulia Tsvetkov:Exactly. Yeah.

Chris Potts:Yeah. That's fascinating. And it's an interesting way of thinking about uncovering these more subtle things without pretending that somehow we could train our annotators to be unbiased, because they're people and they have their own perspective and everything, and that might be intrinsic to the problem. That's what I found so lovely about that work.

Yulia Tsvetkov:Thank you. Yeah. And in general, we don't always want to detect biased comments in order to filter them. They're not always intentionally toxic. Very often comments are biased even though the speakers think they sound positive, like a compliment or reinforcement, just because the speakers are not aware of their own biases. So one value of these tools would actually be to highlight such comments, to help people who want to be more aware of their own biases.

Chris Potts:I have a bunch of questions that I want to ask you that are about ethics reviewing. This is kind of about how we as a field are dealing with ethical challenges. And one question that's really been on my mind for you is very specific, but it might resonate with you because we've both done ethics reviewing for a large conference, right?

So here's a worrisome dynamic that I saw. For papers that are on socially relevant topics, like microaggressions, they are almost certain to get ethics scrutiny just because of the topic they chose and the kind of reviewers that self-select into reviewing papers on those topics – they're also very engaged with ethical issues.

Whereas if you submit a paper that is basically like "NLP for the war machine," those reviewers are not really interested in ethics review, they don't flag it, and that paper might sail through without any scrutiny.

So, you have a double-edged sword here, which is, on the one hand, if you just want to write about a socially relevant topic, you get this extra layer of hassles. And on the other hand, because of the nature of the system, for certain topics that deserve this kind of scrutiny, they might just not get them.

Yulia Tsvetkov:Exactly. Yeah. This is a very good point. If you remember, we wrote a joint proposal, Chris?

Chris Potts:I remember well! Yes!

Yulia Tsvetkov:It never got funded, because the reviewers were very, very emotional about it, and it got a lot of scrutiny, and we're seeing this research being done more recently. Now, it would probably be easier. I cannot agree more.

I have been an ethics reviewer and a senior area chair, and I'm on the ethics exec now. And I feel like, through all this, the biggest problem is that we are not really trained as a community. There is no training in how to evaluate ethics in a paper in a balanced way. So we are seeing a lot of false positives and a lot of false negatives. People just don't know whether a paper should be flagged for ethics review.

And then, there is the problem that we don't have enough qualified ethics reviewers, just readers of papers, as you mentioned. Whether it's NLP for the war machine or any other kind of paper, we cannot trust technical reviewers to accurately flag papers for ethics review.

Chris Potts:Yeah.

Yulia Tsvetkov:People who try to work on ethics receive the most scrutiny. And for people who unintentionally or carelessly have ethical issues in their papers, it's easy to miss, because we also don't have enough qualified reviewers.

Chris Potts:I hear you. Yeah. But wait, here's another thing to worry about. If we separate out scientific reviewing from ethics reviewing, then you get a very common dynamic of scientific reviewers saying things like the following: "This paper depends on a false assumption about how gender works. The results are interesting, but it needs ethics review." And they would sort of abdicate the responsibility. They had seen a problem with the core scientific idea, but they thought it was an ethical issue, and so they said the paper is sound scientifically, but it needs ethics review, when what they should have said was: from its very core, this paper is mistaken because it is making a mistaken assumption about people. And on those grounds, which are scientific grounds, it should be rejected.

Yulia Tsvetkov:Exactly. We're so unprepared to handle ethics in the papers that we don't even know how to separate what is part of technical review and what is part of ethics review.

Chris Potts:And they might not often be separable, right? Because they could be intrinsically linked, like in my case.

Yulia Tsvetkov:Yes. Yes.

Chris Potts:So what should we do? Get rid of the ethics reviewing entirely?

Yulia Tsvetkov:No! My takeaway is: I added a new homework assignment in my course on ethics, actually on ethics review. I realize that it's part of training. Every iteration, I prepare, I don't know, 30 or 40 new ethics reviewers. And over time, I hope there will be ethics review assignments in an ethics class that is taught as part of the curriculum, so people just get training in how to think about these issues.

Chris Potts:I fully agree. Yeah. It feels like, it's something we have to work on and get right. And I hope we do work through these difficult questions, as opposed to assuming that the simple systems we have now will suffice.

For your class – I missed the details – do you recruit the students as ethics reviewers of each other's work, or do you just try to train them a little bit?

Yulia Tsvetkov:I look at NeurIPS papers. NeurIPS has open reviewing, where the papers are anonymized but the reviews are on OpenReview. In my class on computational ethics, one of the homework assignments is to actually write an ethics review. The students need to read the instructions on how to do ethics review, then write a review, and then we give them feedback about their review. And we choose papers which are accepted, so they're technically proper, but which were flagged for ethics reviewing. We take the final paper, after it addresses the ethical issues, and we just kind of hide the parts on ethical considerations, and we ask students to work out what the potential ethical considerations are that the authors should have talked about, and how to review it properly.

Chris Potts:Oh, that's outstanding. Based on my first insight, you should also randomly sample papers that weren't flagged and have your students determine: should these have been flagged? Because I bet a high percentage are actually indistinguishable from the ones that were flagged, or worse, based on topic.

Yulia Tsvetkov:That's a good point. The students in the course are actually very, very good. They scrutinize papers so much. I wish the actual ethics reviewers were so thorough. And the hard part of it is that I decided: if I let students scrutinize other people's work, I should also give them my own papers. I anonymize my own papers and give them to the students for ethics review. It's not an easy process to listen to the feedback!

Chris Potts:For the papers that students write in your courses in general, do you have them have an ethics section now?

Yulia Tsvetkov:Yeah, absolutely. Yeah – I remind them that they're doing a course on ethics. I currently teach an NLP course in which students don't write papers. This is an undergrad course, and it's the first time at UW that I'm teaching it. Before that, I was teaching the graduate version, but it was homework based. For the ethics course, I remind them that they have to have an ethical considerations section in their course project, if they are working on such a topic and studying it.

Chris Potts:So this course that we're in, so to speak, right now, is not an ethics course, per se. It's just Natural Language Understanding. But I do, as I said before, want to feed these things into the curriculum. What I did this year is have a required section of the final paper that asks students to disclose, to a well-intentioned user of their ideas, some risks that might be inherent in the ideas. So a very specific sort of question, that's not trying to fend off bad actors, but rather help someone who means to do the right thing, but might not understand the limits of the ideas.

The reason I was so specific is that I've noticed that some of these ethics sections are kind of very, blandly like, "Well, we ran some machines for a while and that had an environmental cost and any classification decision might have errors that could harm people," and they kind of take their job to be done, having said those things. And I was hoping, with specificity, it would lead to something more meaningful.

Yulia Tsvetkov:That's a great comment. We actually have a lot of such discussions as part of the ethics exec group, when we are trying to define what the right ethical guidelines are, because they change from conference to conference. So how do we unify them? And we find that one big problem is that there is a requirement to write an ethics section, and there is something that is called a checklist. And a checklist lets people treat the ethical considerations section bureaucratically, and kind of reduces the understanding of why we are writing this section. It shouldn't be just a template or a checklist. It has to be more thoughtful, thinking specifically about unintended users rather than mechanical things. So there is an issue with checklists overall that...

Chris Potts:I think I didn't realize until now that you had risen so high in the ranks of ethics reviewing that you're now empaneled as the ethics reviewer across everything. Do I have that, right?

Yulia Tsvetkov:I don't know.

Chris Potts:For the ACL, you're on the exec committee that's over all the conferences?

Yulia Tsvetkov:I'm a co-chair of the exec committee on ethics, and we are trying to figure it out. We run the survey on what people think about ethics and how to deal with it. And we are trying to unify guidelines for ethics reviewing. And there are so many. I hear you when you're saying that it's complex because very often, just among three of us – just among chairs, not even co-chairs where there are 10 of us – we cannot agree on issues like what to do with the ethics of arXiv reviewing, what to do with checklists, what to do with cross-cultural considerations of what's ethical, whether ethics is kind of different across cultures, or ethical frameworks that are right, because with different ethical frameworks, we can make different decisions about how to run the process of reviewing. It's very difficult.

Chris Potts:Yeah. One more question along these lines and then I'll go to David. Is there a workshop or conference that you feel has kind of gotten it right in terms of what reviewers do and what authors do around ethics?

Yulia Tsvetkov:I think over time we are getting better and better. So starting this year, I think, things were much, much better than two, three years ago, because every conference just adopts our previous experiences and tries to improve on them. So NeurIPS this year, and last year also – we read the report that you wrote about issues with ethics reviewing, and they integrated it into the next conference. Overall, I think things are much better than two, three years ago.

Chris Potts:Oh, that's wonderful. Because as long as we are learning lessons of the past, as opposed to just repeating what the previous iteration did, I feel like the community will evolve some norms that we can live with. That's great.

Yulia Tsvetkov:I also think so. Yeah. Overall, the trajectory is positive. It's just very difficult to agree on things. [Laughing]

Chris Potts:[Laughing]

David, did you have some questions from students?

David Lim:Yeah. Regarding your students doing ethics reviews of papers, a student asked whether there was anything that the original paper did not address that they caught. I'm assuming either on your own paper, blinded, or someone else's.

Yulia Tsvetkov:Yeah. There were things.

David Lim:Interesting. Yeah.

Yulia Tsvetkov:Yeah. I don't remember the exact specifics, but there were things. After they wrote their own review, the students had access to the three ethics reviews on OpenReview at NeurIPS, and to our comments and the author comments that addressed the reviews. And after all this, they still found problems that the authors didn't address. I feel like, very often, there was a new question that we grappled with after reading the student reviews.

We need to be balanced in a way: even with ethics reviewing of every NLP paper, every paper that is related to ethics, it will never be perfect. If it were easy to address things in a perfect way, it probably shouldn't be part of research – it would already go to industry, to engineering. It is okay to have things that are not entirely perfect.

Chris Potts:I think that's a wonderful sentiment because also, unless we accept that as a community, then we're not going to learn. Part of this should be that you could make a mistake, the field could figure out that it was a mistake, and then we can push forward. And that's why when I told the story of hate speech, I was trying not to be too judgmental because it started from a great place, people did standard things for our field, and I think it helped us realize the complexity of the problem and helped us inch toward solutions. And I think ethics reviewing could be part of that, as ethics reviewers become experienced and expose problems that we would've overlooked 15 years ago.

Yulia Tsvetkov:Exactly.

Chris Potts:But then it shouldn't be punitive on the researchers that they made that particular mistake or that disclosure or something.

Yulia Tsvetkov:Yeah. I also feel so. Recently, there are many papers, position papers, on how the field treats bias wrongly or how we miss issues with large language models. But those critique papers are only possible because someone before that was actually able to develop some approaches and to make some steps.

Chris Potts:Exactly! Yeah! And if it was the normal course of technical ML research, we would just say, "Oh, you know, we're learning and getting better." But since it has this social component, it comes with this aspect of ethics that might make people feel like they did something bad as people when in fact we're just trying to learn. And I think by and large, we're talking about well-intentioned people.

Yulia Tsvetkov:Exactly. Yeah. Nobody wants to write a paper with horrible ethical implications that they didn't think about.

Chris Potts:Maybe the authors of the troll manual that you mentioned, but I would hope that our system would catch such a submission.

Yulia Tsvetkov:Exactly! Yeah! I hope so!

Chris Potts:Can I ask another concrete question about this that's really on my mind as well. So, for your group, you work on a lot of these socially sensitive topics. You're also interested in open science. How do you all think about releasing datasets that are on sensitive topics, like microaggressions or hate speech?

Yulia Tsvetkov:Yeah, it is a difficult question because I really, really want all of my research to be open, but some of our projects just cannot be released. When things are ambiguous, we are trying to follow the terms of service, with Twitter, for example.

If things are more sensitive – for example, we have a really interesting, sensitive project on decision support for child welfare applications – in those cases, we don't release anything, and we don't even put examples from our data in the paper.

In other cases, such as our recent project on propaganda and misinformation in a recent Russian–Ukrainian corpus, we might even, a little bit, violate terms of service and release some scraped news or data from government outlets in Russia, to facilitate and encourage follow-up research on detecting propaganda and misinformation. So I'm trying to do everything that is legal, but also to apply some common sense to what can be released and what cannot be released.

Chris Potts:Well, the terms of service thing. So, this might be one of the things you argue about on your exec committee, but I feel like that should be treated separately, because I believe that you could, in violating terms of service, do something that was actually ethical, because that's a legal framework that might be disadvantaging certain users. I mean, who knows what's in that fine print and whether or not we would regard it as ethical. And so putting that as a kind of overarching thing on ethics review is a mistake, and I suspect that the example you just gave is a potential instance of why I'm concerned about that.

Yulia Tsvetkov:Yeah. Yeah. I totally agree. It's very tricky. Terms of service are written by the lawyers of specific companies who have their own interests, and this might not perfectly align with what's right and what's wrong to do. People who agree to terms of service might not be aware of what they are agreeing to. So yeah, it's very tricky.

Chris Potts:David, I have one more question that's going to put Yulia on the spot, but before I do that, other things about ethics reviewing from you or the students?

David Lim:I think that should be it on ethics reviewing. Yeah. There are a few other questions that are less related, but we can ask them later.

Chris Potts:Yeah. All right. We'll move into some new topics, but here I can't resist putting you on the spot a little bit, since you're at UW, a hotbed for producing massive language models. Very impressive. So here's my very specific ethics-related question for you. Should we institute or require institutional review board – that is, IRB – oversight specifically of creating large language models?

Yulia Tsvetkov:Yeah. This was one question in the list that I was not sure actually how to respond to!

I don't see the reason to do it actually, because there are clearly ethical risks to large language models, but the IRB will be focusing on university research, when there is already a huge disparity between what can be done with large language models in industry, by people who don't need to file an IRB, and academics, who just don't have resources to train large language models.

So it's not clear that it makes sense to delegate this question to IRB. There are many problems with large language models, not only environmental, there are privacy issues of what data they use. There are bias issues and also kind of accessibility issues – even how many resources we need, fairness issues. With all this, it's not clear that IRB is the right institution to deal with it.

Chris Potts:I totally agree. I think you're hitting on something really important that I think about when people call for IRB review of anything at NLP, which is: I have, on the one hand, a fear that they would just put a dead stop to all of it, like just by asking, "Have participants agreed that their data can be contributed?" and then you've got to audit every single data point in your data set.

Yulia Tsvetkov:Yeah, it's not feasible.

Chris Potts:Or they'll just say, "Yeah, that's completely exempt from our oversight. It doesn't qualify as human subjects research in a way that we care about. And therefore you can do anything you want." So what about a more refined question: should we have specialized IRB style things that are just for large language models, which would do a very particular form of oversight?

Yulia Tsvetkov:I don't know. I don't think so. I don't see why. Are you asking because language models can use Twitter and Reddit and, I don't know, my email data?

Chris Potts:Well, no. I guess I'm just... It's a leading question, but I've been told that they're the most dangerous thing in the field.

Yulia Tsvetkov:I don't think so. I don't know. [Laughing]

Chris Potts:Okay. [Laughing]

Yulia Tsvetkov:I don't feel that they are the most dangerous thing in the field. I would encourage research on interpreting decisions from language models, more technical research. I don't know if I would want to incorporate more policies around it that will restrict research.

Chris Potts:That's my feeling as well – that the community is in a process of figuring this out, as a community, and that centralized bodies trying to do this would just impose a particular set of values that would be very distorting. Yeah.

Yulia Tsvetkov:I agree with you. Yeah. I would be against it, but I'm also not a fan overall of pure research on large language models. So maybe I'm just not too connected to that field. I clearly see how huge language models benefit large companies, help them monetize their products better, help them beat the benchmarks, which gives them good PR. But overall, it's not the interesting long-term research that we want to focus on in academia. That's a very biased view, of course. I understand why people are excited about large language models.

Chris Potts:But this is kind of a nice lead-in to my next set of questions around ethics, because we've been focused on ourselves as a field and how we communicate our scientific results. But now what about systems that get deployed and might do dangerous things? I think that the fact that we all recognize that can happen is a call for scientific innovation as well as cultural innovation. Maybe there are some machine learning and NLP things we need to sort out, but also some things that we need to sort out that we can provide to the world around figuring out whether a model is biased, or figuring out where it's going to be dangerous when deployed. And those are like really central scientific questions.

Yulia Tsvetkov:Yeah.

Chris Potts:So what are the big things that need to happen in terms of innovation there?

Yulia Tsvetkov:Interpretability is a big direction, because without being able to understand why models make the decisions that they make, why they generate the text that they generate, it's very difficult to go forward with analyzing bias and misinformation.

And controllability. If we think about controlled text generation: language models are really fluent, but also they don't have the intent. They don't have good control of whether the content is factual. They don't have good control of whether it's coherent with previous things that they generated. They don't incorporate cultural values. Well, they basically overfit to the cultural values of the majority corpus. So all this is related to being able to understand what they do and then being able to control what they do. And these are not ethical, these are technical questions that, I think, are more urgent.

Chris Potts:I totally agree. Yeah. I just imagine: suppose you're at a start-up or something and you've developed a question answering system and you think it's going to have value for your users and for your company. And then you do the thing of saying, "But wait a second, where could it be dangerous?" And you look to the NLP literature to help you understand where it could be dangerous and where it could go wrong. I feel like we're not currently giving a full enough answer, but certainly interpretability tools would help them enormously, right?

Yulia Tsvetkov:Yeah. Yeah. Then you can define different suites of evaluation, trying to probe it, but you will not be able to probe it for the full range of scenarios that you are thinking about. Having interpretability and explainability of language models is something that can help them understand the models over time. And controllability is a way to dynamically adjust, to incorporate additional constraints into models, so that, if we reveal new problems, they can be addressed without the need to retrain the whole model from scratch just to remove the data points that are problematic. There should be some external control component on what they generate and what they produce, which is independent of the language model.
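A minimal sketch of the kind of external control component Yulia describes: the base language model proposes several candidate continuations, and a separate, independently trained scorer filters or reranks them, so new constraints can be added without retraining the model. `generate_candidates` and `problem_score` are hypothetical stand-ins for a real generator and a real attribute classifier (toxicity, factuality, etc.).

```python
from typing import Callable, List

def controlled_generate(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],  # e.g. sampling from a frozen LM
    problem_score: Callable[[str], float],                  # e.g. a separate toxicity/factuality scorer
    num_candidates: int = 8,
    max_score: float = 0.5,
) -> str:
    """Pick the lowest-scoring acceptable candidate; the LM itself is never retrained."""
    candidates = generate_candidates(prompt, num_candidates)
    acceptable = [c for c in candidates if problem_score(c) <= max_score]
    pool = acceptable or candidates  # fall back to the least-bad candidate if all are rejected
    return min(pool, key=problem_score)
```

The design point is that the control module lives outside the model: swapping in a new scorer (a new constraint) changes behavior immediately, without touching the model's parameters.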

Chris Potts:Interesting. So like separate modules that could provide hard guarantees that a model will or won't do something.

Yulia Tsvetkov:Yeah.

Chris Potts:Do you think, for these things that you've highlighted, that the field currently values them highly enough relative to other things?

Yulia Tsvetkov:I think so. Yeah. I think there is awareness of interpretability research. It's very difficult to do. It's very difficult to do, because if we want to have proper approaches to interpretability, they need to be able to go through the whole training data – not only the parameters, but also the data the models are trained on. And this is not accessible, and also too big to process. So there are some hard search problems, but I think there is enough awareness that interpretability is an important topic for large models, as well as controllable generation. Controllable generation is also moving towards decoding- and post-processing-type approaches, rather than modifying the architectures themselves, which is currently already not feasible. We don't want to retrain ... whatever the name of the recent huge language model is.

Chris Potts:The biggest one? Well, as of when we last checked, I guess it was PaLM at 540 billion parameters, although my slides from the first day of this quarter are already out of date because of that development.

Yulia Tsvetkov:Yeah.

Chris Potts:But this is kind of my concern, which is that we all know about these huge model releases, and that's a feedback loop where you know the companies are dreaming about a trillion or whatever they want to get to. And so we read lots of headlines around that. Where are the headlines going to be around breakthroughs in interpretability? Could that get on the cover of Wired magazine, or only PaLM-2 that has 600 billion parameters?

Yulia Tsvetkov:That's a good question.

Chris Potts:I'm all in favor of the language model stuff. I just worry that it has captured outsized mind-share. It's overvalued relative to these other topics right now.

Yulia Tsvetkov:Yeah, because of the huge PR. And it's important to understand that those models are not released; they're not available even to researchers, and the data and the architectures are not available. So the scientific contribution is really tiny. Other than news articles, we're not seeing anything out of these language models. Yeah. I hope the articles in Wired will be about interpretability approaches finding ethical problems in the models, unfortunately.

Chris Potts:Oh yeah. You do see those headlines around the pernicious biases that they embed. And frankly, I feel that's a healthy antidote to the glowing reports of how large these models are. But I guess those headlines are about the artifact, the problem. What I want is something like the solution, and for that to get headlines. It's not that I care so much about headlines, but what I want is smart young people to want to work on the topics you highlighted in addition to the large language models.

Yulia Tsvetkov:Yeah. I agree with you.

Chris Potts:David, any other questions before I switch gears a little bit?

David Lim:There was a question that's somewhat relevant: given the diversity of definitions of toxicity and the laboriousness of gathering data for each such definition, could a solution that incorporates moral principles directly into language models, through something like prompting, be practical?

Yulia Tsvetkov:Going back to the Delphi paper. Yeah, so I mean, for those who haven't seen the storm: there was a paper on incorporating moral decisions into models – getting them to have some understanding of what's right and wrong to do. The Delphi paper, a project led by Yejin Choi.

I feel it's a very important paper. It had a negative storm – I don't know if I can say it – shitstorm – on Twitter in response to the paper, but this was one of the first papers that actually explicitly acknowledged that, whether we incorporate these moral decisions into the models or not, they are implicitly making these decisions, just without any of our understanding or control. So, I feel, over time, it is important to understand what kind of moral decisions the models make and how to better control for those. So, yeah.

Chris Potts:I'm so glad you mentioned that, Yulia, because this is actually the case I had in mind before, when I talked about this worrisome dynamic where people venture into important but fraught topics and are punished for it. Because from my perspective, what the Delphi team did is take on a hard problem that we have to think about because, well, whether we like it or not, these models are going to spew out text that looks like moral judgments. Then, they wrote a paper and they also did an open API – the maximum-openness thing of enabling anyone who wanted to audit the system, so that if it was dangerous, we would discover it – and they should be applauded for that. And instead, many people insinuated that they were bad people for having created this model, which is precisely the dynamic that could lead the next team of researchers to be less open. It's not going to cause the problem to go away if we're less open.

Yulia Tsvetkov:Exactly. Yeah. I totally, totally, hundred percent agree. Yeah. I also think that what they did was amazing. Of course, it was not perfect. It cannot be perfect. It's a huge, complicated project. But the response that they received was out of proportion in terms of negativity and the attacks from the community. And this goes actually to the original comment that you made, Chris, that people are very involved and there is a self-selection and, as a consequence, a lot of scrutiny on things that people are trying to get right.

Chris Potts:The scrutiny is great. I just don't want it to lead the same team, for the next time, to decide not to do an API because it just caused them grief. And now we don't know as much as we would have about what the system was doing. Yeah.

Let's switch gears a little bit, Yulia. Do you have another few minutes to talk – one more topic?

Yulia Tsvetkov:Yeah, yeah.

Chris Potts:So I'm very curious: you run a large group and the group is also working on lots of different topics. How do you orchestrate this? Do you have a grand plan –

Yulia Tsvetkov:[Laughing]

Chris Potts:Or is it just kind of students coming to you and saying, this is what I'm passionate about, and you just facilitate it, or ... I don't know, how does it work?

Yulia Tsvetkov:I have 10 people. I don't know if it's a huge group. Is it a huge group?

Chris Potts:Ten people at the level of involvement that I know you have in all these people's lives and research sounds very large to me!

Yulia Tsvetkov:I'm very involved. Yeah. When I accept people to the group, I interview very thoroughly. I want to see that the person is very passionate about NLP, that they have some concrete directions that they have thought about. I want to see agency and passion and interest and some higher level understanding of the field.

I also want to see that this is a kind and nice person that will fit into the group.

I really have a heavy interview process. I've turned it into a very people-centric approach. I just try to focus on whatever is interesting to you, whatever we can do to help you find the best version of yourself as a researcher. This is what I'm trying to do. And this leads to a broader research agenda, because people think about different directions and different ideas, which I try to encourage. So I don't have a grand plan. I just try to be very people-centric in my advising style.

Chris Potts:And the students can suggest any topic and in principle you're open to it, or are you nudging people toward a specific set of issues?

Yulia Tsvetkov:I'm very passionate about the set of issues based on which I interview people, so this kind of filtering happens at interview time. But if my student decides that they are now excited about new things they want to try that they never thought about before, I would be very open-minded and try to go in that direction, unless it's a problem that I really don't like – but usually my students know me well enough!

Chris Potts:I was trying to think of a problem, as an example, that you haven't worked on, but I became nervous because you've worked on so many things that I'd probably name a topic you have worked on.

Yulia Tsvetkov:I never worked on syntactic parsing. I never had a paper on syntactic parsing.

Chris Potts:Oh, okay. Well, if it has a social angle, which surely it does, because I'm sure our parsers affect different groups in different languages differentially, then maybe students could convince you.

Yulia Tsvetkov:This could be interesting!

Chris Potts:Well, there's one question about you that I'll kick myself if I don't ask. So, your research is on all of these important topics, and I think it's having a big impact in the scientific community, but in an alternative universe, you could decide to go into industry and use all these lessons to seriously shape a system that Amazon or some large company deployed, and, in that way, have an entirely different kind of impact, presumably making a lot of people's lives better. Are you ever tempted by that alternative industry path?

Yulia Tsvetkov:I don't want to say never, because maybe we will not be able to pay the mortgage! That would be a motivation for me to join a company. But, so far, I'm really centered on academic, longer-term research problems, problems that are not motivated by a product. And also, I believe that the most important product of my work is actually not papers but people, my students, and this is not aligned very well with research in industry. So I'm not tempted. Yeah.

Chris Potts:I love that perspective – that it's not papers but people. I think in similar terms. That's really nice. Yeah.

And actually that's a nice final question. So, students right now, for this course, are doing projects. They're kind of headed into a phase where they're really planning the protocols for these projects. They've presumably picked a core topic already. You've supervised lots of projects like this, it sounds like, in various contexts. Do you have any advice for my students about how to make this successful and rewarding?

Yulia Tsvetkov:Specifically for projects, choosing a project?

Chris Potts:Yeah. For students who might, in particular, be doing one of their first projects in NLP.

Yulia Tsvetkov:Yeah. I think: just pick a problem that you find really interesting. The objective function should be just your passion, not the benchmark that it's clear how to beat, not the problem that might have a bigger impact. Just what is more interesting to you. That's the only thing that I think I could advise.

Chris Potts:Yeah. That's typically the advice I give. And what about them in the day-to-day of doing this project, where they might experience highs and lows when they see results come in? What about the process of doing research?

Yulia Tsvetkov:Maybe the advice that I give myself when I get stuck is just to think outside the box. When I go to conferences, I try to attend the tracks and talks that are as far as possible from my research directions, just to learn, just to be more open-minded. And same with research: if I'm working on something and I'm feeling stuck, I just start reading broadly and thinking about the problem more broadly, just looking in other directions, just to get out of the kind of box and the narrow thinking that I have right now.

Chris Potts:Yeah. We try to encourage that. I guess the field really encourages students to think about what's a safe bet in terms of a leaderboard, let's say.

Yulia Tsvetkov:Yeah, they are.

Chris Potts:And it makes them nervous to try something that might not succeed. And we try to instill the value that it's about the hypotheses and the exploration, but I can feel myself having to fight back against all the leaderboardism.

Yulia Tsvetkov:Yeah. I also have this challenge, because often, when people come to my group and we start working on projects, these are usually hard projects, and I often feel guilty because I could give them really low-hanging fruit and we could publish a paper quickly, but instead they suffer for a year to get to this good paper, their first paper. I also go back and forth on this, but overall I prefer big, open-ended problems, and not necessarily focusing on these benchmarks.

Chris Potts:I think that's the right way to pitch it, which is like, sure, there could be satisfaction in getting epsilon higher on a leaderboard, and that could be very exciting. But it's also exciting to be in a wide-open space where we don't know what solutions should look like. You might have to live with pretty poor-looking performance numbers, but each one of those might be teaching us something that could lead to a real success.

Yulia Tsvetkov:Definitely. And this is really how we start actually enjoying doing research, or at least this is how we know whether doing research is our thing or not: just trying out these kinds of problems that are really open-ended.

Chris Potts:I love it! That's a beautiful way to end! Thank you so much for doing this, Yulia! Thank you to David for all the questions and to all the students out there who passed questions to David. This was a really rewarding discussion. Thanks again!

Yulia Tsvetkov:Thank you. For me, it was really nice. And thank you so much for inviting me, and really nice to meet you all. Thank you, Chris!

David Lim:Yeah! Thank you so much!