Podcast episode: Diyi Yang

August 1, 2022

With Chris Potts

Moving to Stanford, linguistic and social variation, interventional studies, and shared stories and lessons learned from an ACL Young Rising Star.

Show notes

Transcript

Chris Potts:Welcome, everyone, to the CS224U podcast. I'm delighted to welcome Diyi Yang to the podcast today.

At this moment, Diyi is a professor at Georgia Tech, but I'm proud to say that she will be a professor here at Stanford starting in September of this year. We at Stanford are absolutely delighted that she is making this move.

Diyi is a prolific scholar with wide ranging interests and with goals that range from pure scientific inquiry to technology development, always in the service of improving people's lives. She's worked on topics in bias and representation on the web and in our own NLP technologies, and she's contributed numerous benchmark data sets to help us move toward developing more equitable systems.

In doing this work, she really spans the fields of NLP, computational social science, and human computer interaction. This is just a really inspiring model for interdisciplinary work.

So, Diyi, welcome to the podcast! Thanks so much for doing this and welcome back to California. I say welcome back because I recall that you visited Stanford in 2018 to work with Dan Jurafsky. What were the circumstances of that visit?

Diyi Yang:Thank you, Chris. Thank you so much for having me. I'm also very happy to be here. Yeah, back in 2018, this was before the pandemic, I was a visiting student with Dan Jurafsky, right after my thesis proposal. Those days, I think, internships were more popular. But at that time we were thinking that, instead of an internship, being a visiting student might be a better fit for me. Dan was on my thesis committee, so I was hoping to get more opportunities to discuss my thesis work with Dan. And then also get to know more about the Stanford NLP Group as well as enjoy the summer in California.

Yeah, it was in the summer, and I remember the campus was less crowded, and it turned out to be very helpful for my thesis work and writing. And I got a chance to share my work with the NLP group and other students and receive feedback. The NLP seminar was extremely helpful. I remember there were talks from both internal and external speakers. In addition to the thesis, I also got time to brainstorm other topics and ideas. I remember I came to you for a discussion on something related to implicature in language.

Chris Potts:I remember the meeting well!

Diyi Yang:Yeah. I really enjoyed my visit that time.

Chris Potts:And so you already have a sense for the campus and the NLP group, although it is quieter in the summer. When you get here in the fall, you're going to find that it's just aswarm with undergrads on bicycles who are not paying attention to where they're going. So keep an eye out.

Diyi Yang:Thank you.

Chris Potts:And it really is wonderful that you're going to come out to be a CS professor here and a professor in the Stanford NLP Group. Aside from this being the best place in the world to do NLP, what attracted you to come to Stanford?

Diyi Yang:I mean, Stanford is a great university and there are many dimensions and factors for why this is a great choice for me. As I just described, through my visiting student experience, I got to know the NLP Group a little bit more. To me it's very friendly and supportive. Sometimes they feel like a big family when I look at faculty and students interacting with each other. And beyond the NLP group, the new Stanford HAI aligns very well with my interests, with what I believe AI technology should be or shouldn't be doing. And my own work is, as you described, at the intersection of NLP, people, and technology. So my students and I, we really care about how we can build NLP to have a positive impact on people and society. So I think the interdisciplinary atmosphere at Stanford, this type of hub or community, can really provide us with opportunities to push our technology a little bit further.

Yeah, I guess it will be a great fit for us to do interdisciplinary work. I really enjoy the process of talking to people from diverse backgrounds to figure out what the practical problems are and then get inspiration, working together to do something useful.

I have to admit that the location is very attractive to me as well, beyond the connection to industry in Silicon Valley. Many of my undergrad and grad school friends, and close friends, are actually in the Bay Area. So it always feels like a second home when I visit. And also outdoor activities. So I'm quite excited about life in the Bay Area.

Chris Potts:This is all so great to hear. Yeah, so I'll reiterate with you that I think the Stanford NLP Group is a wonderful environment in part because it does feel just so genuinely collaborative. I think there's just consensus that we can do more and different and more important things if we work together, and you see a lot of that in the group.

And then in terms of your own research, I have so many people that I would like to introduce you to because, as you know, we have a whole contingent of people in the School of Education who are doing NLP technology education stuff. There are people in the business school who are doing really exciting things related to technology and organizational behavior that you might want to connect with. Management Science and Engineering has this whole computational social science component to it, in addition to the kind of biz school connections. What am I leaving out? Of course, the whole world of CS and of course, last but not least, I want you to meet our sociolinguists, because I think you'll have a lot of common cause with them around how we can build language technologies that are more sensitive to variation and aspects of personal identities. I don't know – I could set up a lunch with you for every day in the fall quarter and still not run out of people that I think you should meet!

Diyi Yang:I would love to have that. Thank you for sharing that information.

Chris Potts:And I actually want to make some of those connections with your projects as we talk, but before I do that, I can't resist asking – are there things that you're going to miss about living in Georgia?

Diyi Yang:Oh yeah, of course. I think Georgia is great and somewhat undervalued by many people. I have a lot of friends there. Even during the pandemic, we were able to meet frequently. So definitely people and friends. I'm a foodie, so I really like Southern food and sweet tea, or unsweet tea – you have to have that with every meal. In Atlanta, we actually have an area called Buford Highway that has lots of restaurants and choices. And if you drive over on a weekend, a few hours take you to Savannah. That's another foodie city, with a rich history and coastal landscapes. I also feel like the weather in Atlanta is very attractive, because in summer you'll get a thunderstorm almost every afternoon, and that feeling is very refreshing. So that's something I will definitely miss a lot, as it's very different from a California summer.

Chris Potts:That certainly is different because here, where we are in California, in the summer, there is essentially 0% chance of any real rain. It's just nonstop sun. The nice part is that it always cools down because it's not humid. So the evenings are just endlessly pleasant I find. As someone who grew up on the east coast, I feel like the summers here are kind of magical in that regard. The heat in the day and then a kind of relaxation of that in the evening. Yeah. Very cool.

Let's talk a little bit about some projects that I think do have exciting connections with work that's happening at Stanford. I thought I'd start with the VALUE project. This is really inspiring. Do you want to say a little bit about the project goals and how it came about?

Diyi Yang:Yeah, sure. So we actually didn't start with the VALUE project itself. We were working on something else. I work on, as you mentioned, bias and hate speech, etc., in online digital spaces. We were doing hate speech detection where, given a specific text, we want to classify whether it's hateful, abusive, offensive, or not. We found that, in this process, there seems to be a strong bias against dialects or slang. If people speak certain dialects, the system may categorize their text as abusive or hate speech even if it's not. And that introduces a big problem for many, many applications. So think about online moderation – these days, social media platforms use moderation algorithms to remove content automatically. So there is a strong bias. And more broadly, I think these systems have a hard time understanding varied speech and language patterns. This is not only for specific groups, but for the many different personalities and cultures we observe.

So, in that hate speech detection work, my students and I realized that there is this kind of bias, and one of my students speaks Black English, so she was a strong advocate for this line of work on more equitable technology for African American Vernacular English. And then we turned to building these resources because we were thinking, "Can we do something like stress test our current technologies or models or data sets on different dialects?" So that's how we moved to these stress-test data sets for dialects. And there are many dialects; we started with African American Vernacular English. So that's how we got to this work.

What I really like about this work is that we actually took a participatory design approach here, which included several co-design sessions with native speakers.

The thing is that we, as researchers, sometimes work on a language where we really need a domain expert to share with us what's going on. One notable thing in this participatory design approach is that we worked with a local organization in Atlanta called Dataworks. Dataworks recruits people from economically disadvantaged neighborhoods and under-represented groups and trains and hires them as data developers. So we basically recruited six people from this organization and had co-design sessions where we talked about, "How could we make a technology useful for your community? How can we build resources for fair comparison?" And then we collected the concerns and feedback from native speakers.

After that process, I think we had a richer understanding of the problem space – what we could do, what we should do, and what we should not do. And then we turned to linguistics. So it's actually a combination of both linguistic knowledge and user feedback. We looked at the grammatical structures associated with Black English, first did some transformation of Standard American English into Black English, and then asked native speakers to check whether it's reasonable, whether it's socially acceptable, etc. So that's the process behind it. And then later, when we built models on this, or when we tested existing models on this VALUE corpus, we saw there is a strong disparity. And even if you do some sort of adaptation, the gap is still there.

Chris Potts:It's really cool that you recruited those native speakers to participate and kind of help with this translation process. And the result of all of this is a kind of GLUE benchmark but for African American English, right?

Diyi Yang:Yeah, I think you can understand it in that way, but one thing I need to share, just as a limitation of this type of work, is that in this translation process we can only handle some of the syntax and morphosyntax. There is a lot to do with pragmatics and semantics. So we are not able to have very authentic language represented here. So instead of viewing it as a GLUE benchmark – I think a stress test of the GLUE benchmark might be a more accurate description of what it is.

Chris Potts:Right, that makes sense. And so do you find when you use it as a stress test, that there are some systematic limitations of current models with respect to African American English?

Diyi Yang:Yeah. I think certain linguistic patterns – for example, there are these negative concord constructions, where you use two negative morphemes to express a single negation. On that one you see models struggle a lot. We actually looked at existing models on these Black English versions of GLUE, and you see all sorts of error patterns popping up from this process.

Something I feel like we heard a lot in the co-design sessions, by interacting with speakers, is that, to them, the GLUE version of their language is not a good way to represent what's going on, because when they look at the sentences, it's a little bit awkward. You probably won't see a New York Times article written in Black English, so that type of distance might limit what this type of corpus can represent. There are also a lot of nuances. For example, dialects evolve over time, and people might comment on something like, "Oh, this pattern is cool. It's valid. However, I only see it used by my grandparents, or I only see it in data from previous generations," etc. Or, "I'm in Atlanta, and people in Chicago or people in Los Angeles might say it differently." So there are different patterns we observe. And I think those nuances might also introduce challenges for NLP technologies.

Chris Potts:Sure, sure. I mean, I guess that's a pervasive issue with benchmarking in the way that we do it.

Diyi Yang:Yeah.

Chris Potts:In the presence of linguistic variation, which is everywhere in the world.

Before we leave the VALUE topic: does it connect back for you with that problem you identified about hate speech detectors disadvantaging precisely the groups that they were meant to help out? I think that's a common cycle we've seen in that research. Do you see a way out of it?

Diyi Yang:Yes and no. It's actually more challenging than we imagined. When we finished this first version of VALUE, we went back to see how we could leverage it to mitigate some of the biases we see with hate speech detection. My students presented it this year at a conference. The takeaway is that, if you remove certain lexical items, it will reduce the disparity for hate speech identification, but it didn't reduce the disparity for abusive or offensive language detection.

We also tried another angle: since in this work we introduce different transformation rules from Standard American English to African American Vernacular English, you can actually choose how densely to apply them when converting a sentence. Some of the very pervasive patterns actually didn't reduce the disparities. So I think there is a long way to go in the context of hate speech. I hope we can reach a more nuanced understanding of the space.

Chris Potts:I mean, that sounds like a central insight there is, that we really have to think about the context in which all this language is being produced, as given by the social network and everything else that's happening around whatever posts we're trying to filter or flag and things like that. And that seems really consonant with your whole take on research, which is that we could do NLP but in its broader social context.

Diyi Yang:Yeah. I agree.

Chris Potts:For the benchmarking stuff, I find that interesting as well, because it does look to me like, in your research, you and your group have released a lot of benchmarks, often as a way of kind of pushing people to work in a new area or re-conceptualize a problem. VALUE is one of these, but also your new ACL best paper award winner, "Inducing positive perspectives with text reframing." This is another case where I think you're releasing a benchmark in the hope that will attract people to think in a particular way. What's your take on the benchmarks you all have released and maybe benchmarking in general in the field?

Diyi Yang:Yeah, this is an excellent question and something we have been thinking about all the time. As you probably noticed, the research we are doing is sometimes very new, in the sense that we need to provide something for people to play around with. Personally, I think a benchmark is like an entrance, or a doorperson, for our models to enter a bigger world. It's really a starting point. It's very useful for many reasons. That's why we use it. You can actually get very representative examples from experts of the problem space and the directions for our model development and evaluation. You can also do apples-to-apples comparisons. So I think a lot of the progress in our field is driven by good benchmarks and data sets. That's something we should value in the first place.

Having said that, I think there are also downsides to this type of benchmark. And you also discussed this in your Dynabench work. Most data sets are static, sometimes suffering from not being able to represent the entire space. There might be data set artifacts or biases, etc. Overall, I still think benchmarking is valuable, as in transfer models. Actually, the ACL work on positive reframing is very close to my heart just from the nature of the work. The task is to generate a positive perspective for a given negative narrative. If I use the system to produce a positive reading of your question – I didn't do this with the system – then instead of asking whether benchmarking is a good force in the field or a bad one, we could ask – this might be suggested by our system – how could we make benchmarking a better force for the field?

I think Dynabench and data set cards are definitely great initiatives in this direction. We need more adversarial cases, or adversarial thinking, for our benchmarks even before the data set is constructed, and then we can construct them in this kind of dynamic way.

I want to offer two thoughts from a more HCI, user-centered perspective. One direction I've been thinking about is how we could build benchmarks from a community perspective. If we think about Wikipedia, all the articles are contributed by millions of volunteers. So then the problem is, how could we encourage people to contribute data points for specific tasks, to share data sets, and so on? Of course, there are privacy concerns and all sorts of considerations we need to address before implementing any portion of this idea.

The other part I find fascinating – we are doing something in this space and hopefully will release data later this year – is to have a game engine where people play certain games to generate the data. In that scenario, people are actually excited to produce data points, probably in a more natural way.

Chris Potts:That also really resonates with me because one frustration I've felt with the way we develop benchmarks and the associated tasks is that we essentially have people do machine tasks, right. Instead of having them reason about language, we say "Here's a premise and hypothesis, give me one of three labels." Or instead of answering a question, we say, "Find a substring of this passage that you think corresponds to an answer," because we know it will help the systems that we want to develop. Whereas reasoning in language, question answering in language, these are free-form, social, interactional things, and we're making it too easy for our systems by leaving all of that stuff out. And probably also getting benchmarks that are just intrinsically less valuable because they were created in such an artificial way.

When HCI people see what we do to create our benchmarks, when they see our MTurk templates and stuff, do they react with horror or what, do you have a sense?

Diyi Yang:Well, I'm not in a good position to comment on this. I'm more in the middle, but I do see the different focuses of each community. I think from the NLP perspective, we pay more attention to agreement and the answers themselves, and that sometimes involves all sorts of nuances, etc. And then on the HCI side, I think they have a stronger focus on user needs and user opinions. There are both pros and cons to each methodology or way of thinking.

Chris Potts:Sure. That's very generous. I see you are positively reframing these statements! But for the agreement thing, we just talked about all this language variation, variation in social identity. In forcing ourselves to care so much about agreement we are basically making sure that we don't see any of that important variation in our benchmarks. And actually we're probably sort of in some sense just penalizing it in a way that's damaging if you think about having truly socially aware NLP. So something more free form like playing a game, solving a task together, and seeing what goal oriented language emerges from that does frankly just seem like a better avenue. Much more demanding for the data set creator though.

Diyi Yang:Yeah. I definitely agree. I think there are also some other dimensions we need to pay attention to, or at least have a mind-set to be prepared for. For example, in all these types of free social interaction settings, we actually don't have agreement if we think about very nuanced societies. However, right now in our annotation or benchmark creation process, we emphasize so much, "Oh, I need to pass that threshold to claim that I have agreement," but I wonder whether that's a valuable goal to pursue in the first place and whether we should pay more attention to understanding the disagreement popping up in this process. Does that tell us something deeper about the task itself? Is that a reflection that we are not getting what we want in the output? Those are the types of things we need to be prepared for before we collect data in this real-world or adaptive fashion.

Chris Potts:Yeah. Very cool. There's another dimension that I'm curious about here, and we kind of already touched on this with VALUE, because with VALUE, you recruited essentially a team of specialists, domain experts who could speak the languages that you cared about. And that was really crucial to the project. For other benchmarks that you've created, you've done what most of us do and rely on MTurk. How is that going for you and do you think we need some alternatives? What are your thoughts on the platforms that we're all using for this?

Diyi Yang:Yeah, that's another great question. My first reaction is that if you simply put a task on MTurk, that's not going to work. You need some level of best practices incorporated into that process. This is not only for the nuanced social constructs we are looking at. Even for a regular NLP task, like topic identification, you may want to provide a common ground for people to make judgments.

And I think there are a few best practices people use to recruit trustworthy workers. For example, you can look for annotators who have done a lot of work, have a certain amount of experience, and have high approval rates. You can also leverage – on Amazon MTurk, at least – a function called a "qualification," where you can post a practice version of the task and then manually review the results to see which annotators are qualified to work on your task.
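[Editor's note: for readers who want to try the qualification workflow Diyi describes, here is a minimal sketch using the boto3 MTurk client. It is not from the episode: the qualification name, worker ID, payment, and task details are illustrative placeholders, and the two long numeric IDs are Amazon's documented built-in system qualifications for approval rate and number of approved HITs.]

```python
# Sketch of the "qualify first, then open the real task" workflow on MTurk.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# 1. Create a custom qualification that we grant by hand after reviewing
#    each worker's performance on a small practice round.
qual = mturk.create_qualification_type(
    Name="my-annotation-practice-round",        # illustrative name
    Description="Granted after manual review of a practice task.",
    QualificationTypeStatus="Active",
)
qual_id = qual["QualificationType"]["QualificationTypeId"]

# 2. After reviewing a worker's practice submissions, grant them the
#    qualification so they can see and accept the real task.
mturk.associate_qualification_with_worker(
    QualificationTypeId=qual_id,
    WorkerId="A1EXAMPLEWORKERID",               # illustrative worker ID
    IntegerValue=1,
    SendNotification=True,
)

# 3. Publish the real HIT, visible only to workers who hold the custom
#    qualification and also meet approval-rate / experience filters.
mturk.create_hit(
    Title="Label the topic of a short text",
    Description="Read a short passage and choose its main topic.",
    Reward="0.50",
    MaxAssignments=3,
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    Question=open("question_form.xml").read(),  # your QuestionForm/ExternalQuestion XML
    QualificationRequirements=[
        {   # our manually granted practice-round qualification
            "QualificationTypeId": qual_id,
            "Comparator": "Exists",
            "ActionsGuarded": "DiscoverPreviewAndAccept",
        },
        {   # built-in system qualification: HIT approval rate >= 95%
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],
            "ActionsGuarded": "DiscoverPreviewAndAccept",
        },
        {   # built-in system qualification: at least 1000 approved HITs
            "QualificationTypeId": "00000000000000000040",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [1000],
            "ActionsGuarded": "DiscoverPreviewAndAccept",
        },
    ],
)
```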

You can also do pilot studies. Before you do something bigger, have something smaller to test and to iteratively refine your task. There are many very nuanced things to care about in this process. For example, good UI design. This is something we often ignore, but I have found that good UI design can actually ensure the annotators do a better job. Because if it's easy to use, if it facilitates the task, then it can lead to much faster and better outcomes.

One thing I think we need to pay attention to is fair payment in this process – paying our annotators reasonable compensation, etc.

Since I work on these very nuanced social constructs, something very specific to our tasks is that sometimes we need to know annotators' self-reported backgrounds a little bit to better interpret their annotations. For instance, if we want annotations for something ideology related, then we probably need to know what type of background people have in order to interpret their judgments, etc. So there will be some background information or questionnaire we need to ask for. And that might trigger IRB review or not. That's an entire topic we can chat about more, but basically the relation between the IRB and MTurk is quite complicated.

Chris Potts:Oh, well let's return to that. But one thing I wanted to surface there is: we do what you do, which is we have some qualification tasks essentially and then we have pre-qualified people and typically they're the only ones who can see the tasks we launch. And it's a group I would say of about 500-600 active users who do outstanding work. When people complain about MTurk quality, I always think of this group of people who routinely do fast, high quality work for us. And that's really exciting because it means that we can basically just use all the data that we get and not have to worry about all sorts of complicated filtering and stuff. So that's the good side. The bad side of this though, is that this is probably a very particular group of people. And so I certainly couldn't use them as a representative sample of really any population, right? I can use them if I just need good labels for a task, but not if I want to understand a social process. And when the social process part comes in, I feel much less sure about how to properly use MTurk.

Diyi Yang:Exactly. So I think that's a really good point. One big disadvantage of this type of platform, like MTurk or Figure Eight, is that there's no way for researchers to talk to annotators. If you observe something abnormal, let's say not as expected, you don't have a good understanding of what's going on. So it's really hard to collect feedback for these nuanced social constructs. There are alternatives. For example, there are platforms like Upwork, where you can actually work with more specialized annotators, domain experts. Those are more like gig work. And then there is Prolific, which is also a platform where you can hire annotators, etc.

In our work, in my own work, I actually work more with domain experts. For instance, when we were working with the American Cancer Society, on the social support expressed in people's messages, we actually worked with students from the nursing school. For VALUE, we worked with native speakers, etc. So I think the choice of annotation platforms or annotators needs to be connected to the specific problem. And then who gets to make that decision, who is represented in the process – those are all very important dimensions to consider. I like your point that MTurk only represents a specific group of users. I think this representation issue is something we really need to improve in our data set collection.

Chris Potts:Yeah. And it's maybe even worse than that, because there might be a whole lot of really well-meaning Turkers out there who are doing careful work, but we tend to regard it as bad work because they don't understand something about the task we posed or the language they speak is subtly different. And those are the people that I have in mind when I think about exclusions that would distort a scientific picture, because it's like, I claim that this is a systematic property of how people reason in English and I'm just wrong because what I've really done is exclude all the people who implicitly don't agree with my prior hypothesis about how reasoning happens in English, right. That's the worrisome cycle there.

Diyi Yang:Yeah, I definitely feel the pain. I think the annotation process needs to be more interactive. You can have channels to talk to your annotators. When we were collecting the Table-to-Text data set, we actually had regular meetings with our annotators on a weekly basis, and we also calculated and monitored agreement on the annotations frequently. You can see that sometimes, if you don't talk to your annotators for two weeks, the annotation agreement drops a little bit. And if you talk to them more and have more discussions, the agreement increases a little. Table-to-Text is – I guess compared to social interaction – a more objective task, but still we observed patterns like this.
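[Editor's note: a minimal illustration, not from the episode, of the kind of weekly agreement monitoring Diyi mentions, here using Cohen's kappa from scikit-learn; the labels and the 0.6 threshold are made-up examples.]

```python
# Weekly agreement check between two annotators on the same batch of items.
# In practice you would pull each week's overlapping annotations from your
# annotation platform; these label lists are illustrative only.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["sports", "politics", "sports", "tech", "tech", "politics"]
annotator_b = ["sports", "politics", "tech",    "tech", "tech", "sports"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa this week: {kappa:.2f}")

# Simple rule of thumb: if agreement slips below some project-specific
# threshold, schedule a meeting with the annotators to revisit guidelines.
if kappa < 0.6:   # arbitrary illustrative threshold
    print("Agreement is slipping -- time to talk to the annotators.")
```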

Chris Potts:And what are your feelings about the IRB when it comes specifically to data set collection in NLP?

Diyi Yang:Yes, I think the complicated part is with MTurk. For IRB in general, I think these days many US universities provide a decision tree to help you decide whether IRB approval is needed for your task or not. And usually it's about interacting with human subjects: if you want to study human subjects, you need IRB approval. However, as we see in our field, the tasks are getting more and more nuanced. Sometimes tasks might be very challenging for people to work on. Think about annotators helping us with depression labeling, categorizing the depression level of each post, or deciding whether a post is hate speech or not. Doing that job for several days is not going to be healthy for their well-being. So that's at the boundary of studying human subjects, or having interactions with human subjects that might introduce risk. In general, I think the IRB is required when someone studies Turkers – let's say you want to get self-reported information from Turkers; then you probably need approval. However, for the very sensitive cases I mentioned, you may also want to start the IRB process to get feedback, or to think carefully about the potential risks in the process and what procedures you should follow when such things happen.

Chris Potts:My own view would be that I think it would be fine to require IRB approval in some sense for all the data collection that we do. And the example you gave of depression is really wonderful, because I think that forcing IRB oversight of that project would force you as a researcher to make a case that the benefits outweighed the risks. And if you couldn't make that case, if you were like, "Oh, we want data on depression, just because it would be a fun 19th task in our benchmark," maybe that couldn't justify the risks to participants. But if you had a compelling case that you could improve mental health for people, then the IRB would probably approve it because those risks would be tolerable. And that just seems healthy for us to have to articulate. I have much more mixed feelings about IRB oversight of NLP research in general, but for the data set collection we do, which really ought to be like a psychology experiment, I feel like, "yeah, why not?"

Diyi Yang:Yeah, I totally agree with the risks versus benefits argument. I think in addition to that, we also have to prepare for worst-case scenarios. If people feel uncomfortable in the annotation process, what would be the best practice? If people, let's say, get affected a little bit by the task, what would we do? We need all sorts of solutions or actions ready for whenever a "what if" happens. I think that's also very, very important before we start the data set collection.

Chris Potts:It even just reminds you that there are people out there, so that you don't slip into that habit of thinking of it as a magical service, the way the name might suggest. No, there are people there with feelings and reactions and everything else, and we should be thinking about that all the time, and the IRB could maybe nudge us in the right direction there. And I will say, you're soon to arrive at Stanford. My own experiences with the IRB have been fine. They're very supportive of the kind of research that we do. And so I've never really, at Stanford, thought of it as an obstacle that I needed IRB approval for this stuff. And I've always felt good that I can give people contact information if they feel like they were unfairly treated in the study and things like that.

Diyi Yang:Yeah.

Chris Potts:Let's pop up one level. One really interesting aspect of your research is that you do often go beyond just the NLP experiment and really think about intervening in a social process on a website or something like that. Which really is research with a different character than I'm used to. What is it like to do that research? What special considerations do you need to bring to bear if you're actually going to change the world actively as part of the project?

Diyi Yang:Yeah. Oh, that's a great question. It was actually a very slow process for me. I picked up this intervention approach gradually, over a very long time. Especially in my earlier days as a student, I wasn't able to see the big picture of why we want to do a specific project and where my project was meant to go or land, etc. Later I got exposed to HCI and social science problems and projects. Those experiences inspired me to think more about the role that our solutions play in solving real-world problems and to get more into the human side. So something I realized in this process, by interacting with people, is that we really need to focus on those problems first before we develop solutions, rather than finding which problems can be a good test bed for my algorithms.

It took me a while to get that part clear when I was a student. I think doing interventions is very important in general. We get the chance to experience the entire cycle of finding what's really important and what's most needed. You talk to the experts and work together to formulate the problem, and then there's our favorite part, developing the algorithm or solution. The evaluation is not trivial at all and is sometimes more challenging. It's not only a single metric. We need to think about what a good model is, what that means for a real problem, and what the evaluation should look like. It might be an aggregate of many, many factors, and, more importantly, who gets to tell us the evaluation results?

To give a concrete example, we built a recommendation system to recommend content and caregivers to cancer patients for the American Cancer Survivor Network. There was a wide range of factors that we needed to consider beyond accuracy or click-through rates – for instance, efficiency, cost, speed, memory, new users, fairness, all sorts of things. It became a much bigger space for me when I was trying to deploy my fancy algorithm on a real website. It's a much bigger problem space.

I think it pushes us to think deeply about the purpose from the very beginning. And, more importantly, these days we sometimes pay attention to things like, "Oh, there is a three percent improvement." In those intervention scenarios, you really need to understand what the improvement introduced by the algorithm actually means in a real-world setting. You also get a lot of natural adversarial attacks in this scenario. So it's going to be a trade-off among many factors.

Last but not least, I want to say something I strongly believe: one very big impact of our research, of NLP research, is its use in a meaningful way for people. It should not stay just on arXiv; it should also be in the hands of people.

Chris Potts:Very cool. And all these themes remind me, you have a wonderful short piece. I'm not sure of the history of this piece. I'd be curious about that. It's called "Six questions for socially aware language technologies." Actually, before I ask my specific questions, where did the piece come from? It's like a position statement that you wrote.

Diyi Yang:Yeah, that one is actually a letter to a journal. It started from a survey. We had a survey paper before it called "The Importance of Modeling Social Factors of Language" with my collaborator Dirk Hovy. We got the impression that our current systems really break down – if you look at specific studies, they actually fail when we try to interpret social factors of language, and that limits our applications. So we wanted to categorize, or formulate, the social factors that might be important for language understanding from a very practical perspective. And then we introduced a real-world taxonomy where we look at speakers, receivers, their social relation, in what context, guided by what type of social norm, in which culture, and for what type of community goal, etc. That type of practical taxonomy provides us with, I guess, a guideline for thinking about what's going on and how we could improve our systems by introducing those social factors.

That was the background for this piece. And then, when we were working on that – I always wish we had more space in papers – I came up with many fascinating questions that I don't know the answers to. And to be honest, I don't have solutions for the six questions I ask in my letter. So I wrote it as just a two-page letter to share those questions and some thoughts I have in mind. So that's where it came from. I really like that piece. I think it's a new format: I'm trying to ask my questions and then trigger discussion with people.

Chris Potts:Well, I love it. And we've covered a few of the questions already, because they pertain to sensitivity to variation, multilingual aspects of the systems we build, some aspects of the kind of interventional work that you want to do. One question in particular really intrigues me there. "Should social NLP models passively learn or proactively experience the world?" And I believe that you suggest that these kind of adaptive systems will learn better, and that's probably true, but I do want to ask a pointed question there. Do you see some dangers that might come from that kind of adaptivity to us, and to the environment, for these future systems?

Diyi Yang:Yes, definitely. First, going back to the question itself. I think it actually comes from thinking about the human learning process. I know this may not be the perfect analogy here, but if we think about human learning, we experience the world. We receive feedback either explicitly or implicitly. And whenever there are things we don't know, we can ask questions or we can interact to figure it out. Especially for nuanced social constructs, they are very context dependent. So in order to figure out what's appropriate or what's not – a static data set is not going to give us that. And from that perspective, I sense this adaptive dimension will be a next step to try. My feeling is that, as you mentioned, we may not be able to go there unless we solve a lot of the factors or all sorts of issues that we have so far.

I want to think about it from the other way. So let's say we want to develop some agents to pick up social capabilities, by interacting with people in a given environment. The space of all sorts of interactions is going to be huge. There might be bad behaviors. There might be some sort of inappropriate behaviors. So we have to make it in a controllable way, to make it responsible and meaningful for the capacities we want the model to learn. So to me that type of planning or that type of control will help with certain dangers we might foresee. So we really need to design specific actions we want to reward or other actions we want the model to avoid in that process. It's a very early stage idea. But I think letting models learn in those socially rich and socially situated environments seems a fascinating direction to consider.

Chris Potts:Yeah, so that perfectly covers my range of feelings, because I totally agree that big breakthroughs for these systems will come when they learn via interaction with other agents, in particular with us. And on the positive side, that might be crucial to them being sympathetic, to understanding what the human experience is like, and that should make them maybe more ethical.

On the other hand, exactly that adaptive capability is what will enable them to manipulate us in all sorts of highly customized ways and also allow us to manipulate them because after all, at this point, you're kind of talking about a little quasi-human agent, like a child. And we know we have lots of influence over the patterns of behavior and action that they form. And it's going to be the same thing in this new technological realm. And that might worry you enough to think that we really ought not to be developing adaptive systems, but ...

Diyi Yang:Yeah, sometimes just thinking about ideas broadly – it's very, very cool, and yeah, definitely an important research direction to think about, because so far I feel like we are still at a stage where we either teach models what to do by putting examples into a data set or we use a fixed corpus for them to learn from. That's limited in some ways. We probably need a new paradigm to break out of it and then bring some new experiences to our algorithm design and technology.

Chris Potts:Yes!

Let's switch gears a little bit, because I definitely want to ask you about some of the fun things that are on the horizon for you. One in particular that I'm really interested in – you and I chatted about this when we last met – is this "Workshop on Shared Stories and Lessons Learned," which will be at EMNLP. This looks like a really fun, refreshing new kind of event. How did it come about, what are its goals, what do you all have in mind for it?

Diyi Yang:Yeah, I'm very happy to chat about that workshop. I think we all believe that the driving force of the progress in our field is the people behind the work – people working together producing those big ideas, etc. So we learn a lot from their work. The idea, or the goal, of the workshop is to see whether, instead of having all our workshops be about the work itself – presenting papers, posters, etc. – we can have a workshop that focuses on the people behind the work: get to know more about the principles or strategies people use behind their good work and then, more importantly, the roadblocks, challenges, mistakes, or lessons people learned in the process.

I think it will be very valuable for many researchers across different career stages. For example, I myself, like many junior faculty, often reach out to senior people around us for advice and feedback. And we can imagine fresh PhD students reaching out to early-career researchers or senior grad students, early-career people reaching out to mid-career and senior professors, and, in companies, people first entering the field reaching out to industry leaders.

I also want to mention that such resources are often not available to everyone. Some people may not have the network to get feedback or help, or maybe only a few people are approachable to them. But this process is very important. Sometimes, if you don't have that type of resource or network, you may end up missing opportunities or information. So the goal of this workshop is to see if we can make the sharing of researchers' stories and lessons learned accessible to everyone in our community. And we hope that this is going to be inspiring – helpful for people who might be struggling with making a choice or feeling lost right now.

We actually did a similar workshop at ICCV last year, during the pandemic. The pandemic triggered many different types of emotions for all of us. We found it was well received by participants. So this year we are setting it up for our NLP community, to be more specific and more representative of the different aspects of our research. Currently our goal is to make it topic oriented: for each topic or research direction we care about, we line up speakers to share all the sorts of things I just mentioned. There will also be panels and Q&A discussions to interact with the audience, and mentorship. It's going to be a new format. I'm actually excited but also nervous to see how it might go.

Chris Potts:Well, it really resonates with me because this is exactly why I started doing this podcast. I was teaching in the spring. The course content is mostly in screencasts now. I wanted to do something with our in-person meetings, because they got to be in person at long last. I didn't want to just have more guest lectures, because for the reasons you just pointed out, those send a kind of mixed message, because it's always someone coming in and telling a story of success. Whereas the students are having to do projects and feeling a much wider range of emotions around research. And so I thought, "Okay, I'll give a behind the scenes look." I'll invite people like you to come and share their personal stories and their perspectives on the field. And I think it really did resonate with people, and I certainly enjoy it, because I appreciate hearing about more than just the very thin surface of research outcomes that we normally get from each other. And so I'm so glad that you're doing this in the context of a meeting like EMNLP. I think it's just wonderful. Are the details up on how it will be structured and who's talking? When I last checked they weren't. Have you finalized all the details?

Diyi Yang:We are still in the process of setting up speakers and then different formats to make it more accessible since it's also going to be a hybrid conference or hybrid workshop. So we want to make sure different components can be set up smoothly. So I think by next month or so we will have a more detailed schedule and I can share that with you.

Chris Potts:Cool. Do you yourself have a shared story or lesson learned that you want to share with us now that would be appropriate for the workshop?

Diyi Yang:I'm not part of the speaker team, but I might consider doing something like that. I really like podcasts. These days I actually listen to all the previous episodes while I'm doing my daily walk around the neighborhood. So I might consider doing something. It's not fixed yet.

Chris Potts:I don't want to put you on the spot, so feel free to say no, but you told me about your experience of being one of the spotlight speakers for the Young Rising Stars at ACL 2022, which is a huge kudos to you. Congratulations for being one of this elite group. But then you told me a story of the technology failing in every way.

Diyi Yang:Yes. [Sighing.]

Chris Potts:And I tried to find the recording, but I don't have the proper registration for that ACL meeting. So I think I can't access the video. I'm sure it's not what you remember. I'm sure it was actually very successful, but I thought also people might like to hear this story of you feeling unsuccessful in a highly prominent place. Do you want to share it?

Diyi Yang:Yeah. It was not only unsuccessful, but also embarrassing. I think my experience of giving talks in this hybrid fashion is very similar to the, what do we call it, "live demo curse" in HCI: whenever you are going to do a live demo, your demo crashes.

Going back to the talk: it happened in a big auditorium in Dublin with a seating capacity of like 2,000 people or whatever. I was on stage. It was almost my turn, and I remember our session chair was Bonnie Webber. She told me that the technology might not work as expected. I was already very nervous about speaking in front of so many people. I mean, I started my career right before the pandemic, so I don't have much experience of standing up there. And I was also going to talk in front of experts in our field. So when I heard that, I was like, "Wow, technology again – what should I do?" I didn't think they were going to allow me to reschedule at that moment.

While I was nervously thinking about all these potential technology issues and what I could do, the technology stopped working. Basically, the slides were not playing on the big screen, and they were also not showing on the notes screen in front of me. So I was standing on a big stage, and I had to tell myself, "I'm going to give a 15-minute talk without slides, and this is supposed to be very representative of my career so far." I looked at all the people down below. It was so quiet, and I felt like I might need to do this like a TED talk or maybe stand-up comedy. I do work on humor, so I know how stand-up comedy works. So I started my performance, or talk, with something like, "I have the best slides in the world, but I'm not going to show them to you today."

You know what's funny? Right after that sentence, the slides showed up on the big screen. And they were not the best slides, right – just regular ones. And then I was like, "Okay, I can start my talk right now." And then the clicker didn't work. It was like a double kill. It was playing my slides from the last page to the first page, so all the secrets, all the answers to my questions, were shown to the audience before I could tell the story. It was a roller coaster for me. Anyway, I feel like this is a common thing these days, especially during the pandemic, for many of our hybrid conferences. So it was so embarrassing, but in the end I think it went well – the audience was very supportive. They understood what was going on.

I also want to connect this to another feeling. When I was a student, I was always curious why my professors always had issues with computers and presentations. Now I feel like I understand it.

Chris Potts:Oh, so it means you finally hit the big time. You've reached a level of seniority where the technology stops working for you.

But this is perfect, this story, because I think for junior people hearing this, just to know that you get nervous before talks is probably eye-opening to them, because they probably think, "Well, I get nervous, but the true professionals just stride on stage with endless confidence and give their talks and don't think another thing of it." And so I think for people to know that, you know, I get very nervous about these things, incredibly nervous. I stew about them for days, weeks, before they happen. Especially if it was something like ACL. And then just your observation that when these things happen, the audience is just instantly on your side. They're just cheering for you to like get through it. It's almost too funny that the slides appeared just when you made the joke about them being the best slides. I kind of feel like I need to verify that you're not making that up, because it's too perfect.

Diyi Yang:Oh, I'm not making it up – you can check the recording. It was so embarrassing. Right after I said I have the best slides, they just showed up – my regular slides. If I had known that, I would have said something different.

Chris Potts:But I do wish I could see the recording mainly because I'm sure – and this is like such a common lesson for these things – that it's not as bad as you remember it. That it was probably, from the audience members' perspective, kind of a normal sort of event where the technology just didn't quite behave. For you it's everything but for the audience, it was like, "Oh yeah, this is cool, Diyi is giving this rising stars talk and I want to be there for it." And the bottom line is that because the video's hard to access, all anyone will remember is that you're an ACL Young Rising Star.

Diyi Yang:Thank you.

Chris Potts:But that's a wonderful story. I hope there are lots of stories like this at this workshop, because I think that these can be so inspiring to people at just the moments when they're feeling the lowest about their research or their professional development. And to know that we all go through these things is really important.

Do you think there'll also be, at the workshop, stories of research dead ends? Because I think that's another thing – junior people think that for established folks, every hypothesis they have is validated, every project ends up in a publication. And for them to know that actually it's hit or miss for all of us is probably valuable.

Diyi Yang:Yeah, we are going to invite a diverse group of people at different stages, including students, postdocs, junior faculty, mid-career and senior faculty, and we'll try to make it representative across the world. Yeah, I'm very excited to see how it goes.

Chris Potts:Very cool. Yeah, I can't wait to see what happens for this, and thanks again for doing it.

By way of wrapping up, I'm curious to think a little bit ahead toward you being at Stanford and everything. So are there kinds of research and teaching that you want to do at Stanford that you haven't gotten a chance to before, or new areas that you want to break into?

Diyi Yang:Yeah. I'm super excited about the new journey. As you mentioned at the very beginning, there are a lot of people I could collaborate with. Overall, I think it's going to be great, and I'm just very excited. Research-wise, we have been talking about socially aware NLP. That's going to be one of my big focuses, where we want to make current NLP systems more aware of factors such as speakers, social norms, culture, etc. – so more around understanding from that perspective. I'm also a big fan of doing interventions. So I want to start a more applied line of NLP, where we do translational work to turn NLP research into systems that people actually use, especially in socially important domains such as mental health and well-being.

One project that we have been working on is how to use NLP to augment therapists or supporters in the context of peer to peer therapy. In that scenario, we will collaborate with people in psychology and clinical health to make it happen.

Teaching-wise, I'm very excited to introduce new courses. I plan to do one course on computational social science, to share with students the cutting-edge research going on in this social space. On the more NLP side, I hope to start a new course on human-centered NLP, to share the techniques and also the important problems at the intersection between humans and NLP – for example, bias, fairness, and privacy issues, user-centered design, human-centered design, and a lot of other topics. So I'm quite excited. And I really hope that I can talk to a lot of people in this preparation process to make it a good resource for students.

Chris Potts:Well, I think those will be really exciting additions to our NLP curriculum. So that's just really wonderful to hear, and there'll be tons of student interest. There was one phrase you used that I'm really interested in: "translational research." I feel like as a field we're surprisingly behind on this, given the amount of impact we're having in the world. Do you have a vision for how we might be more thoughtful about translational research in the way that some other fields are?

Diyi Yang:Yeah, I think the medical and healthcare communities are definitely doing a much better job than us in terms of different levels of evaluation, etc., so that's definitely a closely related field we should look up to. And by "translational," I mean it can happen at different levels. It does not mean that every piece of work needs to end up as a system that millions of people are using. It can stay at the problem side – even just thinking about the new problems we are going to introduce. They should not come only from the lab; they need to be grounded in people: what problems people are suffering from right now, what type of needs they have for something that can improve their lives, etc. So even just on the problem formulation side, I think translational thinking is something we should introduce – that mindset.

You can also relate it to methodology. When we think about all these fancy, complicated models we are going to introduce, can we pay more attention to the bigger ecosystem – to the carbon footprint, to whether it's worthwhile to do something with such big costs in terms of computation, resources, etc.? So there are different levels. And more importantly, when we have a system, can we actually put it into the hands of people? Can we make it work in the sense that people actually use the system?

One thing I want to mention is that in HCI, in many cases, what we've found is that you put a lot of time into designing a system, and it's beautiful in a lab setting, but whenever it comes into the hands of people, the story might be the opposite – you might observe something totally unexpected. So I think only at that stage do you know whether something is working, and you can observe all the other factors. It goes back to the interventions I mentioned earlier. So translational, to me, is more about thinking in terms of these different levels. But I am also very interested in the systems we can build. One thing in this process is the importance of domain experts. We need to work together – talking to people in psychology, social science, cognitive science, psychiatry, etc. – to have a deep understanding of the space and then really rely on each other to make the efforts work.

Chris Potts:Yeah, excellent. And another thing I wanted to circle back on is the human-centered NLP class. That's interesting to me. Is there a way to articulate how that differs from a kind of standard NLP course in terms of its perspectives or maybe the project students produce? It sounds like a class that I myself would love to take.

Diyi Yang:Thank you. Yeah, it's going to be a new course. I don't have the perfect syllabus lined up for it, but there are several topics I've been thinking about. There will be techniques we may want to cover, and there will be all sorts of issues. For example, there will be a few lectures on user-centered design, value-centered design, and human-centered design – those are all different design principles, and I think we have seen many use cases of those types of design combined with NLP systems. For example, take value-centered design: how would you get access to, or inject, those different values into generation systems? That might involve a little bit of reinforcement learning. If we think about humans in the loop, how would you leverage human feedback? We will also talk about accountability and fairness.

So that's one part. I think it may overlap with some existing courses as well. We want to talk about privacy because, for many socially related tasks, or even just broadly for NLP models these days, we know that they suffer from privacy leakage, etc. So what would be some best practices for collecting that data in a responsible way? And how could we design our algorithms to be more robust to all sorts of privacy-related attacks? So there might be a little bit of differential privacy combined with NLP. Visualization is also going to be a big topic there – for example, how we explain our NLP systems to people, helping them see what's going on with Transformers. So we can imagine certain visualizations that could facilitate that reading and help our big models reach people. There are a lot of topics, and I welcome suggestions from anyone on this. I think it's going to happen in the spring quarter, so there is still a long way to go in preparing.

Chris Potts:This is wonderful. And that last thing you mentioned reminded me: another person that I need you to meet is Jeff Hancock, who is in Communication. He and his students are working very hard on how the public apprehends AI. What their mental model is of web search, recommendation systems, models like GPT-3. That's a part of this picture that's often left out, but it's incredibly important if you think about designing systems responsibly, and also the societal reaction that we're likely to get, because that's going to be shaped by people's assumptions about what these models are doing and doing to them. And those assumptions are often wrong, but just calling them wrong doesn't change the fact that those assumptions are out there.

Diyi Yang:Yeah.

Chris Potts:And if we are knowledgeable about those views, then we can kind of anticipate them and I think get to a more productive dynamic.

Diyi Yang:Exactly. Yeah. I really like that part, because how the public perceives our technologies is definitely the other side of human-centered work. So yeah, I'm super excited about this new course. I haven't done anything similar before, so I'm still in the process of figuring out which topics might be of most interest to students and could provide something different.

Chris Potts:Very cool. Final question: Are there things that you're kind of thinking about or obsessed with now in the area of the research that you're doing? I know it's been a busy summer for you in terms of moving and being at Berkeley now and everything else, but in the back of your mind are there some research questions that you really want to tackle or that you are tackling?

Diyi Yang:Yeah, we are working on a few very cool projects this summer. It's still work in progress, but one thing related to the mental health scenario I shared with you is that we are leveraging controllable text generation techniques, incorporating psychology and psychotherapy theories, to make generation more meaningful for people and to augment their abilities in these kinds of supportive exchange sessions. That's something I'm very excited about. It's challenging not only in terms of algorithms but also in terms of how to react in different corner cases, and then whether people want to use your system – we have this beautiful vision that we can help, but do they need it? That's actually a line of research we are working on right now. And then the second thing we are doing – I have this very vague idea, still at an early stage – is to see how we could have some explicit representation of social norms, or even values, so that we can teach our models to recognize them and maybe also use them properly in certain scenarios. So that's some of the ongoing work right now.

Chris Potts:Oh, well, this is wonderful and inspiring. And I'll just say again that I'm absolutely delighted that you'll be doing all of this stuff out here with us at Stanford. It is so exciting that you're joining us in the NLP Group, and for the Stanford community in general. It's just going to be great. And thank you for doing this interview too – this was wonderful!

Diyi Yang:Thank you! Thank you for the very great questions!