Podcast episode: Adina Williams

April 20, 2022

With Chris Potts and Sterling Alic

Neuroscience and neural networks, being a linguist in the world of NLP, evaluation methods, fine-grained NLI questions, the pace of research, and the vexing fact that, on the internet, people = men.

Show notes

Transcript

Chris Potts:All right. Welcome everyone. I am delighted to have Adina Williams here. Adina is a neuroscientist by training and also in some sense a theoretical linguist. She got her PhD from NYU, working with a diverse range of people – I'd love to find out more about the committee and how all that worked – but she primarily focused on neuroscience. And then we want to learn about how she ended up drifting waywardly into NLP, where she has made lots of contributions.

One noteworthy thing is that we have a unit in this course on Natural Language Inference, and Adina is at least partly responsible for two of the three major resources that we have students focus on: the MNLI data set and the ANLI data set. And those are both exciting new avenues that we want to talk about for sure.

She's also done lots of work on things related to multilingual NLP, especially oriented around morphological parsing and things like that, and has a wide range of work that you might say engages with issues of pernicious social biases in our data.

She is now a research scientist at what was formerly FAIR, now Meta AI – I suppose you're all getting used to saying it that way. And we have lots of stuff that we want to talk about here.

So Adina, welcome! I wonder if we could just dive in. You have this new high profile paper called "Based on billions of words on the internet, people = men". What's the story there? I have lots of questions about this exciting research.

Adina Williams:Yeah. Thanks. Thanks so much for the invite. I'm excited to meet you guys and chat about my research and things happening in NLP and adjacent fields right now.

This is a really cool project, different from the usual things that I've been working on. It's in collaboration with some social psychologists at NYU, April Bailey and Andrei Cimpian. Andrei was actually also on my dissertation committee, so was part of that range of folks. And I was talking to them about morphological gender in various languages and whether there's a good amount of semantics that comes about from being, let's say in Spanish, a feminine noun or a masculine noun, even when the noun doesn't refer to a person.

So we've been talking about these sorts of complexities. And they asked a question in that talk about a particular type of bias that social scientists have been exploring for a while called androcentrism, which is the idea that concepts for people are often more closely related to concepts for men than concepts for women. "Andro", meaning men; "centrism" meaning central to the concept, prototypical of the concept. And they were wondering if we could use existing NLP tools to measure this in a new way. And in particular they were curious about word embeddings. And so we used their skillset in a lot of ways to find important words and good lists of words that were well vetted by participants for being prototypical of men and women and various other things.

We did something fairly simple – basically measure the cosine similarity between the embeddings for words for, let's say, women and words for person, and then words for men and words for person. We did a couple of other follow-ups as well, and found that in general the cosine similarity is higher between words for men and words for people than between words for women and words for people.
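
To make that measurement concrete, here is a minimal sketch of the general idea – comparing average cosine similarities between person-words and gendered word lists in a static embedding space. The word lists and the GloVe model name below are illustrative assumptions, not the vetted stimuli or exact protocol from the paper.

    # Minimal sketch (not the paper's exact protocol): compare how close
    # person-words sit to men-words versus women-words in a static
    # embedding space. The word lists are short illustrative stand-ins,
    # not the vetted lists used in the study.
    import gensim.downloader as api
    import numpy as np

    kv = api.load("glove-wiki-gigaword-100")  # pre-trained GloVe vectors

    person_words = ["person", "people", "human", "individual"]
    men_words = ["man", "men", "he", "male"]
    women_words = ["woman", "women", "she", "female"]

    def mean_similarity(group_a, group_b):
        """Average cosine similarity over all cross-group word pairs."""
        sims = [kv.similarity(a, b) for a in group_a for b in group_b]
        return float(np.mean(sims))

    print("person ~ men:  ", round(mean_similarity(person_words, men_words), 3))
    print("person ~ women:", round(mean_similarity(person_words, women_words), 3))
    # The androcentrism finding corresponds to the first number being higher.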

Since these are large-scale data sets, we've used static word embeddings so far, just because it's easier for social psychologists to build intuitions about them. We're considering moving towards more dynamic things. But we used things like GloVe, fastText, trained on large portions of the internet, which they felt was an interesting data set for compressing the social meaning of concepts as it's constructed by a large-scale community, let's say, on the internet. And so it seems like we've found a new source of evidence for the androcentrism hypothesis that's present in social psychology right now.

Chris Potts:Very cool. A couple of things caught my eye methodologically in the paper, and one is the emphasis on preregistration. That's intriguing for me because it's widespread in psychology and I think growing, but not so much on the scene in NLP. What are your thoughts about the preregistration idea?

Adina Williams:Yeah, I'm a big fan, I think it's fantastic. It has various positive aspects. Of course, it introduces accountability. We can't just run things different ways if it didn't work out the first time. And I feel like a lot of times, particularly when we're trying to train new models, there's a lot of judgment calls going into it, and it's hard to know, if you're just reading the paper, how did they come to the conclusions that they came to, how did they make these hyper-parameter choices, whatever, all of that stuff can be a bit hard. So it's nice to have some transparency, and for reproducibility reasons, be able to go back and look at exactly what the proposed contrasts are supposed to be. So that was pretty cool.

It would be neat to think about how preregistration could work in an NLP context. I like the idea. I feel like if we put it in with reproducibility, there's already a core mass of people there, so that might be a way to get more interest towards that. It's related to the whole P-hacking thing, which I think NLP hasn't even scratched the surface of yet.

Chris Potts:I think it's great, and I think that I'm always positively predisposed to believe things if it's noted that they're preregistered. In NLP, you get some similar effects from having a public leaderboard where you upload predictions and you're stuck with being on the leaderboard, whether you wanted to be there retrospectively or not. Is that playing the same role in our context, do you think?

Adina Williams:Hmmm.

Chris Potts:Because the preregistration thing is really like, "I'm going to do this, and then I'm going to do a whole human subjects thing. And then I'm going to have to do data analysis and the protocols are then wide open, and I don't want to have the garden of forking paths on those protocols." Whereas here, a lot of those terms are set for us, right, by the leaderboard?

Adina Williams:Yeah. That's an interesting comparison. I hadn't thought about it that way. But yeah, maybe that is a reasonable analogy for the leaderboard thing – well, if you have closed test sets.

Chris Potts:Yeah, yeah.

Adina Williams:But you know, not always. I think it would work well for the leaderboarding kinds of experiments, but I'm not sure it works so well with creating some new architecture or trying to measure bias in existing contexts like, let's say, social biases. It's hard to know what to preregister about. That would take a lot more conversation with the community to decide on how that should be measured. But yeah, I think that's a reasonable analogy.

Chris Potts:It only works if you've got your F1 that everyone is still climbing on. I don't know that I want a leaderboard for assessing gender bias in that nuanced way that you did!

Adina Williams:It seems tricky!

Chris Potts:I want to keep track of that issue that you just raised in passing about the closed test set, but before we leave this paper, another part that I'm curious about methodologically is the choice to use fastText, which are, I think, the embeddings used throughout the paper. And you mentioned we have contextual vectors now. Can you say a bit more about what guided that particular choice, of fastText?

Adina Williams:Yeah. It was actually a bit practical. We also looked at GloVe. So fastText and GloVe, and they went in the same direction – they were similar. But it was mostly that it's an accessible tool that my co-authors felt that they could intuitively understand at a deeper level. They weren't quite ready for the contextualized ones yet. So it was really a practical thing, though we've been talking about how to expand in various ways – some of that is moving to contextual embeddings, but also to other things like intersectionality with race and gender and various other things. So it was mostly just so we could get a handle on it and start churning through in a way that they felt they could track. Yeah.

Chris Potts:Makes perfect sense. Yeah. I was thinking about it because in this course the first unit is on word vectors, and we have a notebook around ideas from Rishi Bommasani on deriving static representations from contextual ones. And that's very successful on the tasks that we posed, like word relatedness and so forth, and so I thought maybe they're better in some sense, although I had no faith that somehow your effect would disappear, because I don't think BERT is going to be better in these regards. But I did think it might help for the nuanced experiments that you did that are looking at verb senses and how they relate, because then we would have some way to capture different senses for those verbs.
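
For readers curious about what deriving static vectors from a contextual model can look like, here is a rough sketch of the simplest decontextualized version: feed a word in isolation to BERT and mean-pool its subword states. This is a simplified assumption; the approach Chris mentions also pools representations over many corpus contexts, which tends to work better.

    # Rough sketch: distill a static vector from a contextual model by feeding
    # the word in isolation and mean-pooling its subword states. (Pooling over
    # many corpus contexts, as in the work mentioned above, is the stronger
    # variant; this is just the simplest version.)
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()

    def static_vector(word: str) -> torch.Tensor:
        inputs = tokenizer(word, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
        # Drop [CLS] and [SEP], then average the remaining subword vectors.
        return hidden[1:-1].mean(dim=0)

    cos = torch.nn.functional.cosine_similarity(
        static_vector("person"), static_vector("man"), dim=0
    )
    print(f"cosine(person, man) = {cos.item():.3f}")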

Adina Williams:Yeah. That would be cool.

Chris Potts:You had a different method for that. And it would be very heavy-duty computationally given the scale of what you all did.

Adina Williams:Yeah. That was additionally part of it, but not as big of a part as the other. But yeah, that's a good question, though. That sounds like an interesting follow-up.

Chris Potts:Really, the burning high-level question for me is: what are we going to do about this? Because the guiding thesis of the paper is that, in using the Common Crawl, we're getting a picture of society as reflected in this data, and then we see this thing, which linguistically looks actually pretty deeply embedded in the lexicon of, say, English. So what should we be doing in response to this result?

Adina Williams:Yeah, it's a bit unsettling. And I take the same approach as you that I think we should be doing something, but there are other people who take the approach that this is how the world is, we should just be modeling it, we're not to change the world. I'm more on the former side. One of the projects that's ongoing now that we're hoping to release soon, so you get a bit of a sneak peek, is looking at the effect of demographic perturbation in text. This is a huge project with lots of caveats, but just to give a little teaser, one thing one could do is take the Common Crawl as templates, and for each word bearing demographic information, say, like "mother" or something, you could swap that to another gender specification, say, "parent" for non-binary or "father" for male, or something. And then you could have a more balanced representation of semantic context in relation to gendered words and perhaps break associations that lead to things like these androcentrism effects in word embeddings. We haven't checked that yet, but that's one of the applications we've been considering – whether demographic perturbation does break gendered associations, say. So that's something we could do with the training data.
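
As a toy illustration of that kind of word-level perturbation, here is a minimal sketch. The swap sets and the whole-word regex substitution are illustrative assumptions; the project Adina describes is far more careful about morphology, names, coreference, and non-binary forms.

    # Toy demographic perturbation via whole-word swaps. Only meant to show the
    # basic idea of rebalancing gendered contexts in training text; a real
    # system needs to handle morphology, names, coreference, and much more.
    import random
    import re

    SWAP_SETS = {
        "mother": ["mother", "father", "parent"],
        "father": ["mother", "father", "parent"],
        "she": ["she", "he", "they"],
        "he": ["she", "he", "they"],
        "woman": ["woman", "man", "person"],
        "man": ["woman", "man", "person"],
    }

    def perturb(text: str, rng: random.Random) -> str:
        """Replace each demographic-bearing word with a randomly chosen alternative."""
        def swap(match: re.Match) -> str:
            word = match.group(0)
            choice = rng.choice(SWAP_SETS[word.lower()])
            return choice.capitalize() if word[0].isupper() else choice

        pattern = r"\b(" + "|".join(SWAP_SETS) + r")\b"
        return re.sub(pattern, swap, text, flags=re.IGNORECASE)

    rng = random.Random(0)
    print(perturb("The mother said she would drive.", rng))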

Chris Potts:That's interesting.

Adina Williams:One caveat there is that you still end up operating with the training data like Common Crawl as it is. And there are certain topics that are included and not included based on whose language it is. And that method wouldn't fix those kinds of topic-level biases, but it might be a first stab.

Chris Potts:That sounds great. The part that might run deeper is, for example, I've been trying for years to stop myself from using "guys" as a gender neutral plural for groups of people. And I've been unsuccessful at this. My preferred outcome would be that we just innovate some of these terms that are like "queens", where you could just say, "Hey, Queens," in a gender neutral way and balance this out. But since the lexicon of English is, as far as I can tell, completely missing the version of "guys" that is gender neutral but has its origins in referring to women, I feel like this asymmetry is persistent and that makes me feel guilty.

Adina Williams:Yeah.

Chris Potts:About saying "guys".

Adina Williams:Yep. Yep. There's a lot of historical baggage we have here. I would definitely be excited for people to innovate new terms here.

Chris Potts:Yeah!

Adina Williams:And English is not as bad as some languages. It's much harder in German and languages that don't really have any gender-neutral reference, or much less than English does. It's complex, but yeah, I think we all can do more.

Chris Potts:In the modern Battlestar Galactica, all the officers are "sir", whether they're women or men. I love that, but the even cooler thing would've been if they were all "madam" or something.

Adina Williams:That's true.

Chris Potts:Wonderful. Yeah. So everyone should read this paper. I think this is really exciting and inspiring as a mode of research, both for the preregistration part and the ambition of making these really broad claims about language and cognition.

This is a nice point, I think, to transition into ANLI a little bit. Maybe a hook there – I don't want to lose track of this question – is that issue you briefly raised of having a closed or open test set. So MultiNLI has a closed test set. As far as I can tell, culturally, that means that almost everyone does all their evals on the dev set. What do you make of that? Am I right about that?

Adina Williams:Yeah. Yeah, you are right about that. I don't know. I probably don't have as strong of opinions as other people, but the fact that people just get around it by doing eval on the dev was part of why we decided to open up the ANLI test. Also, we had a long conversation about it with the ANLI team, and some folks on the team felt that leaving it closed didn't appreciate the community aspect as much. It assumes that somebody would want to cheat on the test set, and they were like, "Who would want to do that really? We're a good community here. We can trust people not to cheat." And so I felt like that was a reasonable argument at the time.

On the other hand, it's nice to have a closed test set. Then you can be very, very sure of which one is better. So I feel like these are hard trade-offs. I've also been glad the ANLI test set is open because we can run automatic perturbation on it and just not look at it, say, and then we can do that evaluation more directly, and we can't really do that with MultiNLI. I can do that, but nobody else can, because it's not open. So that's nice, enabling additional tweaks or changes to the test sets. Nice benefit of that.

Chris Potts:One subtlety there: for this course, we have a series of bake-offs that students enter, and we moved to having them distributed with no labels. We used to distribute the labels with the test sets, and we would talk a lot about the cultural norm we have in NLP of never over-evaluating on the test set, never doing model selection and so forth. But it's actually not such an obvious thing, conceptually. And in a way, if you just want the best system, why would you put yourself through this hassle of not getting to look at your own test set? And it was difficult enough and subtle enough as a lesson that we just moved to not distributing the labels, so that there's no opportunity. It's like a Kaggle competition where everyone finds out at the end. And that complexity does worry me about how test set evaluations generally happen out there in the wider world. And that makes me favor the closed test sets. But then I see the other norm, which is that for MultiNLI, the only people evaluating on that test set are Kagglers. And when people publish, it seems like they don't bother with the test set evaluation.

Adina Williams:Yep. Yeah, they don't. One of the things that we've been talking about in the context of DynaBench, which is a project that Chris and I have been working on amongst many other people, is making it easier to do that Kaggle-style eval so that there's not an additional hurdle to actually bothering to upload your predictions. But I don't know, I feel like for a lot of people it's always going to be a hurdle to go and upload something. But yeah, I agree. I don't have a super strong opinion. So in a sense, I went with the strongest opinion in the group on each of these.

Chris Potts:We have two chat messages that I find cute, and I want to pull them out here. So one is from Adolfo, and it just says, "Say y'all." Adolfo, yes, help me out. As someone who grew up in Connecticut, I feel a little bit like a phony when I say "Y'all." It doesn't come naturally to me. We should normalize using it throughout the United States. This would be great, and could replace "You guys".

Sylvia asks a question: "Should I send out the NLP Lunch email today addressing everyone as 'queens'?" You could try it, but this would be regarded as a breaching experiment, as they say in sociology. You might get very unusual reactions and that's part of the problem. Maybe if you do it, you should give a little bit of background on why you did it! Then I think the group will appreciate it!

Let's talk a little bit more about NLI. It's a topic near and dear to my heart. How did you end up doing so much stuff in NLI?

Adina Williams:Oh, yeah, that's a funny thing. As you mentioned, in the beginning, I'd been working mostly on cognitive neuroscience with my advisor Liina Pylkkänen at NYU. And along the way, I was involved in semantics at NYU more generally, and we had a search for a computational linguist, more of an NLPer with a specialty in semantics. And they asked for student representatives to help out with that search. And so I was like, "Yeah, sure. I can help out. I'm really curious to see how those work anyway. It seems interesting." And the result of it was that we hired Sam, Sam Bowman.

Chris Potts:I've heard of Sam Bowman.

Adina Williams:Yeah. And so I had already been discussing things with Sam in the context of his search, and when he came, I was very excited. And I think he was teaching the NLU class that first time, and I was sitting in on that, and he was looking for people to help him out, semanticists in particular, with a large scale NLI data collection. And I was like, "I'm a semanticist, and I do a lot of data work from my CogSci background. If you're interested, I'd love to help out." And then I just got hooked on it. We created MultiNLI with Nikita Nangia, and the rest is history!

I've been very interested too in figuring out the small details of NLI. I feel like a lot more work went into understanding particular types of examples and particular types of failures in the earlier history, maybe RTE, but then there's been maybe a lull where everybody's just like, "Oh, I'm super familiar with this. I don't need to really ask all the hard questions about those specific examples." And so I'm very excited to move back towards those more zoomed-in questions about NLI.

Chris Potts:Do you have a spirited defense of NLI from the point of view of linguistic semantics? Do you regard it as a sensible task?

Adina Williams:Yeah. I generally start with Wittgenstein. I'm like, "Look, there's tons of history and logic that has talked about propositions and entailment and all these things." And NLI is the closest version we get to that which is also actually performable by naive participants to a reasonable extent. So in a sense, I think it's one of the best theoretically grounded tasks in the NLU canon, whatever that might be. That's also nice for me. Maybe it's more of an aesthetic thing, less of a strong knock-down argument.

Chris Potts:I feel the same way, which is: it is the task in NLP that is closest to the way semanticists think. Because it's really just context-dependent entailment and contradiction. And from that perspective, I love it. Because it's like the true nuts and bolts in the way that Ido Dagan imagined. Like, this is the heart of the heart of reasoning and language. It's not a naturalistic task though, compared to question answering, for example, which feels like something that's just a first-order human capability – ask a question, get an answer. For NLI, we have to train people, and I'm not quite sure what to make of that split.

Adina Williams:Yeah. I've also been thinking about the relation between the two for a while. I had a conversation with Sebastian Riedel, who's a colleague of ours in London, at Meta AI London. And he was like, "Oh, they're basically the same. You can just recast them to each other. Therefore, whatever, pick whichever one you like." And I'm like, "I don't know. I still think they're different in various ways." Sure, both have propositions and things like contexts – you could think of the question as a context – but thinking of them as the same is losing something about NLI.

Chris Potts:That's funny, though. Because the thing semanticists do, which is like, "Oh, I can have entailment and contradictions for questions. And in fact, many of those formal theories support those notions with question answering." And so then I feel like, "Oh, Sebastian is absolutely right about this." On the other hand, it's again, not very naturalistic, this idea that you would reduce it to entailment. Again, I feel that push and pull.

Adina Williams:Yeah. Yeah. But it's a fun conversation. I'm always excited when people ask those questions like, "What do you think? Is it the same?"

Chris Potts:One thing I think we cannot deny is that it has played an important role in NLU over the last 10 years. Because it led the way on adversarial testing, people thinking about artifacts and gaps in the data. I feel like before SNLI, for better or worse, people were not thinking about these issues. And so it led to this outpouring of work, especially on adversarial testing. And maybe you took that to the limit with Adversarial NLI. What's the origin of that project?

Adina Williams:Yeah. So that project grew out of Yixin Nie's internship with us. I think he was mentored by Jason Weston and Douwe Kiela, who you guys met last week. And Yixin was very excited about NLI in general, he had done some work on NLI stress tests. He created this stacked autoencoder, I think, that was supposed to be the next best model. This is right before the big Transformers. And we had been talking about this challenge that Jason and also Emily Dinan had run called Beat The AI for chat. They were trying to get a chatbot that was more exciting, more conversationally engaging than a human. We're like, "Well, what if we actually apply this idea to classification?" And ANLI basically came out of that connection or confluence of those various things.

So it grew out of that general "Build it, Break it" challenge idea, Allyson Ettinger's paper around that time as well. She had a workshop on Build it, Break it. I don't remember what it was called now, but it was an exciting moment for this work. Also, MultiNLI was getting saturated, the GLUE tasks seemed to be getting saturated, and we're like, "Oh, now what? What do we do? How do we measure progress?" Because I wasn't comfortable with saying that the models are doing NLI in a fully human-like way, and I wasn't fully convinced. I still think that they're not quite at human-like levels, despite various benchmarks measuring that they might be. And so I wasn't ready to throw in the towel yet on that.

Chris Potts:Right. Again, tremendously eye-opening in my view. So the way it worked – let me know if I get any details wrong – but it's like, we're going to have a round of example creation with crowdworkers where they're trying to fool a model. Then we'll train a new model on the examples we created, plus maybe some more, and do the same thing again. And then one more time. And so by the end of this you have a bunch of examples that, even through those successive rounds, you still don't have a model that can solve. And as far as I know, ANLI remains the data set that has the largest gap between our estimate of human performance and our estimate of model performance, right? Has there been any breakthrough?

Adina Williams:There's been a little, but the progress is much slower than we might have seen for GLUE. With GLUE, it took about a year or so for the whole benchmark to be saturated. With ANLI, it's been three years now. We're still at like high seventies-ish.

Chris Potts:Oh, is it that high? Okay, that's reassuring.

Adina Williams:There is progress, but it's not like 95. There's still good additional space there. And Yixin's actually working on a fourth round now.

Chris Potts:Oh, great.

Adina Williams:So hopefully he'll be able to share that shortly. I'm also working on that, but I don't know exactly where it is. The progress is there.

Chris Potts:Yeah. We did DynaSent with support from Facebook, and I think it's a similar story. Douwe evaluated all these fancy new BERT-based models, like DeBERTa and stuff, and they're all just hanging out around the numbers we put in the original paper. And we used very similar protocols to what's in ANLI, with the small change that for one of the rounds we just harvested examples that we knew would be hard based on other things happening in the context.

Adina Williams:Mm-hmm (affirmative).

Chris Potts:Yeah. There's something to this.

Adina Williams:Yeah.

Chris Potts:It's very difficult. And it's interesting. I just posted a project idea in our discussion forum for this course, which is just: pre-BERT – you alluded to this – all these adversarial tests for NLI came out. Models were stuck. And then some of them have fallen. RoBERTa, out of the box trained on MNLI, solves the "Breaking NLI" data set from Glockner et al. But I don't know that anyone has done a thorough survey of this. Do you?

Adina Williams:No. So this is one of the things we were hoping to do in the future for ANLI, in particular. We're hoping to get all of those data sets and upload them as eval data sets onto the actual leaderboard, so that whenever anyone had a new model, they could basically evaluate on all of these breaking NLI data sets, HANS, etc.

Chris Potts:Oh, yeah, yeah.

Adina Williams:It's still in the plans, we just haven't gotten it together yet. It's a lot of coordinating with people and stuff.

Chris Potts:Maybe some students in my course will take care of it. But here's a predicted key difference. For the tests I'm talking about, this was a lot like changing synonyms, changing hypernyms, swapping subject and object with an assumption that it would change the label in some way, testing them for equality – synthetic analytic tests, where there was one high-level idea. And I think those adversaries probably have already fallen with new models, even starting from RoBERTa. But I bet it's not the same for ANLI, because that was a case where human creativity was just doing its free-form thing with no centralized plan, except fool the model. So I bet there's still a split there.

Adina Williams:Yeah, it would be good to check. Maybe within the next six months we can get that one. Or one of you guys can do it and let me know what happens. Because I would love to find out. There's also been another strand of things I've been thinking about with the more synthetic data sets, because a lot of them have fallen, but it's not clear exactly why or what's going on there.

I dug back into some of the Linzen et al. test sets for subject–auxiliary agreement. And there are some really strange factors there about the perplexities of these things. I think they have to do with the length and various other things that I don't think have been adequately controlled. So making test sets synthetically, with adequate controls, is I think still an open research direction – how exactly to do that.

Chris Potts:Right. No, I just think these questions are so important to be posing, just because it's led to so many illuminating things about the nature of the progress we've been making. Yeah.

And this is a nice transition because DynaBench, which you're centrally involved with, is an open-source project to take this to its limit, right? Lots of adversarial train/test data sets. What are the origins of DynaBench, or at least your involvement? I'm not sure of the full story.

Adina Williams:Yeah. So it basically grew out of ANLI. Once it started working there, Douwe especially got very excited and was like, "We should do this on more tasks." And he started talking to you and Max and the folks at UCL about QA. And just along the way, he and I ended up co-leading the DynaBench research platform's development. And it's been huge, there's been tons and tons of collaborators, over 30, across tons of universities, different companies and stuff too. And the most exciting thing is that we're actually transitioning DynaBench to a nonprofit called MLCommons, for Machine Learning Commons, which is funded by not just Meta but a bunch of other companies in this space, so that there aren't so many weird company-based restrictions on using the tools. So it's more community-based and anyone can use it regardless of whether you're in academia or Meta; you can use it if you're at Google or anything else. And so that's actually coming through by the end of the month, and that will enable anyone to run tasks wherever they are, basically.

Chris Potts:Oh, wonderful. So how will that work practically? Suppose I was in this course, and I wanted to run a task, to give my system a hard time or create some data. What's the path to doing that?

Adina Williams:Yeah. So we have a weekly meeting, that's a decentralized lab meeting, and you would show up and say, "Oh, I have this task. I'd like to run it." The majority right now is NLP, but we're open to anything. Majority is also adversarial, but we're open to collaborative. All of these things are doable. And you would just pitch your idea and say, "Here's what I want to do." And then the engineering team would talk about whether they can support it or how to get funding for that and various things. Then we could just set up and go. Most tasks require basically no new features, and so you can just basically start immediately. Just request and do a PR, there you go, start your task.

Chris Potts:I love it. And what is the collaborative idea? How would that work with a model?

Adina Williams:So there's been some rewriter-type experiments that some of the QA folks are working on right now, where the model might seed you with a question and you could adjust it in various ways to be more grammatical, more concise, and then you can feed that into a different model. So basically there's two models here, a helper model and maybe an adversary or the evaluator. And so that's been cool with generators. You could also pair up humans and humans. That's something we've been considering. There's other work, like Nikita Nangia is exploring this in Sam Bowman's lab and actually creating examples. I think it was for QA, but might have been ANLI, with two humans collaboratively trying to write the example and convincing each other of what the most exciting, possible hypothesis might be.

Chris Potts:Oh, that's so eye-opening. So maybe the actual guiding idea of DynaBench is this interactional component with another agent. And that's the thing that leads to better data creation?

Adina Williams:Yeah. So that and dynamic. Dynamic rounds of data collection are our two main distinguishing features.

Chris Potts:Can you expand on that? The dynamic part? Yeah. That's right there in the name.

Adina Williams:Yeah. Part of the hope of DynaBench, the research project and platform, is that we'll be able to speed up the ecosystem of benchmarking. So the way it generally works is you have a benchmark, let's say, it was MultiNLI or SNLI or something, and then everybody tries to beat it. They create models, train them in special ways and have breakthroughs. And then slowly the benchmark starts to saturate. And then historically people would be like, "Oh no, it's saturating," and then they would take a year to create a new benchmark. Then at that point, release it, and then everyone would scramble with activity on that.

We're hoping that dynamic rounds of data collection would mean data collection never stops. People are always collecting data, regardless of whether it's completely saturated. And we don't have to have these weird lags where you need a year to create a data set. You're always working on it. So people can seamlessly both try to evaluate models and create new data sets in a quicker progression. So that cycle is not as slow as it historically has been.

Chris Potts:It sounds great to me. I think that's so important. Another problem it would address is the community spending years getting epsilon more performance out of the same old data set and ranking systems on what you know is a little bit of overfitting and a little bit of noise. The amount of wasted effort and compute for these things. And you could just defuse that by changing the game, right?

Adina Williams:That's the hope.

Chris Potts:Wonderful. Sterling, I want to switch gears a bit and find out more about Adina's past, but before I do that, are there student questions or questions that you want to pose?

Sterling Alic:Not at this time, but inviting the audience to, please, if you have any questions, send them over.

Chris Potts:All right. Could we rewind a little bit, Adina? I'm curious to know what it's like to be a neuroscientist. I have no knowledge of this domain.

Adina Williams:Yeah, sure. It's fun. I don't find there to be that much difference in some ways between doing neuroscience work and doing evaluation of neural networks. In the neural networks case, you can actually sever more things and stuff and see if it breaks, but there's at least a broadly analogous shape to the kinds of problems that you ask. In both cases, there's something that you don't really understand generating an output to an input. For at least neuroscience, the way I was doing it, I was doing magnetoencephalography, which is a non-invasive measurement technique for basically capturing magnetic fields as they extend outside your skull.

So you're thinking, your neurons are firing, and electromagnetic fields are being created. And depending on which ones are firing, the fields will be different in strength, different in location, and they do extend outside your skull, so you can just basically collect them with the magnetoencephalography machine. And so that would be the output – these fields. But the input – at least, I was working on vision – would be some words on the screen. People could read like, "Red boat," or just "boat" or something. And you ask, "What's the difference between boat and red boat?" (This is a classic example from my advisor.) One is a more specific concept than the other – "red boat" is more specific, and it's a type of boat.

Is there a difference in an individual person's generated fields in reaction to viewing these things? There are a bunch more complications with neuroscience in terms of time. At least with magnetoencephalography, people might spend up to 400 milliseconds thinking about boat or red boat, and there are various processes ongoing. Seeing the shapes of the letters, realizing that there's a word, figuring out what the meaning of that word is. And they're sequenced in time. That's different than the way our models work now – they don't have a straightforward time dimension in that sense. So in broad strokes it's the same interesting question: how does this black box work, how does it react to the inputs we give it, and what does that tell us about its internal workings?

Chris Potts:The black box being the human brain? Mind, brain.

Adina Williams:Yeah. Human brain or the RoBERTa model. Either one!

Chris Potts:That's right. Oh, wait. So does the term "neural network" make you wince or are you perfectly happy with this?

Adina Williams:They definitely don't refer to the same thing. I've gotten used to it.

Adina Williams:It's just two senses.

Chris Potts:Two senses? Right!

Adina Williams:Yeah.

Chris Potts:Your interest in neuroscience goes all the way back to your undergrad work, right?

Adina Williams:Mm-hmm (affirmative).

Chris Potts:And then, so I'm guessing you went to NYU specifically to do this with Liina?

Adina Williams:Yep.

Chris Potts:And then do we have Sam to blame for you becoming an NLPer or what's the deal there?

Adina Williams:Well, both, yes, simply, yes. But also the practical day-to-day for neuroscience is much slower. So NLP is pretty quick. You have an idea, you can just test it out. You can get a nice answer that you feel confident about often, and then you can publish it, present your work, get feedback. That cycle is really quick. For neuroscience, it's like three years on one paper. And so it's also just a personality thing – I wanted to move quicker, have answers quicker. And I sometimes felt somewhat stuck. I was like, "Oh, my God, I've spent three years on this. What if it just doesn't work? Or no one likes it, or at the end of it, for some reason I find something out and I don't believe it anymore? Like, ugh." It's just a much higher barrier that you have to overcome. I found it more stressful. It's like, "Oh, three years is a long time to be doing one thing."

Chris Potts:Sure.

Adina Williams:So part of it was personality, I think.

Chris Potts:Are there things that you learned as a neuroscientist that you apply in your NLP work?

Adina Williams:I think one of the things you learn in neuroscience is to create a really well-designed experiment, and consider a bunch of possible confounders. And I think that skill is very useful in, let's say, trying to figure out how models work on the NLP side – how do you actually create a good comparison, like a strong baseline, and how do you actually create a well-founded experimental design? I feel like that was definitely brought over.

Also, just generally, I like Meta AI. People there are very interested in how humans process concepts, how they speak, all this stuff. And people have very little familiarity with any of the actual research on this, and so I fill a happy niche where I can be like, "Oh, my gosh, you should read this paper. This is totally what you want. It's in psychology and you'll love it." So that's been another plus – bringing those cross-disciplinary or interdisciplinary connections together often really helps people get inspiration for new projects and come up with new ways to measure their stuff or train their models or whatever it is they might be doing.

Chris Potts:Sure. Your mention of the time it takes the human brain to do things, and that being important, that got me thinking. In your view, is there anything about the human brain that would mean that neural networks are just never going to achieve what the brain achieves because it's some physically instantiated biological entity?

Adina Williams:Oh, I don't know. But I can bet in one direction or the other.

Chris Potts:Oh, I love a bet.

Adina Williams:At least for now my bet is that the way we're doing things will not reach full human-like AGI. Maybe I'm just a skeptic. Maybe it's possible that we can learn everything from data with completely different innate constraints. But I think we can't. I always have a fear though ... I have these conversations with people where we're playing around with the definitions of particular terms like, "Does this count as human-like?" I worry that we're really squishy about these terms when it helps. And so it's hard to really know: what will the model have to do for us to think that it's doing equivalently well to the human brain? And I think that's hard. It's not decided. People will vary. Some people are like, "Yeah, we're already there." Or, "Oh, it's super human on this." Even though that's more than human-like or whatever. I don't know. I'm just skeptical.

Chris Potts:I share all those concerns at some level I think. Although, I'm open-minded. But what about when you think about the human brain, which I know nothing about, but you do. So when you think about the human brain, is there any aspect of it that makes you think, "Unless our machines start doing this, distributing chemicals in a certain way, they're going to be non-starters for cognition"?

Adina Williams:Yeah. I'm a fan of the consciousness argument. We don't know what that is on the human brain side either, but I think that unless we have models that have something like consciousness or something like conscious goals, we're not going to be very close there. Because people generate speech or text for reasons. They want to convey something. But the model doesn't have any goal, really. At least, the current models are just generating because that's what they predict the most likely outcome ought to be. So that's the one that I'm hoping for. Until they get consciousness, I'm not going to necessarily be on board.

Chris Potts:But the goal could be an emergent thing, right. And so my confidence there would only start to waver if the real origin of goals is some very primitive biological reward signal that does come from chemicals spreading in our bodies and causing certain physiological things in us. And that unless we actually bake those into a neural network, it's never going to know what it means to feel pleasure and pain and therefore never achieve real cognition. But I wasn't trying to put words in your mouth. So none of that resonates with you as a neuroscientist?

Adina Williams:Maybe it's a bit tacky to reference Marr's Levels – everyone seems to do this. But I take more of a Marr view that we can abstract away from the actual implementation. But yeah, that's a good point that it won't know what pain is unless it's experienced it. And I think that that's a reasonable argument as well. Maybe it's just that those kinds of concepts aren't the ones that I'm most excited about. My interest is in the really abstract stuff, like: can it learn "and"? Can it do a reasonable job of the multiple different meanings of "and"? That's the functional stuff I'm interested in. So, yeah. Temporal precedence maybe. Can it learn temporal precedence? That time has an order?

Chris Potts:I'm glad you mentioned "and". "If", "every", "most" – semanticists' best friends, formal semanticists. That prompts a question that I wanted to ask you, which is: is this a good time to be a linguist working in NLP or is it a low point? What's your feeling?

Adina Williams:I like being a linguist in NLP. I think people are very nice and a bit more inclusive of formal linguists than I've heard reports that they used to be. So that's nice. People are starting to re-appreciate linguists. They're not just like, "Oh, yeah, all we need is more data. That's going to solve it." There's always some people like that, but I think the number of people who do want to hear about the domain knowledge that linguists have is increasing or is high. I feel very included and appreciated in this. So it's nice in that respect. Socially, it's nice.

I also think there are some interesting challenges to things that we have assumed as linguists. So the fact that our models are beating these benchmarks for things like SNLI, so handily ... Well, maybe not ANLI, but the previous ones. I think that does present some challenge to linguists, particularly maybe to the Chomskyan linguists, who have certain assumptions about the way humans acquire language and what it means to acquire language and all this. And I think these questions are more ripe now in a sense, because the models are doing better. So there's more work for a linguist to do. So that's pretty cool.

Chris Potts:Oh, I completely agree. Because then my question has two sides, which is, are the NLPers being nice and receptive to ideas? But the other part is just that, as a linguist, is there any interest in looking at NLP for you? And I would say boldly that, until recently, the answer was no. For example, the biggest and best n-gram-based language model might have had engineering value, but it was not going to teach linguists anything or a psycholinguist for that matter. Whereas GPT-3, for the reasons you mentioned, has some intrinsic interest as a device that learned only in a data-driven way to do some things that look like language. And so now I feel like I want to tell my linguist colleagues like, "No, pay attention to these developments. This could be important for you."

Adina Williams:Yeah. I think one trick there – I talked to Marco Baroni about this question of getting linguists excited about NLP – is that some linguists are just all or nothing. They want every phenomenon, every example to be perfectly generated, perfectly correct before they take notice. And I think it's going to be hard to convince those people. But another way that I've been trying to excite linguists is to measure things using NLP tools that they might be interested in.

So Richard Futrell has been doing a lot of cool work on this from an information-theoretic perspective; I've done a bit, Ryan Cotterell as well. Trying to get at, "If we use the text in, let's say, Common Crawl or whatever, as a sample, can we find interesting information-theoretic correlations between different quantities?" The one that we did most recently was gender and the meaning of nouns. So does the embedding of the noun tell you something about the gender class that it's likely to fall into, even in multi-gender systems, and this sort of thing? Which is more about the phenomenon of language itself, the meaning of various pieces. I think that excites a certain kind of linguist and gets them more involved, since that's about the actual phenomenon they study. But I think they should take those in either case.
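
One minimal, hedged sketch of how a question like that can be operationalized – not the estimator from the published work – is to discretize the noun embeddings and estimate the mutual information between the resulting clusters and the grammatical gender labels. The placeholder arrays below stand in for real embeddings and gender annotations.

    # Hedged sketch: does a noun's embedding carry information about its
    # grammatical gender class? Here we discretize embeddings with k-means
    # and estimate mutual information between cluster membership and gender.
    # The random arrays are placeholders; swap in real embeddings and labels.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import mutual_info_score

    rng = np.random.default_rng(0)
    noun_vectors = rng.normal(size=(500, 100))             # [n_nouns, dim] embeddings
    gender_labels = rng.choice(["masc", "fem"], size=500)  # gender class per noun

    clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(noun_vectors)
    mi = mutual_info_score(gender_labels, clusters)  # in nats; > 0 suggests signal
    print(f"Estimated MI between embedding cluster and gender: {mi:.4f} nats")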

Chris Potts:In all seriousness, the reason I started doing NLP, or something that looked like NLP, was that I wanted to know what swears meant. And the only way I could figure that out was by looking at a lot of data and the best toolkit for that is the NLP toolkit. And then I was off and running. Yeah.

Adina Williams:Yeah. Cool.

Chris Potts:That's cool. A related question that was on my mind, because you've done a lot of work on cross-linguistic stuff, especially related to morphosyntax. Is it a low point for doing that kind of thing? Do contextual models make it unnecessary to have a good morphological parse?

Adina Williams:I think in the sense that the generations are so good that they're able to do everything surprisingly well, but I think there's also an opportunity for making better tools for a lot of these languages, tools that don't exist yet, and an opportunity for controlled generation, say, of different kinds of sentential variants – that I think is really exciting. So if you have a sentence with a particular tense and you want to rewrite it in a different tense, how do you get that ... at least for systems like Georgian that have a huge parameter space? How do you train on a small number of examples and actually get it to generate all of the possible combinations of morphological features? This is not solved. And these are really big combinatorial spaces. There's some cool work to do there, for sure.

Chris Potts:Well, that's a great example. Again, I feel this tension. On the one hand, there's a lot of multilingual NLP happening now. And a lot of it is a result of, for example, machine translation being good, so we can heuristically translate, and also having multilingual embeddings. But the misgivings come when you observe that like, "Well, all these systems perform less well than they do on standard English." And that's evidence that not all these languages are on an even footing. And then I worry that it's going to result, though, in people not doing focused work on Georgian. Because they're like, "Oh, my multilingual BERT will do something, and isn't that enough?" Yeah.

Adina Williams:Yeah. I don't think it is enough. I think we need more. I'm sure you agree. There are some cool shared tasks going on right now for SIGMORPHON and some of these sub-tasks like morphological reinflection that are really interesting. They're also doing one this year on nonce words, which is cool. Multilingual nonce – how does the model handle multilingual nonce words – and that's another exciting direction in the morphological space.

Chris Potts:What kind of nonce word?

Adina Williams:So they generated them on a by-language basis. So they generated both nonce verbs and nonce nouns with task owners of those particular languages. And then they're trying to inflect them all and stuff.

Chris Potts:Is it like "Corvette your way across the USA" or is it like "hangry"?

Adina Williams:It's like "blicket" or whatever.

Chris Potts:Oh, completely?

Adina Williams:It's completely nonce.

Chris Potts:Oh.

Adina Williams:Yeah. But phonotactically licit for the particular languages. They also have some human experiments to vet them and stuff. I think they asked how likely is this to be a word in your language, or a verb in your language.

Chris Potts:Oh.

Adina Williams:And so it's pretty cool. I like that one.

Chris Potts:And so is part of this to see whether these big modern models implicitly generalize the way people do to these unusual inputs?

Adina Williams:Yeah.

Chris Potts:I love that. That's again, a question that was pointless to ask even 10 years ago. You're going to ask your unigram-based classifier whether it recognizes a nonce feature? The answer is no!

Adina Williams:Yeah.

Chris Potts:What about these new models? That's cool. Yeah.

Adina Williams:Yeah. It's very cool. They don't do perfectly so far, but now it's a shared task. Maybe somebody will do well. The results should come at NAACL. So keep your eyes peeled for that.

Chris Potts:And what else could we be doing to further multilingual NLP? What are we going to do? There's some inherent roadblocks, like it's expensive to create MNLI for Georgian, right? What can we do?

Adina Williams:Yeah. We had another NLI data set on this multilingual stuff. I was a very middle author on this one, but Alexis Conneau was leading it.

Chris Potts:Oh, yeah.

Adina Williams:On trying to transfer both to languages that you have some fine-tuning data on, and languages that you don't, so trying to bootstrap from languages to others in their family and various other things. So one could do this in-between thing with translation – use an automatic translator to translate test sets, and then use a human to translate. Basically you can get far on automatic translation, in terms of measuring multilingual ability. But I think at the end of the day, we are just going to need to pay the money and include more people who are experts in these languages.

There's a new effort from this group at Meta AI that's called FLORES. I don't remember what it stands for, but it's a largely multilingual effort and they're trying to do a shared task for WMT on particularly African languages. So they're taking ones from the Horn of Africa, basically all over. They're sampling some and trying to get enough data to actually have a reasonable translation task on these historically underappreciated language families.

Chris Potts:Oh, very cool.

Adina Williams:So I think having companies just foot the bill is probably the best way to make progress there.

Chris Potts:And hey, you just reminded me that I misspoke. So you're also an author on the XNLI paper. Another major paper in NLI and a big milestone in terms of pushing people in exactly this direction. It's all well and good that you can do NLI for English, but what about doing it in Turkish and, I forget, the 14 or so other languages that are in that data set. Wonderful. And that's a nice thing from a modern perspective, because it's just dev and test, right?

Adina Williams:Yep. Yep.

Chris Potts:So that's pushing people to think creatively about training maybe on unsupervised data, maybe few-shot, and doing well on the evaluation. And if the whole community moved there, it would be much less expensive to create an NLI benchmark for Mongolian or whatever the language you wanted to look at was.

Yeah. Sterling, yeah, you've got some questions?

Sterling Alic:Yes. We've got some great questions. First one that I have here is asking about, "How much would cultural differences influence the creation of cross-linguistic NLI data sets?"

Adina Williams:Yeah, that's a great question. I don't think we have a straightforward measurement, but this is a question that some work on Chinese by Hu et al., 2018, takes up. I think they did a good job of laying out the fact that this is a problem. They created a Chinese data set from scratch in a Chinese context, really focusing on whatever cultural topics their annotators felt were important to include. And they've leveled the exact critique that you're pointing at against XNLI, saying, "We really actually need culturally competent data sets collected in a source language for all of the other languages we might need."

How much of a problem it is, is still an open question, but I'm sure that it's at least somewhat of a problem. So that's a good question. It would be great if we had both. I think we can do both – have these translation-style test sets and also have actual data collected with the culture of interest from scratch. So I think both together will get us a good way there.

Sterling Alic:Awesome. And then we have another question from a student who has a lot of interest in language acquisition in early childhood. And they say that there are several people working on Curiosity in AI attempting to create systems that can learn as children do with minimal labeled data and a few-shot approach. Are there any influences in the neurological or the neuroscience aspect of children learning that can be applied to models, and how do you feel about the possibility of creating language systems and models that can learn on their own as children do?

Adina Williams:Yeah. This is a huge question. This is actually something too that Yann LeCun has been talking about or thinking about a lot. It's not just the amount of data that separates humans from models, although that's definitely part of it. But the interactional component, you're right, is super important. They put their feet in their mouths, they do all sorts of stuff, they touch things, smell things. I think it's a very exciting approach to have models learning through interaction as children do.

We've had a workshop called "Learning to learn through interaction", I think – two instances now, and there's one more coming up – that asks similar questions. But nobody has done a baby-style approach in a way that I've been convinced by. One issue is that you might have to solve the hard problem of embodiment first – how do you get the model to see, or play with the affordances of things, to learn language from that bootstrapping. I think we might be getting there soon though, because multimodal modeling is also maybe the hottest thing right now. I don't know if that's what you guys think, but I've been getting all sorts of interactions from people on this. Lots of questions, lots of excitement. So I imagine this is only going to become more popular in the next year or so.

Chris Potts:But I have to ask then Adina ... I love that. But, in your view, do I need a robot arm on my language model to actually touch things and get sensor feedback or would it be enough to have just a stream of symbols representing that sensor information and then do a language model style training as usual?

Adina Williams:Yeah. So I would say we probably don't need the sensory arm, because I think that the language module is more modularized than people generally say, which means basically language is largely a standalone module in the brain. There's a lot of arguments back and forth. So that's just my take on it. But that means that the interactions with sensory systems are more restricted. So they don't necessarily completely pervade and completely intermesh, but there are interfaces between vision and language, between vision and touch, that are all distinct. So I think we could probably figure out a way to do that. I'm not sure what the input would look like.

Chris Potts:Let me give you another example, because Douwe mentioned this. I haven't read the paper, but apparently Douwe has a paper on an olfactory bag-of-words model.

Adina Williams:Right. Right.

Chris Potts:So let's suppose you gave a model a bunch of sensor inputs that corresponded to the smells of things, in the context of it learning from lots of other symbols – language, maybe other sensory inputs come in too – but those smell inputs are included, a very rich stream of them. It seems to me in principle possible that the model would come to understand what that sensory input meant. And in some sense, come to smell, despite having no nose. If you gave it the sequence corresponding to the smell of a cup of coffee, it would behaviorally be just like us, but of course it would have no nose. This seems in principle possible to me. Do you share that view?

Adina Williams:What would the coffee smell embedding be like?

Chris Potts:I don't know – whatever they did in that paper, a bunch of chemical symbols, but maybe it would even be more low-level than that? But whatever it is ... I don't know, something that stands in a homomorphic relationship to the actual things that come into our nose. I don't know enough about smell to be smart about this, though.

Adina Williams:Maybe. It would be interesting to try. I've always wondered if he would recap that olfactory experiment.

Chris Potts:Or it has no arm, but we give it sensory things that correspond to things being firm or soft. And then when you feed it a bunch of symbols, it says, "Oh, that's a soft thing." It has no arm. It has no sense of touch, but yet ... it does!

Adina Williams:Yeah. It's interesting. It's like it thinks it does. It's getting very philosophical. But maybe. I think that we could maybe create a brain in a vat model that has these sorts of inputs that are coded as a touch versus something else. It'd be interesting.

Chris Potts:I think the brain in the vat is the limit of the thought experiment that I'm trying to run.

Sterling, do you have any more questions to save us from this philosophical morass?

Sterling Alic:Yes, I do. We might have already addressed some of this earlier, but I still want to raise this anyway. You mentioned earlier that you're excited about the zoomed-in questions of NLI. What are some exciting future directions of those zoomed-in questions for NLI research?

Adina Williams:Rough! There are many! I'm really interested in scope ambiguities right now, so I would really like to see some interesting NLI work on scope ambiguity. The string is exactly the same under both readings – something like "A cat is in every tree" – and the question is, "How many cats are there?" Is there one single cat that is somehow in every tree, or does each tree have its own cat? What are the different options here? And how do you then get the right entailment relations from something that's ambiguous? Which readings do you want to go with? So in general, I've gotten excited about ambiguity in NLI, because I think there's a reasonable amount of it. There's a lot of work from people like Ellie Pavlick and Tom Kwiatkowski, and also Yixin Nie and Mohit Bansal and some others, on human disagreements in NLI and whether they can actually be resolved or whether they're deep disagreements.

One that Ellie's paper picked up was, "Is a barbecue a machine?" And that's just like, people vary. Some people are like, "Yes, the barbecue is a machine." Some are like, "No, it's something else" – some other kind of tool. And so that's something you can't really get rid of; it has to do with maybe the language variety somebody speaks, or something else. But there are probably ones that are reducible – ambiguities that are reducible. I've been curious about that. How do we deal with those ambiguities? Remove the ones that are reducible if we can, and measure the ones that aren't. And just generally have an NLI model that's more robust to that. That's a tricky thing.

Chris Potts:What's the right label entailment-wise for, "Every student chewed a piece of gum?"

Adina Williams:What's the-

Chris Potts:There's a normal scenario where everyone gets passed a different piece from a pack, and a disgusting scenario where they share the same piece of gum.

Adina Williams:Where everyone is chewing some ... shared gum? Both are valid, right?

Chris Potts:That's why I was asking, yes. Because would you want the data set to reflect all readings, or would you hope that the context you created disambiguated it, maybe because of the commonsense? The commonsense would disambiguate in one direction, right?

Adina Williams:Yeah, yeah. That's interesting. I was thinking more about context disambiguation, and what kind of context that would be. And so, yeah, you could have an underspecified context, but I was hoping that you could actually specify the context somehow. Because it's easier to read into the model and stuff.

Chris Potts:And that would be lovely because you could then probe a model to see, if I change the context or I change the noun phrase, does it get the predictable effect on scope? And I feel like there's a whole playground in there, where you could both test for artifacts – where it has just memorized that this noun tends to take narrow scope, and then fool it with a subject where it goes wide – but also just see whether it's actually solving the problem. Seems great. So you mentioned people are working on this?

Adina Williams:I have an intern who's going to work on this.

Chris Potts:Oh, wonderful. That's great.

Adina Williams:Yeah. Her name's Jennifer White. So hopefully we'll have something exciting to share in a couple months.

Chris Potts:Very cool.

Adina Williams:But if anyone else is interested, happy to talk about it. I think there's a lot of space there indeed.
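The kind of probe Chris sketches above can be made concrete with a few lines of code. Below is a minimal sketch, assuming the off-the-shelf roberta-large-mnli checkpoint and hand-written hypothesis sentences that target the two scope readings – an illustration of the idea only, not the setup Adina's intern is working on.

```python
# Minimal scope-ambiguity probe for an off-the-shelf MNLI classifier.
# The premise/hypothesis pairs are illustrative and hand-written.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
model.eval()

premise = "A cat is in every tree."
hypotheses = [
    "There is a single cat that is in every tree.",  # wide-scope "a cat" reading
    "Each tree has its own cat in it.",              # narrow-scope reading
]

for hyp in hypotheses:
    inputs = tokenizer(premise, hyp, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze()
    label = model.config.id2label[int(probs.argmax())]
    print(f"{hyp!r}: {label} (p={probs.max().item():.2f})")
```

Swapping in different subjects, or prepending a disambiguating context sentence to the premise, would then test whether the predicted labels shift in the expected direction rather than just reflecting memorized scope preferences.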

Chris Potts:Sterling, other questions before I switch gears a little bit?

Sterling Alic:None here. We can switch gears.

Chris Potts:All right. So, Adina, I'm just curious to know about your life as a researcher at what is now called Meta AI. And I assume you're all getting used to actually saying that. Because I can tell that you're taking it seriously. The name did change. I don't know how or where to start, but if there's a typical day for you, what's a typical day like?

Adina Williams:Yeah. So, I'm not sure there's a typical day. But I'm happy to talk a bit more about what we do.

Chris Potts:What about today? What'd you do today?

Adina Williams:Oh, today. So Wednesday is the no-meeting day, which is maybe a typical week thing. And so everybody blocks off their calendar to do things like write talks, create their drafts for papers, set aside time to run some of their experiments. So Wednesday is a free day where you catch up on all the stuff you had to do from the week. But generally the other days have a good amount of meetings. Mostly those are just research chats about particular projects. I'm a research scientist, so there's usually one or two research scientists in a meeting. And some number of other folks, probably a research engineer, maybe a resident or an intern, someone who's learning the ropes. And we will basically plot out experiments, plot out questions and do research most of the time.

There's also some other things that we do, like reading groups. It's very similar to academia in some sense. So we'll come together, read a paper, talk about it. What do we like? What didn't we like? Or people will present their work and we'll give them some feedback or ask them questions. So there's a good amount of that. There's a lot of free coffee, which is very dangerous. You end up drinking a lot of free coffee. So that's nice. But yeah, I think mostly it's pretty similar to the day-to-day of an academic.

Chris Potts:But wait, I'm having a complex reaction, because what you described is all the best aspects of my job, the part that I dream about, but it is sadly not the life I actually get to lead. So can you reassure me, did you actually hide a bunch of other boring stuff that you have to do that fills 45% of your time, or am I in the wrong line of work here?

Adina Williams:There is some boring stuff that fills a good part of the time. I don't know if it's quite 45. One thing that I miss is I don't teach anymore. I can. Occasionally, I'll teach a summer school course or something, or give guest lectures here and there, but I don't teach any courses. It's a bit sad. So that's one thing that's better about academia.

The boring things – we do have things like faculty meetings occasionally. We've been talking a lot about the trend of releasing large models and how one can do that in a responsible fashion. So I've had many, many conversations about that. What do you do in the beginning to make sure that thing is safe? How do you transparently discuss those efforts? How do you encourage people who are outside to audit big models? And this is an ongoing conversation that takes up time. And everyone's very opinionated, so it can get a bit heated.

Chris Potts:I would love to have that be the topic of one of my faculty meetings. I assure you, we talk about much more boring things almost the entire time.

So wait, so how do I get hired onto your team then? Or this is really for students. Yeah. What do you look for in people who are going to join your group?

Adina Williams:Yeah. So we are always looking for interns. So if you're interested in doing an internship with us, you can just go to the website and submit your CV – those get looked at, everybody's gets looked at, and relevant people get routed to us. So we're always looking for interns, and we get a huge crop of interns every year. It'd be great to have some of you guys, if you're free. Other than that, Meta AI – the broader AI group – is hiring. I'm in the exploratory research sector, which does a lot of this research stuff, but other parts of the group do support products, innovate new products, and that sort of thing.

So the whole thing is hiring. Part of the Meta rebrand is to try to kick off this Metaverse thing. We don't know what it is yet, because it doesn't exist, but people are very excited. And so there's a lot of hiring going on in NLP and in other modalities like vision and stuff to figure out what that's going to look like, how it works, hammer out the details there. So yes, we're hiring a lot, basically at most levels – from IC3, which would be entry-level engineering, basically all the way up. There are openings.

Chris Potts:Is everyone on your team in New York?

Adina Williams:No. Meta AI has various locations: New York, Montreal, Menlo Park, Seattle, Paris, London, Pittsburgh. I'm probably missing some. But yeah, there are quite a few around.

Chris Potts:In the morning you put on your VR headset and you meet with all these people in the Metaverse now?

Adina Williams:Not quite yet. But maybe, eventually. It's just VC for now – like this, basically on Zoom. But yeah, a lot of my colleagues are in New York. It's easier to collaborate with folks that are nearby, but there's no restriction on collaborating with people from anywhere.

Chris Potts:And are you all back in the office now? I know you're going in for coffee, I guess. Is it mostly still remote?

Adina Williams:Yeah, we all returned to the office, at least in New York, perhaps also in some of the West Coast offices. And people generally go in at least three days a week. And it feels very normal. Actually, we don't even have a mask mandate anymore. Which: pros and cons, but yeah, it's feeling much more normal now. So people are actually in the offices, hanging out, having in-person meetings. The roof opened up again, which is exciting – there's a garden and you can see New York from it. So it's been good. Yeah.

Chris Potts:Sterling, are there any other questions? Yeah. Go for it.

Sterling Alic:Yes. Got one more question. With how fast this field moves, how often do you find your research direction shifting with news and publications from other organizations? For example, DALL-E 2 just released – what do you do? How do you react?

Adina Williams:Yeah. It's a great example. I've been thinking a lot about DALL-E 2, because we actually got slightly ahead of that one this time. But it goes either way. Sometimes they're ahead of us, sometimes we're ahead of them. But we had recently created a handcrafted benchmark for vision and language to test sensitivity to word order. So there might be caption pairs like "there's grass in the cup" and "the cup is in the grass" – they have the same words in them, but in a different order, and they have completely different images associated with them. And someone from the DALL-E 2 team has actually been evaluating on our benchmark. So we got that in at exactly the right time. It's called Winoground, for "Grounded Winograd Schemas". It's not quite a Winograd schema, but it just sounded too cute, so we went with it.

In general, I try to pick things that aren't scoopable, but there are so many people in the field that everything is scoopable. So I think the best way to not have your heart broken is just to go with the excitement of finding out the answer, and go with whatever project you find most exciting in that respect. And I feel like this is a hard thing to say to folks as they're just starting out, because there's a high barrier to entry in AI fields. People are coming in with so many publications all the time. It's hard to just search out the thing that you find most interesting and go for it, because you feel this pressure to publish and stuff. I feel like there's just constantly more pressure now. But at least in the last couple years I've been in the field, going with what your heart finds interesting has led me to the least amount of sadness. I don't know if that's the most actionable suggestion, but yeah.

Chris Potts:Really nice piece of advice. I have to ask, so Winoground. That's wonderful. I'll make sure the students have a link for that. What do you make of the fact that DALL-E produces pictures that suggest it has a really solid understanding of intuitive physics? I saw one of an elephant tipping over, an astronaut riding a giant space turtle. And the striking thing is that it knows how things would be configured, but it doesn't know what it means to put a blue box on top of a red box and have them be next to a yellow sphere.

Adina Williams:Yeah.

Chris Potts:Is that a language problem? Is that an opportunity for us?

Adina Williams:I think it is. I think that we can't say that DALL-E's performing super well until it can actually distinguish these things, personally. I am really impressed by it though. I see the examples and I'm like, "Whoa. How'd it do this? What?" But I think one interesting problem at the intersection of Winoground and DALL-E 2 that I've been thinking about is, how do you actually evaluate whether it's been successful? If you have this example of "the grass is in the cup" and "the cup is in the grass", you can create many images for that. Let's say you have K=10, and one of them is a hit, but the rest aren't. That's not great. Ideally all of them would be hits, but how do you decide what success should look like there? Because it seems like generating any hit at all counts as a sort of success – that was surprising to me. And that's, I think, where a lot of the feeling comes from, like, "Whoa, it's this good. It did this." But then if you take the full K, it's like, "Oh, but only one of 10 actually did." How we should best evaluate there is still ... I don't know what's best. But yeah, hopefully our data set will help people realize this is a problem and innovate on ways to improve this stuff. Yeah, we got a shout out from Gary Marcus on that, in his thread about how "DALL-E is not working with this red block on the yellow cube business."

Chris Potts:For thwarting the popular system, you got a shout out from Gary Marcus?

Adina Williams:Big surprise, but it made me laugh.
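To make the "one hit out of K" evaluation question concrete, here is a rough sketch of one possible scoring loop: generate K images per caption, ask an off-the-shelf CLIP matcher whether each image is closer to the intended caption or to its word-order-swapped twin, and report the hit rate. This is only an illustration of the problem being discussed – not the Winoground protocol or anything the DALL-E 2 team ran – and the checkpoint name, captions, and file paths are stand-ins.

```python
# Sketch: score K generated images for one caption pair with a CLIP matcher.
# Checkpoint, captions, and image paths are placeholders for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

intended = "the grass is in the cup"
swapped = "the cup is in the grass"
image_paths = [f"generation_{i}.png" for i in range(10)]  # K = 10 hypothetical outputs

hits = 0
for path in image_paths:
    image = Image.open(path)
    inputs = processor(text=[intended, swapped], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image.squeeze()  # shape: (2,)
    if logits[0] > logits[1]:  # image matches the intended caption better
        hits += 1

print(f"hit rate: {hits}/{len(image_paths)}")
```

The open question Adina raises is what threshold on that hit rate should count as success – any hit at all, a majority of the K generations, or all of them.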

Chris Potts:Sterling, other questions?

Sterling Alic:None that I see in the chats, no.

Chris Potts:All right. So, Adina, maybe by way of wrapping up, I actually really like the phrase that you used before, which is essentially for projects, "Follow your heart." But to be a little more specific: so the students in our course are about to begin developing their own original projects, and we always have an ambition that that will be a stepping stone to them trying to publish or at least disseminate whatever they learned to the community. Do you have some advice for them in a very general way about planning, question choice, methods, whatever is on your mind, whatever you would say to an ambitious group of students trying to make a contribution in our field?

Adina Williams:I think the most straightforward way that I've found to get an impactful project is to take a paper that you think is really great and that you really like and dive really deeply into it. Because there's always something that's not exactly how you would've done it, or something that opens a new route. If you narrow the scope a bit to this one project – like, "Oh, this is broken in this way, or this is cool in this way" – I feel like that's a bit more manageable than all of NLP: "I'm interested in translation." It's like, "Whoa, how are you even supposed to get started on something like that?" But if it's like, "I'm interested in this paper that does this cool thing on translation. I want to extend that to X languages or to this new domain," I feel like that's a more easily scopable thing when you're just starting out.

I don't know, in terms of picking a good question, it's just hard. If your question is too good in a sense, you'll get scooped, because everyone else has had it. But I think the best way to figure out what's a good question is to get feedback from folks who are already in the field. They can try to summarize some of their experience on what good questions have been to help you choose that. But that's the hardest part of doing a research project, I think, is framing the question right. People disagree with me on this. Other people are like, "No, it's the implementation that's the hardest part." But no, I generally think the question's the hardest part.

Chris Potts:I agree, but I reckon that the answer to this was latent in your first piece of advice. Because if you pick a good paper in our field, and you read it deeply, you're going to find some limitations and those limitations are very likely to be important. And so I feel like that's a recipe for pretty reliably finding some pretty interesting questions. They might be too hard. Maybe that's the other side of it, but it does seem like that'll be a path to interesting questions.

And then what about the rest of it, Adina? So I've got my question, I ran some experiments, I want to share it with the world. How could I maximize my chance of success there?

Adina Williams:Yeah. That's also hard. Well, if you already have an answer to your question ... when do you decide to share it? It's a tricky thing. Because often you are very excited, and you get your first result like, "Yeah, I'm ready. I want to do it." But you really need to fully argue your point. You really need to branch into many follow-ups that all jointly combine to support the answer to your question, make it believable for everybody. And I find that also very difficult because it's easy to spawn questions, but then how do you pick which of the 500 are going to fit in this particular project? So often I'll just kick up all of them and then bin them into basically different papers. And then just start on one. Depth-first search. But that also assumes you have a bunch more time to do experiments, so you can set them all out and just get to them all eventually. But I don't know.

Writing is also very tricky. I actually write the whole time. So I write for myself a lot when I'm initially planning things out. I think it depends on the way you think, but that helps me narrow down what the questions might be and what exactly the answers to the questions should be in relation to. So if you go with like, "I took this paper, and I read it deeply," then your answer is obviously in relation to that paper and that field. But it might also have side effects for other things. And so trying to figure out exactly which existing strands of work flowing through NLP your question stands in relation to, I think, is another tricky thing. And writing throughout helps me gauge that. And then at the end it's just salesmanship. Try to be as honest as possible, clear as possible. Make it easy to read.

Chris Potts:For the time management thing, almost a hundred percent of my advising meetings during office hours, say, are like this, "What you proposed is an entire PhD thesis. For your class project, why don't you do step zero?" And the students say, "That can't possibly be enough for a whole project." And I say, "Yes, it is!" And I'm a hundred percent right, all the time. Because if you go deep, you get more quality. It takes longer than you expected. But in the end you have a contribution. Yeah.

Adina Williams:Yeah, definitely. Also, budget more time than you think for all of the phases. Double it, triple it. It's going to take longer than you think.

Chris Potts:And the final piece there, you mentioned salesmanship. So, you've got to do some of that, otherwise your idea won't get uptake, but you also want to value openness in science. You want to be deliberative, but then again, the field is moving so fast that you can't do that for years. What about managing all those conflicting pressures? Do you have some magic words for us there? Or do you just live in it like the rest of us?

Adina Williams:I'm usually just struggling like everyone else. But one trick that I learned recently that has helped is to put the related work near the end. So historically you have: intro, related work, methods and approach, results, discussion or something, and conclusion. If you put the related work closer to the discussion and just have the intro and approach do the framing part of the related work, I feel like that helps with the limitations. Because they can also come up in relation to it, like, "Oh, these people did this, we didn't, it might be cool, but we didn't get there." And since that's at the end, people are already pretty much convinced that what you did was nice and good. So you can contextualize those limitations, make people appreciate them. But also they're not the first thing you see. "This other person did this with a hundred languages. We're doing it with four." That's not the related work you want at the beginning. So how you divvy that up is one trick that I've found somewhat useful.

Chris Potts:Right. Right. And then it's that same thing which is, in talking about others' work, it's almost always going to be clear that you're doing something different, and then you can just amplify that. And again, it looks small, but in the end it might turn out to be a major thing.

People worry all the time about being scooped. Have you ever literally been scooped on an idea?

Adina Williams:No.

Chris Potts:No? Yeah.

Adina Williams:Nope.

Chris Potts:Yeah. So that's probably like: we make too much of this. There's time to be deliberative. The field is not moving that fast. And the chances that someone is literally doing what you figured out how to do are actually pretty small. I'm sure it happens, but it shouldn't be foremost in your mind.

Sterling, did you have a question?

Sterling Alic:Yes. We have a follow-up to the discussion of related work. What is your paper writing process, and what sections do you write first when you write a paper?

Adina Williams:Oh, that's a hard one too. I spend a lot of time on paper writing. It's actually my favorite part, because that's when I come away with the final point of the paper, the final answer, the reason why this paper might exist. So I spend a lot of time there. I know people say not to start with the intro, but I usually start with the intro. The intro that I write is never the intro that ends up in the paper, ever. But it helps me at least frame what's coming later. You definitely need section headings. That's the first thing – abstract, section headings, intro, that's my order – and then the results ... I also have a summary of methods before I start writing in general, so that I don't forget any details and choices along the way. So that also gets slotted in whenever you have it. I feel like I'm just saying it's a mess. I just write everywhere.

Chris Potts:Embrace the mess, yes. Write and embrace the mess.

Adina Williams:Yeah. I don't know. I don't have anything great to add there. Good luck. It's a messy process.

Chris Potts:Good luck and start writing early.

Adina Williams:Yeah. Yeah. And practice it. Write a lot.

Chris Potts:Anything else, Sterling?

Adina Williams:The more you write, the better.

Chris Potts:Right? Anything else, Sterling?

Sterling Alic:No other questions.

Chris Potts:All right. So, Adina, final question. If you could just share a little bit, what are you obsessed with right now in NLP?

Adina Williams:Oh. Yeah, that's a good question. I really want to make synthetic data for training viable. I don't think it is, but I really want to do it. I would like for it to be viable, but I have no idea how to make that work. But I've been thinking about it a lot. What information we'd need about the synthetic data. How would we generate that? And so I've been generating some semi-synthetic things with replacement and stuff, but I would love to be able to fully generate all of it. And part of the reason for this is because there are so many fairness issues in the data sets that we have. And if you're able to generate it all, of course many choices go into that, but then you'll know exactly what's there. You will be accountable for what's in that training data, even though it might be at large scale. So I don't know. That's what I'm excited about. It might be futile. We'll see.

Chris Potts:So synthetic data generation as a way to address pernicious social biases?

Adina Williams:Yeah.

Chris Potts:Full circle on the place we started. Oh, that's fascinating. Yeah. And just say a little bit more. And is that doing the perturbations that you described or is this creating data from scratch that might have an underlying process that was not biased? What's the vision there?

Adina Williams:So far it's just been perturbations, because that's a simple place to jump off from, but I would like to do something more complex. I've played a little bit with grammar-based generation, but of course that assumes we have a perfect grammar of English. Do we have a perfect grammar of English? We don't. That's a hard problem. So grammars aren't perfect either, but it would be nice to be able to explore all of the kinds of biases – syntactic biases, gender biases, whatever social biases – and all these things, and actually be able to decide just how much of the data set we want to exhibit each of these phenomena, how we want them to interact, and then have more of a causal link between these things, too. So I don't know.
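The replacement-style, "semi-synthetic" generation Adina mentions can be pictured with a toy example: take existing training sentences and swap gendered terms to create counterfactual copies. The swap table and tokenization below are deliberately simplistic stand-ins for illustration, not anything from an actual pipeline.

```python
# Toy illustration of replacement-based ("semi-synthetic") data generation:
# swap gendered terms in existing sentences to create counterfactual copies.
# NOTE: forms like "her" are ambiguous (object vs. possessive); a real
# pipeline would need morphological handling. This toy maps each form one way.
import re

SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "her": "his", "his": "her",
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
}

def perturb(sentence: str) -> str:
    """Return a copy of the sentence with gendered terms swapped."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = SWAPS.get(word.lower(), word)
        # Preserve capitalization of the original token.
        return replacement.capitalize() if word[0].isupper() else replacement
    return re.sub(r"\b\w+\b", swap, sentence)

original = "The man said he would bring his guitar."
augmented = [original, perturb(original)]
print(augmented)
# ['The man said he would bring his guitar.',
#  'The woman said she would bring her guitar.']
```

The fully synthetic version she describes would go further and generate the source sentences themselves, for example from a grammar whose lexical and syntactic properties you control, so you know exactly which phenomena the training data contains and in what proportions.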

Chris Potts:Sounds wonderful and fresh and original. That's just great. Thanks for sharing that. And that seems like a good place to wrap up. So I'll just thank you again. This was a great discussion. Thank you, Sterling. And thanks to all the students who had questions for us. This was a wonderful discussion. Thanks again for doing it.

Adina Williams:Yeah. Thanks for the invite. I'm looking forward to seeing some of your guys' projects. Bet you'll come up with some cool stuff.

Chris Potts:Wonderful!