Podcast episode: Ellie Pavlick

May 9, 2022

With Chris Potts and Sterling Alic

Grounding through pure language modeling objectives, the origins of probing, the nature of understanding, the future of system assessment, signs of meaningful progress in the field, and having faith in yourself.

Show notes

Transcript

Chris Potts:All right! Welcome everyone! I'm delighted to welcome Ellie Pavlick to this ongoing quasi-podcast-series that we've been having. Ellie is an outstanding junior researcher at Brown who takes very seriously, in my view, the idea that we are investigating natural language understanding, in every sense.

Her timing's extremely good here because, for our course, we're in the midst of studying analysis methods for NLU, and Ellie and her group have been absolutely at the forefront of that. And I regard that as part of this effort to really deeply understand what it would take to achieve natural language understanding and, from an engineering perspective, how we would uncover whether a system that we had built had achieved something like understanding. That's informing not only engineering and NLP questions, but I think also feeding back into questions that we could raise as cognitive scientists and linguists and so forth. So it's all very exciting for me.

I'm delighted to have this chance to talk with you, Ellie. For the group, as usual, Sterling is my co-host. And if you have questions, send them to Sterling in the chat. Sterling, of course, you should feel free to ask questions as well.

So, Ellie, I thought I would dive in by reminiscing a little bit. If I have this correct, we last (and maybe first?) saw each other at the LSA meeting in New Orleans in January 2020. I assume your life since then has been pretty normal. What's been going on?

Ellie Pavlick:Yeah, I think that's right. And I guess this is what people say with COVID: that seems not that long ago. It's weird that that was like two and a half years ago or something. Yeah, I mean, similar to what everyone else's level of abnormal has been! I don't think I've had anything so different from what anyone else has, but it's been the years that it's been. But we also just had a kid. I have a six-month-old now, so that's been abnormal, relative to normal. So it's been an exciting couple of years – mostly weirdness, but babies are great.

Chris Potts:Oh, that's wonderful! Congratulations!

Ellie Pavlick:Thank you!

Chris Potts:Does this mean that you're going to fill your house with cameras so you can collect a multimodal dataset to build systems with?

Ellie Pavlick:Tempting. My husband has said no, and we're trying to figure out what level of experiments I can run on her. I was like, "What if I never used prepositions around her? What would it mean?" And he was like, "Don't. Let's not try it, run those other things."

She doesn't talk yet, but she's just learning the sounds right now. So she got on a raspberry blowing kick where she just goes [raspberry] constantly. And I think she thinks she's talking, so that's pretty fun. And you just raspberry back. And I'm sure it gets only better from here.

Chris Potts:I actually suspect there's a deep lesson in there about self-learning and metacognition, just from hearing those raspberries.

Ellie Pavlick:Oh, yeah. You know how, for languages you don't speak, if you can't say the difference, you also can't hear the difference, like the "pin"/"pen" distinction and things like that. So I'm wondering if, to her, when my husband and I are talking, she just hears [raspberry]. She thinks that that's what we're doing, and she's pretty sure she's also engaging in kind. And then it's only later that she realizes there's more to it.

Chris Potts:Well, you should totally listen to see whether she explores with sounds and sound patterns that don't occur in English, but that do occur in the languages of the world.

Ellie Pavlick:Yeah. Yeah.

Chris Potts:A constrained form of linguistic cognition, I think.

Ellie Pavlick:Right, right. It's a pretty international community, whenever you're around campuses and stuff. We're feeling kind of inferior because she's being raised in a mono-lingual household and literally all of her other friends with kids, it's at least three languages. So we're just like, "I'm sorry you're already behind, honey. There's only so much I can do." But maybe she'll be exposed to those other sounds. We're hopeful. I don't know.

Chris Potts:Well, for the preposition thing, I'm glad your husband is functioning as your review board and nixing it.

Ellie Pavlick:Yeah. Yeah.

Chris Potts:For the multimodal thing, my colleague, Mike Frank, who you might know, do you know Mike Frank?

Ellie Pavlick:No, but I'm going to be on something with him later this year. I obviously know his work and I will finally be at some conference or something with him later this year. So I was excited about that.

Chris Potts:Oh, so you might hear about this. He's gearing up to collect a truly multimodal dataset that is going to involve lots of families and kids wearing helmets with cameras on them. They've put a lot of work into designing these helmets with the cameras. But Mike did confide that he dropped out of his own pilot study with his family. It was so much work to keep track of everything.

Ellie Pavlick:I've seen some data sets like this. There's a couple of them. Video quality is usually really bad, but we've looked through some data and talked about it in our lab and things like that. And it's funny: I frequently, while I'm talking to her, despite not being in this study, I think to myself, "Would this be a weird thing to have on camera if I were in that study? Am I saying the kinds of things that are normal?" So much of the conversation is weird stuff. You're like, "Oh, I'm talking about some random hypothetical scenario of, what might be the case in 13 years," or some weird thing. I'm rarely doing the kinds of things that I hope to see in the data. Like, "This is an apple."

Oh, that was it. She had these blocks, a red circle and a green square. And I was like, "The red circle, green square." And I was like, "God, this is such ideal data." But then I had also just picked up some weird toy that was from a friend in Holland and I'm like, "Weird red little Dutch guy." And then I was like, "Oh, if I was studying modifiers, I'd be really annoyed that that was an instance of "red," because nothing about this is canonical in any sense, and now it's with "weird little Dutch guy," which is an awkward modifier phrase to have in there. So I also feel like I'm being watched even if I'm not being watched.

Chris Potts:So maybe I should tell Mike not to accept you into the study, on the grounds that you'll want lots of take twos on all of these things.

Ellie Pavlick:Yeah, yeah! Like, "Oh, that was a missed opportunity to learn about those adjectives. Can I go back?"

Chris Potts:The reason I asked you about this, though, was that I thought that you, of all the people I know doing NLU, might be the one who had a story where research was changed. Like, you were going to purchase a robot and then you decided against it and it actually affected the trajectory of your research. Did anything like that happen because of the pandemic?

Ellie Pavlick:Oh, no. I don't think so. We still don't do anything that you need to be in the lab for. I work with the roboticists at Brown, and they're great to work with, but you also just realize we kind of all have this dream of the robot going around, learning about the world through interaction, and the limitations of using an actual robot right now really prevent the language piece that we want to learn about. So we do more simulation and things like that. And I still hope, comfortably within my career, within the next five years or something, it'll be different. But right now, it's like, you spend a ton of time literally welding pieces of the robot back together. There's a ton of actual work that has to happen. Actual work, not what we do! And then it's like you're really limited by their space of sensors and various other things.

I think the robots are getting better and better. Brown's robotics lab just got some of these Spot robots – those creepy dogs that walk around – and they have really nice APIs and pretty rich sensors and stuff, so those might be different. But I had a dream of learning about different kinds of argument structure and stuff like that, and the robots we had could pick stuff up, but they couldn't give things to you. So you're already like, "Oh, you're not going to learn these kinds of things." So there's a lot of boring reasons why you can't quite do this study yet.

Chris Potts:Well, that's really interesting already because you're saying that it's kind of current practical limitations that are preventing you from doing more with actual robots, and that you expect actually to end up having more robots in your life in the future.

Ellie Pavlick:I think it would be really cool to work with actual robots, to have the actual interaction. I don't know how I feel about it needing to be a physical robot, but I think it needs to be an agent that you're interacting with. There're some software agents that do that. But I think it needs to take actions and have some state space and belief space and interact. I want that part of the robot, and I think it's even better if it's actually interacting with the world and has sensors. There are boring practical limits right now, but I don't expect them to be problems in the future.

Chris Potts:Well, that's already interesting to me, because we could separate out the interactional part, which in principle two text-only pure language models could be doing with each other, or with a human, from the thing that would come from learning via feedback what it's like to pick up different objects, say, just to pick two extremes. Or even to smell something.

Ellie Pavlick:Right. Right. We were working on a verb learning project with my student, Dylan, that'll be at NAACL. This one was one of those that we learned through simulation. So you have these object trajectories and stuff, but it's really just movement through 3D space. So you get some kinds of things: you can kind of differentiate "pick up," and you can even get kind of "put down" versus "fall." Some things that would have to do with the velocity.

So, you could get some subtle differences, but there were other things. We were struggling with "drop" versus "fall." And when we were talking to the roboticists, they were like, "Oh, well, if there's no notion for the agent of feeling being held and feeling it drop, it might be a hard distinction to make." Because that motor trajectory is kind of the same between having something be dropped and having it fall, and other types of things. I guess we didn't have "push" versus "nudge," but we had other things like that. I think we had "slap" versus "hit." And those are things where the tactile sense feels kind of important for really getting the concept. So you can see: we can do better than we can with text, maybe, but it's still not the full thing to really get at some of the nuances.

Chris Potts:It sounds fascinating. What's the model behind this, so to speak? What's driving learning for this agent?

Ellie Pavlick:We were kind of doing the first, simplest thing. We had objects in this 3D environment and we had just simulated applying forces to them, so we just had recordings of their trajectories. And then we were just doing basically language model pre-training – just predict where the object's going to be next, to learn state representations.

Then it was like a post hoc analysis. Do the state representations cluster according to verbs? I was kind of excited that it did at all. You do get pretty nice kind of semantic-y clusters. And it kind of makes sense because these things, in order to know like where something's going to be next, it makes sense to differentiate between picking up and putting down and rolling and sliding and spinning and toppling, these different types of things. But it is pretty cool. So I felt like it was very much the beginning. I was like, "Oh we could go so much further with this." This was kind of a toy proof of concept. That's cool.
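To make the setup concrete, here is a minimal sketch of that kind of trajectory "language modeling": an autoregressive model that predicts the next object state from the states so far. This is only an illustration of the idea described above; the architecture, dimensions, and names (TrajectoryLM, STATE_DIM, etc.) are assumptions, not details from the paper.

```python
# A minimal sketch (not the authors' code) of language-model-style pre-training
# on object trajectories: given a sequence of 3D object states, predict the next
# state. All names, dimensions, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

STATE_DIM = 9   # assumed: 3D position + rotation + velocity, 3 numbers each

class TrajectoryLM(nn.Module):
    """Autoregressive next-state predictor over object trajectories."""
    def __init__(self, state_dim=STATE_DIM, hidden_dim=128):
        super().__init__()
        self.encoder = nn.GRU(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, state_dim)

    def forward(self, states):
        # states: (batch, time, state_dim)
        hidden, _ = self.encoder(states)   # hidden: (batch, time, hidden_dim)
        return self.head(hidden), hidden   # predicted next states + representations

def pretrain_step(model, optimizer, states):
    """One self-supervised step: predict state t+1 from states up to t."""
    preds, _ = model(states[:, :-1])                      # input: all but last step
    loss = nn.functional.mse_loss(preds, states[:, 1:])   # target: shifted states
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random trajectories standing in for simulator recordings.
model = TrajectoryLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
fake_batch = torch.randn(32, 50, STATE_DIM)   # 32 trajectories, 50 timesteps each
print(pretrain_step(model, opt, fake_batch))
```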

Chris Potts:I have to check this out. This has been so much on my mind. Is the paper available? Feel free to just send me to it but I have a couple more questions I have to ask.

Ellie Pavlick:Yeah. It's not on arXiv yet, but it will be soon, and I'll send you to it.

Chris Potts:Oh, please do share. But let me just ask: is the only learning mechanism the kind of standard self-supervision that language models get?

Ellie Pavlick:Yeah. There's no language in it. It's just predict where the object will be next. And then we've trained probing classifiers over it. So it's pretty interesting.

Chris Potts:But it does get verbs like "nudge" versus "push" versus "shove"?

Ellie Pavlick:When we train the probes. We train the sequence prediction model with no language and then we take the state representations, the frozen states, and we train the probe over top of it to differentiate the verbs.
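Correspondingly, a minimal sketch of the probing step described here: the trajectory model is frozen, and a small classifier is trained on top of its state representations to predict verb labels. The use of a linear probe, the dimensions, and the 20-verb label set are illustrative assumptions.

```python
# A minimal sketch of probing frozen state representations for verb categories.
# Dimensions and the label set size are assumptions for illustration.
import torch
import torch.nn as nn

HIDDEN_DIM, NUM_VERBS = 128, 20

probe = nn.Linear(HIDDEN_DIM, NUM_VERBS)   # linear probe over frozen features
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(frozen_reps, verb_labels):
    """frozen_reps: (batch, hidden_dim) pooled states from the frozen model,
    computed under torch.no_grad(); verb_labels: (batch,) integer verb ids."""
    logits = probe(frozen_reps)
    loss = nn.functional.cross_entropy(logits, verb_labels)
    optimizer.zero_grad()
    loss.backward()          # gradients flow into the probe only
    optimizer.step()
    return loss.item()

# Toy usage: random features standing in for pooled, frozen trajectory states.
reps = torch.randn(64, HIDDEN_DIM)
labels = torch.randint(0, NUM_VERBS, (64,))
print(probe_step(reps, labels))
```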

Chris Potts:Oh, wild! So the language model mechanism, as we call it – the self-supervision mechanism – is being used to learn from sensor inputs about different kinds of pushing or shoving or nudging motion.

Ellie Pavlick:Yeah. Yeah.

Chris Potts:And then, latently, that will help you discriminate whether the internal representation was a "nudge" or a "shove."

Ellie Pavlick:Yeah. I was really excited about this. Like I said, I think this was kind of the first step. We have 20-ish verbs and, really, it's getting 3D position, rotation, and velocity. So it doesn't have contact sensors and stuff, which would be cool. And there's no interaction. So there can't be "give" and "take" and those kinds of things. But my thought was this kind of mechanism would be good at encoding presuppositions and things. If you take something like "start" versus "stop," that's the kind of thing where, if you're trying to predict when something is going to stop – if you're trying to predict where a thing is going to be in N-plus-K steps – it's useful to know that it's not going to be moving anymore, so you would encode these kinds of changes.

Or something like, if we had the interactions – if you wanted to predict where this thing's going to be next, you'd need a predictive model that captures that. So I felt like those kinds of presuppositions would be well encoded here, as well as some kinds of argument structure-y things, like how many agents are going to be involved in this activity. But we didn't get to extend it to that stuff quite yet. But I feel like I was encouraged by the initial results, so we're going to try to extend it to these other things.

The main limitation is really getting the data. We work with the 3D representations, which are pretty rich and they're really good starting points. So there's nice analysis in Dylan's paper where some of the verbs, you really don't need the pre-training. Just having an MLP over the 3D state representation with no abstraction gives you good representations of certain verbs, like "pick up," for example. But other ones, like "rolling" versus "sliding," were ones that you did better with the pre-training, which kind of made sense, because there's some longer-term dependency stuff here. Getting that kind of really good 3D representation at scale is harder to do, but I think not completely implausible. But you would need people interacting in some kinds of simulated environments and recording a lot of environment data. I'm pretty excited about the idea.

Chris Potts:Oh yeah. You could get into change-of-state things like "cutting" versus "putting a hole through" and other things where the instrument would matter more. That's wonderful.

Ellie Pavlick:Yeah. I'm really curious. Maybe I'm just drinking a bit of the Kool-Aid, just seeing how impressive the language models are, but it's kind of like: how far could you push this same kind of modeling paradigm if you were just modeling all the things that were happening in the 3D world? Like some things we know about kids, you have some selective attention. You're not bothering to model every single thing about the world. But if you assume you've got some focus on what things you should be modeling, I feel like you get a pretty rich semantic space out of just trying to predict what happens next.

Chris Potts:Oh, I don't know whether it's drinking the Kool-Aid, but I'm with you, in the sense that this is a fascinating question to ask. It's been on my mind a lot. Is pure self-supervision going to be enough to learn something that's indistinguishable from a grounded semantics? So I'm so encouraged to hear that you're doing this, because it seems like constantly posing very circumscribed supervision tasks is not going to scale for us, but it also can't possibly be what humans are doing, posing these neatly defined tasks. We get a lot of distributional information, and if that's enough for learning rich, realistic things, then it's completely eye-opening about what we can do in NLP, but also about how cognition might work.

Ellie Pavlick:Right, right, right. I don't think we would say that the models we currently have learn like humans do, but there are some things about the type of learning they do that are more plausible than training on half a million question–answer pairs or something. It's progress.

Actually, we might have that many parameters, but maybe we don't have that many examples, or maybe we don't have certain things. But there's something more elegant about it, a more plausible feeling.

Chris Potts:I'm curious to ask more about the probing work that you were all doing, but before we leave this topic, Sterling, are there any student questions or questions from you on the topic of grounding and philosophy, I guess?

Sterling Alic:Not so far. None that haven't been already answered.

Chris Potts:I'm actually curious to hear about what you all are doing in the area of probing now, but before leaping to that, I've always just wondered what the origins of the probing work are. Maybe back to the Tenney et al. paper. I wouldn't have guessed that it would work. It worked beautifully. Where does the insight come from? Do you remember?

Ellie Pavlick:I'm trying to remember. Yeah, I have to credit Ian for a ton of that. He was really a driving force. It came out of the Johns Hopkins summer workshop, that JSALT workshop. Before ELMo, we were thinking about trying to build sentence representations and that's what had been pitched. Sam Bowman was like, "We want sentence embeddings, the way we have word embeddings." Skip-Thought had just been introduced and it was like, "Let's try to come up with better versions of Skip-Thought."

It was pitched sometime in the fall – all this planning happens – and it was planned for the summer. And then ELMo came out, and there was a lot of work to reroute a bit because ELMo changes things: let's try to come up with versions like that, let's think about contextualized models. I feel like it was kind of an exercise in trying to do, then, what everyone does now, which is just train better and better giant language models. But we didn't really have that scale. So we were trying different pre-training tasks and things like that. And that was the theme of the workshop.

And then BERT came out right at the end of it. So it's kind of in the middle of there. And there's a big paper that came out where we kind of compare – it's basically ELMo trained on lots of different types of supervised tasks. So part of the workshop was try to build better sentence embeddings. And then the part that Ian and I were working on was then trying to evaluate the ones that we built. So we were really interested in the intrinsic evaluations and there was that Alexis Conneau paper, what can you cram into a vector? Like that stuff. So we were thinking about things like that, and then that's where the edge probing came in. Ian was like, "Let's do this for all the different tasks. We have all this great data available. What if we just go through and systematically probe these different things?"

So we were working on that. Then we had a paper on that, and then he was like, "Let's do it layer-wise, just to see which layers it works in." We started looking at some of the graphs, and they had a really nice shape to them, and then we were like, what if we just put them all in order? And, "Oh my God, it looks exactly like what we'd expect!" Like you said, you wouldn't think it would work. It was not a hypothesis-driven study; it was kind of exploratory. Which means that now, everything I'm doing after the fact is asking: is it real, or is it coincidence that it looked that way? It wasn't this preregistered kind of a thing. It was: it looks like what we want it to look like. That was kind of the takeaway of the paper, but it was really cool. And I kind of believe it more as we keep going and keep seeing all these other things that I'm also doing.

Chris Potts:Oh sure. So did you probe ELMo first?

Ellie Pavlick:We did. Yes. So actually there's the ICLR paper. It's called "What do you learn from context." And that one was on ELMo and GPT-1, I guess – just GPT at the time.

Chris Potts:GPT-1 would be odd historically, so GPT!

Ellie Pavlick:Yeah, exactly. So we had compared ELMo to the un-contextualized embeddings. So we had a lexical GloVe baseline. We had ELMo. Actually, maybe the version we submitted and that got reviewed just had ELMo. Then BERT came out after and we were like, "For the paper to be relevant, we have to include BERT everywhere." So we reran all the stuff and included all these models. In the ICLR paper, I don't know if it's in the appendix or if we just had this whole section that has a footnote that's like, "This was not peer reviewed." We just dumped in a whole ton of results. After that, the BERT results were obviously really encouraging. And then we did a follow-up on just BERT.

The rate at which stuff comes out is ridiculous! So it was just like the models were coming out faster than we could analyze them! And that's still happening!

Chris Potts:That does help me understand the origins, but why did you hedge about the preregistration thing? What's the concern behind that?

Ellie Pavlick:We talked about this a lot – there's concern about seeing what you want to see. All of these things kind of peak in the middle, and there are different ways that you can say at what layer this happens. We don't make an explicit causal claim, but what you would really like to do is make the causal claim, which says, "Semantic role labeling depends on part of speech." Of course, it's not strictly a pipeline – the argument, which I believe, is that these can feed back on each other, so that once I commit to some parts of speech, that allows me to decide how to parse this semantics, and then when I realize I can't parse this semantics, maybe I go back and kind of refine it.

So they can have these causal effects going in both directions, but there still needs to be some kind of an actual causality between the things. And instead, it could be the case that everything kind of peaks in the middle. So we have a lot of different ways in that paper of trying to compute at what layer it happens. You could take the center of gravity of this thing, or: "At what layer does the marginal gain in performance begin to plateau?" We have all kinds of fancy things we try. If you look at the figure, which is the most compelling one, you're like, "Yeah, it kind of looks like they go in the right order."
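For concreteness, one way such a "center of gravity" summary can be computed from layer-wise probe scores is sketched below. This is an illustrative reconstruction of the idea, not the exact statistic from the paper, and the example scores are made up.

```python
# A minimal sketch of a "center of gravity" summary: given a per-layer probe
# score for a task, report the score-weighted average layer index as a rough
# answer to "at what layer does this task happen?"
def center_of_gravity(layer_scores):
    """layer_scores[l] = probe performance (or mixing weight) at layer l."""
    total = sum(layer_scores)
    return sum(l * s for l, s in enumerate(layer_scores)) / total

# Toy usage with invented scores: a "syntactic" task peaking earlier than a
# "semantic" one.
pos_scores = [0.2, 0.5, 0.9, 0.95, 0.96, 0.96]   # e.g., part-of-speech probing
srl_scores = [0.1, 0.2, 0.4, 0.7, 0.85, 0.9]     # e.g., semantic role labeling
print(center_of_gravity(pos_scores) < center_of_gravity(srl_scores))  # True: POS sits "earlier"
```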

I have a student, Charlie [Lovering], who's been doing great probing stuff and moving a bit outside of NLP, so he's been looking at reinforcement learning models. We're looking at a game called Hex, which is kind of like Go but simpler. We were trying to do a similar thing where there's these concepts that kind of depend on each other and are required to play the game, and if you look at what layers they peak in, it seems like you learn the concept and then you kind of start using it. But it actually seems like the model gets better. And as it gets better, all your probes get better too. But we couldn't actually find evidence that learning the concept was causing you to be better at the game. And I think a similar thing could presumably happen in language. It's not like it learns syntax and therefore it understands better. I'm not sure exactly what it would look like, but it could be kind of a correlational thing.

Chris Potts:I see. Because that's a different concern than I thought you were raising. Because my own take is we – me and my students – we do lots of probing for various reasons and it's always illuminating. And the concern is actually just that the probes are often too successful. Even using some methods developed by John Hewitt and Percy Liang where you kind of discount what you might expect. Even then, the whole graph will light up with the thing they were looking for. But that tells me that I think it's really there. So I don't have concerns about that. But the causal story, I think, is a concern for probes and I would generalize it: any method that is not doing any active manipulation is unlikely to support causal inference.

Ellie Pavlick:Right. Right.

Chris Potts:But what probing does tell me is that, from a purely data-driven learning process – pure self-supervision in the case of BERT – all of this information about language is plausibly, latently encoded in the representations, and that's enlightening-to-eerie to me, but certainly illuminating.

Ellie Pavlick:Yeah. I felt like some people view this as some kind of anti-linguistics thing, where it's like, "Oh, all that work on treebanks was for naught." But I view it as the opposite. It's like, "This is so validating." It's like, "All that stuff you said was there was totally there," and our models were not able to succeed without inevitably finding it. No matter what you do, you keep finding this constituency structure and these semantic roles and stuff. And I thought that was super cool.

Chris Potts:Oh yeah. I mean, that's another limitation of probing, which is that it's very dependent on the supervision we do have for the supervised probe. But that seems incidental. And then what I think probing offers is the promise that you would discover what the right syntactic trees are, because surely all our trees are wrong. And you can just peer inside these models and piece together what English verb phrases actually look like, which might be nothing like what we think they do. And that's why I say to my linguist friends, "Hey, you really ought to be paying attention to this analysis that's happening in NLP."

Ellie Pavlick:Yeah. I've been talking more to friends and trying to think through going in the other direction. For the last many years, it's been: we get some things from linguistics, so let's build that into models to make them better. But I feel like now we're actually generating some stuff that could then be vetted in humans, and I just really want to see the work in that direction, because I think there's a lot of potential for that. Really exciting.

Chris Potts:Oh, I'm so excited! But it's going to require real interdisciplinary work, because now you really have to have mastered all the NLP concepts and all the linguistic stuff.

Ellie Pavlick:Right! I think there are students who want to do that exact work, from the students I talk to. I think there's the appetite for it, so I'm very optimistic about the next couple of years.

Chris Potts:Very exciting. So what else are you doing in the area of probing?

Ellie Pavlick:Yeah. Charlie Lovering is my student who's doing the most in that area. We've been increasingly trying to get closer to this causal stuff, doing some manipulation. So a lot of what his work has turned up is kind of: what does it mean to have the concept? It's not enough to pass the behavioral test and say, "We found some minimal pairs and you got good accuracy," because anytime you do these input–output pairs, you don't actually know what happened under the hood. And then it's not enough to have the probe, because they could just be correlated. It's not that the thing that you found is the thing that caused it to make the prediction. It could be that it knows this thing is a noun, but it decided to predict whatever because it was capitalized. So it's not quite the same as having the right chain. So we've been trying to do these converging-evidence, multifaceted evaluations. We've been kind of calling them unit tests or various other things, to say, "You need to pass the behavioral test. You need to pass the probe test."

Actually, we have this paper that – honestly, it's in revisions and I'm the one who just hasn't gotten around to telling him, "Okay, it's good to resubmit." It's almost published, and now I'm getting annoyed because I've been talking about it for so long that I forget that it's not published yet, but I'm happy to share it with you. But we have these extra criteria which are more Fodorian. We've been reading a lot of Fodor. At first I thought maybe this is related to some Jerry Fodor things, and now I still think there's some connection there, but maybe it's more kind of causal model stuff. So we're trying to figure out what theory is the right one to interface with.

But we had this kind of type–token piece, which is basically: the same probe needs to generalize across a lot of different instances, so it needs to be a stable representation. It's not enough to say, "I differentiate." So we're actually not doing language – this is a vision thing – but to keep it in language: you couldn't just say it knows things are nouns; it has to be the same piece of the network that's calling "cat" in this sentence a noun and calling "dog" in this sentence a noun, otherwise you don't have the concept of noun. So it needs to know that it's a noun. There needs to be a probe that says it's a noun. It needs to be that the same probe applies to all the instances of noun. And then it needs to be that when you go and ablate that piece of the network that called it a noun, it changes the prediction – it no longer thinks it's a noun – so it needs to be causally related to the output as well. That's the direction I feel it needs to go, so that I can feel more strongly that the concepts are there and they're being used in the way that we think they're being used.
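A minimal sketch of what that converging-evidence, unit-test style check might look like in code is below. The field names, thresholds, and pass criteria are illustrative assumptions, not the paper's actual criteria; it only shows the shape of requiring behavioral, probing, consistency, and ablation evidence together.

```python
# A minimal sketch of "concept unit tests": a model only counts as having a
# concept if it passes a behavioral test, a probe detects the concept, the
# *same* probe generalizes across instances, and ablating the responsible
# component changes the prediction. All thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class ConceptEvidence:
    behavioral_accuracy: float      # accuracy on minimal-pair behavioral tests
    probe_accuracy: float           # held-out accuracy of a single probe
    probe_transfer_accuracy: float  # the same probe applied to new instances
    ablation_effect: float          # drop in concept-consistent predictions after ablation

def has_concept(ev: ConceptEvidence,
                behavioral_min=0.9, probe_min=0.9,
                transfer_min=0.8, ablation_min=0.2) -> bool:
    return (ev.behavioral_accuracy >= behavioral_min        # input-output behavior
            and ev.probe_accuracy >= probe_min              # concept is decodable
            and ev.probe_transfer_accuracy >= transfer_min  # one stable representation
            and ev.ablation_effect >= ablation_min)         # causally used by the model

print(has_concept(ConceptEvidence(0.95, 0.93, 0.88, 0.35)))  # True under these thresholds
```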

Ellie Pavlick:Been Kim is doing really cool work in this area, and the kind of counterfactual stuff on the latent space. And a student, Mycal Tucker, at MIT – I think he's working with Roger Levy, but his main work is in a robotics group – they had this cool paper called "What If This Modified That?", which kind of does these counterfactual perturbations for syntax stuff, and we've been working on that paper a lot. There's a lot of cool work on this causal stuff in the probing. Yonatan Belinkov always has really cool stuff in this area. That's the direction I feel really excited about, where I feel like we can make a deep claim, whereas right now I feel like we have to hedge so much with the probing stuff.

Chris Potts:This is really refreshing, though, because what I hear you saying is that, even though in the field we act like we have only behavioral criteria in mind and we pose behavioral tests, the reason why we always say the test was no good when a system passes it is that we're not actually behaviorists. We actually have something more internal in mind for understanding. So let me ask this as a question. If a system does pass our really hard behavioral tests, and you can show, using these mechanisms, that it has the kind of internal causal model that we recognize as correct in some sense – of language or of the world – is there anything more to understanding?

Ellie Pavlick:Honestly, I don't know. I feel like the empiricist in me says no. If you get really pushed, it has to become a religious argument at that point. I don't know what to appeal to other than some kind of soul, and I'm not touching that, so I guess, no. But that middle question – does it pass the causal model? – that's very hard. The field that I find increasingly the most relevant would be comparative psychology, right? Where people try to argue: do crows have a concept of negation? Or: how do we know what counts? And we would need a convincing model of what humans do and what counts as cognition such that we could now say, "Oh, the neural net's doing the same thing, therefore it passes." But I think it's a while until we have agreement on that. People are going to start equivocating about whether this thing really counts as negation or whether this type of generalization really counts as generalization. But to say: if we were to have a complete causal model of what humans are doing and we could show neural nets are doing the same thing, I would take that, yeah.

Chris Potts:I would, too. Yeah. No, I'm completely on board with this. This is actually funny, because I had a prepared question and I think you've already kind of answered it, but let me ask it, just to check in. My question was: could a model trained on a high-fidelity digital trace of odor profiles learn to smell, even if it had no nose? Your answer's got to be yes, based on what you did with the pushing and nudging, right?

Ellie Pavlick:Yeah, I think so. Yeah. Yeah. I mean, I guess people do appeal to the experience. There's an experience of smelling. But that's the thing that we can never verify in each other, either. So I don't know. I would have to accept that the model smells in the same way you're smelling and I don't really ever know that you're smelling or that you smell the same way I smell, or whatever.

Chris Potts:But that experience of – that could even be part of whatever causal story we tell about each other when we talk about smelling things. And what I heard you say is that, if the system embedded structures that were aligned, in some sense – had the same causal dynamics as – the high-level causal model, that's all there is to smelling.

Ellie Pavlick:I would buy it, yeah. Yeah, totally.

Chris Potts:We're aligned philosophically here. I'm not sure whether that's the dominant view though.

Ellie Pavlick:Exactly.

Chris Potts:But I do think you're right about the field. I don't think anyone is a behaviorist and that's why, when we succeed at our tests, everyone just calls the test into question.

Ellie Pavlick:My psychoanalysis of the field is no one's actually a behaviorist, but they're also very committed to empiricism and they're afraid of entering a dogmatic debate, and you feel super conflicted. Because for some reason, I think people also think that if you say it's not a behavioral measurement, that it's also somehow not a measurement, like it's not scientific or there's no experiment that can be run, which isn't true. There are a lot of empirical studies that can be run other than inputs and outputs. But I feel like people get stuck in this point where they're like, "Oh my God, I don't want to enter a philosophical debate." And so I feel like people don't attack the argument directly, but we're probably all actually kind of aligned on what we feel.

Chris Potts:Makes sense.

Ellie Pavlick:I mean maybe not all aligned, but I feel like the general spirit of what people want in NLP – it seems like mostly people are on the same page.

Chris Potts:What about another aspect of this, at least another aspect in my mind, which would be the behavioral tests that we do pose – the benchmarks? When I think about current benchmarking, I think often about the work that you've done around inherent uncertainty in certain kinds of semantic judgements and how that might eventually play into even the most fundamental metrics that we use for assessing simple classifiers and things like that. Is that a connection that you've made as well? Is that kind of the motivation here?

Ellie Pavlick:Yeah. Yeah. Human performance – well one, I think you have to model that to model human performance. If there's not one ground truth, then benchmarks are kind of fraught. And then it's also one of those things – I guess it's kind of like the process matters as much as the answer – it could be that humans disagree on this, so human agreement is 30% accuracy, but each human would be able to explain why they said what they said. So if a model's at 30% accuracy, because it's random guessing, and human agreement is 30% accuracy, that doesn't mean it's human performance. So I think they're very intertwined, these kinds of things.

Chris Potts:Here's an example. I forget whether this is in the paper or just in talks I've seen you give, but imagine we have a natural language inference example, premise, "picture of a barbecue," hypothesis, "picture of a machine." I'm a crowd worker out there in Mechanical Turk land and I need to assign one of three labels. And internally I feel a struggle, because I feel like I could say neutral or entails, I guess, contradiction or entails. But, in any case, I'd have uncertainty about the labels and then I would just make a choice, and then we would aggregate over lots of people doing things like I just did and call one of those the "correct" label probably. And you observe in the paper that actually you can see this kind of bimodal distribution that maybe was happening internal to all of the annotators. What's the ultimate takeaway from that, beyond the insight about semantics and cognition? What's the takeaway for NLP?

Ellie Pavlick:Yeah. My not-super-philosophical answer is I don't think these evaluations are that good. It sounds simple, but I've been dealing with the entailment task since my PhD and it's been bothering me since my PhD, because every time you got the Turker answers back, I would try to quality control the Turkers and then I would look through at the people I was going to reject. I was like, "Well, I can see that. I can see that." It was really hard to do.

I just feel like it's very unnatural. It's just not a thing we're ever asked to do – "Is this entailed?" Maybe occasionally. When we first wrote that paper and I was talking to people, my feeling was what we want is more like a language modeling task, which then makes me actually kind of surprised at the language models. If you look at linguistics papers and they're going to look at something like lexical entailment type of phenomena, it's more of a "Can you accommodate this?" So you wouldn't say, "There is a barbecue. There is a machine. True or false?" You would say something like, "Well, she was standing there by the barbecue and blah, blah, blah, blah, blah," and then, "Well, she could barely hear anything because she was right by that machine," and then you would just be like, "Oh, okay, that's fine. I guess I was picturing it wrong, but I can accommodate," or you'd be like, "Wait, what the hell are you talking about? I don't understand." That's the nature in which these judgements would happen. So maybe you'd be kind of entertaining both things at the same time and it would be like, "Am I willing to take this in stride or not?" That feels like a more natural way of doing it where you don't have these forced choices and you can kind of push back if it really goes against what you were saying. I've always wanted to set up that study. I feel like we might get higher agreement in something like that. So my takeaway was just these kinds of evaluations, we've probably gotten as far as we can get with them.

Chris Potts:I really like that. I mean, my thinking – inspired by your work – is kind of like this: we could have a bandaid over this that just says, "Retain the response distribution that you got from your workers and let's have models predict those distributions." And if your model can predict that this one was bimodal and this one wasn't, then it's more successful than one that only recognizes, essentially, one reading of the example. That's like a bandaid.

Another bandaid would be: let's give crowd workers some freedom to assign a distribution or multiple labels or something, and that would feed into some other bandaid in my metaphor. But the really deep thing is the human capacity to recognize that, in different contexts and with different communicative goals, the label is actually different. We don't even try to develop systems that can do that thing. But that would be meaningful. That would be more than a bandaid.
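A minimal sketch of the first "bandaid" – scoring a model's predicted label distribution against the retained annotator distribution rather than a single gold label – might look like this. The label set, the numbers, and the choice of KL divergence are illustrative assumptions.

```python
# A minimal sketch: keep the full annotator response distribution per item and
# score the model's predicted distribution against it (here with KL divergence,
# lower is better). Numbers are invented for illustration.
import math

def kl_divergence(human_dist, model_dist, eps=1e-9):
    """KL(human || model) over the label set."""
    return sum(p * math.log((p + eps) / (model_dist[label] + eps))
               for label, p in human_dist.items())

# Annotators split between "entailment" and "neutral" on the barbecue/machine item.
human = {"entailment": 0.5, "neutral": 0.4, "contradiction": 0.1}
bimodal_model = {"entailment": 0.45, "neutral": 0.45, "contradiction": 0.10}
one_reading_model = {"entailment": 0.95, "neutral": 0.04, "contradiction": 0.01}

# The model that anticipates the split scores better than the one committed to
# a single reading.
print(kl_divergence(human, bimodal_model) < kl_divergence(human, one_reading_model))  # True
```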

Ellie Pavlick:Right. I think my immediate thought was, "Oh, we should model the distribution," which is kind of a less wrong thing to do, but it's definitely not the end goal or the right thing to do. Maybe it's a bad diagnostic. And I thought, if a system were understanding all the context and thinking about how it's different in different contexts, it would also be able to anticipate this distribution. But it's also definitely the case that a model could predict the distribution without encoding all that interesting complexity underneath. So I think it would be better if we could just directly do the thing that humans can do, which is entertain these multiple contexts in mind. And I actually think what humans do is even more impressive, because it's not like you pre-anticipate all of these, right?

You can come up with it after the fact – only after being confronted with the weird contradiction – then you can go back and backfill a context. People are very, very capable, almost to a fault.

This is kind of a tangent, but I've noticed – I was in a lab meeting. One of my students was in a lab meeting once and someone was sharing their screen and talking about some figures and the slides were wrong and not advancing correctly. So they were just hovering over the graph and describing parts of the graph. They were looking at some scatter plot. And we were all looking at a bar chart, and we all made it work. We were like, "I see what you're saying. Okay. Maybe you need some other baselines," but no one was like, "This makes absolutely no sense." You made it work, and that's a remarkable capability – to twist everything that's being said in a way that completely fits an entirely different world, but we're very willing to do that.

Chris Potts:Absolutely. I mean, your example with "barbecue" and "machine" was one of those cases where, whatever I believe beforehand about "barbecue" and "machine," to recognize your intentions and make sense of your discourse, I infer that you believe they're in an entailment relation. And then I can kind of make sense of it all. Yeah. That's what Herb Clark calls "lexical pacts." And that would be where I kind of accommodate to you, maybe just temporarily, and we're incredibly flexible about that.

Ellie Pavlick:Yeah, yeah. Yeah. So I don't know how we could quite test computers. I guess we could say, "Oh, the ultimate test is full language understanding," but that's not very satisfying. It doesn't give us anything to do right now. And it is nice to have something that's a more immediate test of the lexical semantics we care about, but it feels like we could set up something that gets more at the discourse and the accommodation as a way of trying to test whether the model was modeling it.

Chris Potts:And maybe the interactional part. I mean the crowd worker out there is probably just really eager to ask you a question: "Why are you asking me about this premise–hypothesis pair? What is the issue? And that will guide my label choice." Or maybe just negotiate with you. "What would you like me to assign here? And then I'll apply the lesson forward." All those things are kind of the human capability, in my mind. Not the classification decision. That's kind of like a side effect of something much deeper that I would love to probe directly.

Ellie Pavlick:Yeah. I agree. Yeah. Yeah.

Chris Potts:But just expanding out, do you have a vision for how benchmarking might change, or should it stay the same?

Ellie Pavlick:I've never been a benchmarker. I've been proud that I've never submitted a paper that gets state of the art on any benchmark, or even really evaluates on any benchmark. I just don't do benchmark papers, so I don't feel a big loss if benchmarking changes.

I really very much acknowledge how much benchmarking has driven our field forward, but I'm also okay if we're just done with benchmarks for a little bit. I think it might be the kind of thing where there are really big gains to be made early on – just, let's get from the random, look-up-table machine translation that's doing one word at a time to something better. And benchmarking really helps drive that.

But when we're at the point we're at now – I don't know how to tell other people what to do, but in my lab, I feel like we're just trying to borrow more from the natural sciences. More theory-driven stuff, more hypothesis-driven stuff. I don't think we can say this model is better than that model. We can say this model is consistent with this theory of understanding by these criteria. And then it's kind of up to us or other people to decide which theory of understanding we care about.

To the extent that we want a really good QA system, I think the current benchmarks are doing pretty good because we've been crunching on them and we have pretty good QA systems and maybe we can extend them with some challenge sets and things like that. But if what we want is a really good system to do that thing, I don't think there's a fundamental problem. But if we want natural language understanding, I don't think benchmarks are the way to that. It just feels inherently theory-dependent and so I feel like we should just own that and run these individual kind of hypothesis-driven studies on each model we're looking at or for each phenomenon we care about.

Chris Potts:That's really helpful, actually, because that's making me realize that the term "benchmark" brings a particular frame that does imply that we want to be ranking models – benchmarking the models, in particular. I guess there's a different perspective, which could just be: we're going to develop data sets that are going to be our fuel for developing systems and also our tool for testing a hypothesis about the degree to which a system has a capability. And from that perspective, they have a different kind of importance, which is just that they are the fuel, and whatever is codified in the data sets, those are the capabilities our systems are actually going to obtain or not.

Ellie Pavlick:Right.

Chris Potts:From that perspective, do you think we're okay?

Ellie Pavlick:Yeah, I kind of agree. I think we've seen a lot of problems, though. There's a desire to collect reusable data sets, and I don't want to say I don't think it's a good thing to do, but if you look at other fields, it's not commonly the case that you reuse data across multiple papers. There's a cost to collecting some data to test your version of the hypothesis and other people might use it for some kind of reproducibility, but it doesn't have to be the case that data can be reused many, many, many times.

It's different in computational models, but I do think that sometimes the data kind of has to be model-specific. We want to look at how this model does this thing and almost by definition, the next time you want to evaluate a different model, you're going to need to do it slightly differently. Otherwise, what happens is we quickly over-fit to our data sets for the wrong reasons.

Actually, the paper I worked on with Tom McCoy and Tal Linzen – the HANS data – is a really good example. We think these models are using the lexical overlap heuristic, so we're going to construct these stimuli that have high lexical overlap, and we show that the models fail. It felt like it was really clean and I loved it. But now the data – it still gets used correctly in a lot of cases, where we keep evaluating models on this thing, but then people also kind of train on it, or they fine-tune in a way that allows you to do better on that task. It suddenly starts to not be nearly as powerful evidence that your model's doing the right thing as it was when it was designed cleanly and evaluated the first time. I think it's not such a bad thing if more people have to do the work of collecting data and if data has kind of a shorter shelf life.

Chris Potts:Are you thinking primarily of assessment here and not so much training? Or are you lumping training into this?

Ellie Pavlick:Oh, I'm thinking of assessment. I think training could be reused more.

Chris Potts:Yeah. So then my question will recur again, which is, "Well, that's going to shape what these systems do, at the very least." And so it's going to say, for example, that for natural language inference, it's okay to reduce it to three-way classification, because that captures enough of the capability. Another response would be, that's not even close to what it means to do reasoning in language. The whole idea is busted. You need to have entirely different data sets.

Ellie Pavlick:Right.

Chris Potts:Which I think is a takeaway I could have from this TACL paper about NLI: that it's the tip of the iceberg, so to speak, of a very deep problem, relating to what it means to reason in language.

Ellie Pavlick:Yeah. I think I feel that way about NLI. Again, I think, when NLI was created, our systems were so much worse. So even being able to do the three-way classification required a lot of progress. But I think right now, no matter how deep or complicated you make the premises and hypotheses and the inference connecting them seem – I guess this isn't based on anything super empirical, except I just feel like – that's not the right task to be training for right now, is this three-way NLI classification.

Chris Potts:That's fair, but we could extend it to QA too, right? There's a little history there where we want our data sets to have unanswerable questions, because we don't want systems to just try to answer whatever they can. But then that gets reduced into just another class in what is ultimately a classification problem. Whereas the human thing, in response to a so-called unanswerable question, is highly varied. It could be, "I need more information from you about your intentions or your intentions with the language," or "I can answer part of your question, but not all of it. Is that helpful?" All of that other stuff recurs. It's essentially an interactional thing that we've reduced to a one-off classification problem.

Ellie Pavlick:Maybe I'm speculating on some technical stuff, but I feel like, if you have a three-way classification, ultimately your neural net needs to learn three clusters – that's all it needs to do. And if you look at that last layer, things that are very semantically different will get collapsed down into: which element of this three-way softmax am I going to pick? And so, unless you believe that natural language semantics has three clusters, that's not the right thing.

Obviously before that, it had more than three clusters and then it just got mapped down to three at the end. But if the whole game is to get it into these three clusters, you don't have to do anything more than that. I feel like the generation tasks have a lot more – they're doing a really good job because they're so complicated and you can't really master them. There's not one right answer and they're empirically proving to do a really good job. I feel like classification – it's too easy basically. It just doesn't force you to learn hard stuff.
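A tiny illustration of the three-clusters point: whatever structure lives in the penultimate representation, a three-way classification head only has to carve that space into three regions, so very different inputs can land on the same label. The dimensions and random inputs here are arbitrary.

```python
# A minimal illustration: a three-way head over rich representations still maps
# everything into just three output buckets.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 256
head = nn.Linear(hidden_dim, 3)   # entailment / neutral / contradiction

# 1,000 "semantically different" penultimate representations...
reps = torch.randn(1000, hidden_dim)
labels = head(reps).argmax(dim=-1)

# ...all collapse into three clusters, which is all the training signal rewards.
print(torch.bincount(labels, minlength=3))
```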

Chris Potts:So we should capture that in our data sets so that people are pushed to do more difficult things with their models.

Ellie Pavlick:I think language modeling is proving to be really good, probably for this reason. We could keep going a little further with these kinds of interactions, and things like that. But I do think we should get more creative with the training, but I guess, like we were talking about at the beginning, I think less about supervised tasks and maybe more: collect a lot of good observation and interaction data and then come up with some kind of less objective objectives – squishier objectives, like predict the future or model other people's reactions in states and things like that. That kind of squishier objective.

Chris Potts:I mean, if I think about all the things you've said, I guess it wouldn't surprise me if, in 15 years, let's say, you and your group are doing a lot of multimodal language model development or fine-tuning – but it's all kind of distributional learning – and then posing what would essentially be human-subjects studies, kind of HCI studies, with the resulting artifacts, to see whether a model and a person could achieve a goal together, or two models could achieve a goal together. And that would be as far as you could get from all these classification decisions we currently evaluate on. What do you think about that for your future?

Ellie Pavlick:Yeah, I mean, I would love it if we were doing that. I would totally consider that. I'm very interested in the why question. So, for the same reasons as the behaviorism argument, I still feel like I wouldn't be satisfied even if people were having a very productive interaction with this model.

Chris Potts:We can also do brain surgery on the models!

Ellie Pavlick:I would also want to do the brain surgery part, yeah. But on the humans as well as the models. No, no, no! On the models! As long as we have that piece, too, then I would love that. I feel like that type of an evaluation feels more real. But it's kind of like the Turing test. Maybe there's nothing creative about that. And the Turing test, I guess, has some very specific parameters, but it's that intuition. You can interact with it like you would a human.

Chris Potts:I did almost say Turing test, but I didn't, because I don't think it needs to be adversarial. I don't think we need to hide that one of these things is a machine. I think if they're trying to achieve something together, that might be the whole purpose. In which case, knowing it's a machine could be helpful. But certainly, thinking it was a human could be just downright unproductive.

Ellie Pavlick:Right, right, right.

Chris Potts:I have one more question and then I'll check in with Sterling on student questions. And it was just about the HANS paper that you mentioned. What you described very briefly was essentially all the tools in the toolkit of NLP as a response to this challenge: let's train on a little bit of the data; let's train on different data and hope it helps; but of course, let's keep assessing on this word-overlap benchmark to see whether we're making improvements. And it sort of sounded like you thought none of those things is the best thing to be doing. So what should people do when they see a failure like the HANS failure?

Ellie Pavlick:Yeah. Again, I'm hesitant to say this is what people should do, because I think it's really important that we have a lot of people that have slightly different scientific philosophies doing the same stuff. I think that's been our strength. A lot of the models we're currently looking at that are amazing are the product of just trying stuff and seeing what works, and I think that's been a good thing. So we've always had a strong engineering kind of hacker–tinker culture, and that's been very good and produced a lot of really good stuff. So my feeling is we want theory-driven, hypothesis-driven studies. But if that's all we had, we wouldn't have the models that we have to analyze.

So I don't think it's bad that people are trying to hill-climb on HANS, and very likely good things will come out of that that will give us some insights and stuff, but it's just: we don't want everyone to hill-climb on HANS. And I don't think we have that. We have a nice diversity. And I don't think there's this need to say, "What is the bar?" – to commit to what counts as enough and say, "We're locking that in. This is the understanding." So it should be that, once people are winning on HANS, or crushing it, or at human level, whatever it is, the goal-posts might move. Now there's a new thing, and somebody's going to be looking and saying, "Oh, you improved on HANS, but you introduced all these other artifacts," which is actually what I tend to see happen: you train on something that looks HANS-like with these data augmentation approaches, which are very popular and which I'm kind of squeamish about, because it's like, "Well, now you've kind of introduced this artificial data into your training data, and now the model kind of learns some sub-network for how to deal with artificial data, but it's still doing all the same crap on natural language." There could be weird stuff. So somebody else is going to come in and figure out that that's what's going on, and now there's a new bar, but I don't know. I think we need both.

I feel like people do like to really criticize right now and be like, "Oh, you guys are doing it wrong. Everyone should be doing this thing." Or, "This kind of work is meaningless." I don't know. I think it's been really good for the field that we have the "let's build bigger and better and see what we can do" crowd, and then the skeptics and the critics who aren't actually building models and are just criticizing everyone else's. I think those two things work really well together, so I think we should keep doing that. It's hard to deny we're in a heyday right now. We're doing a lot of things well. We should kind of keep doing what we're doing as a field. We're building cool stuff and we're getting cool insights. It seems like things are going well right now.

Chris Potts:Oh, interesting. Well, there's a whole bunch here that I would like to unpack, but let me just pick up on that last thing, because I had another prepared question. Maybe it'll cause you to reconsider. But the way I framed the question was this: throughout AI right now we constantly tell ourselves and the world how successful we are. And I suppose that's good for morale. Everyone feels excited and energized and successful, but here's the question. If you had to convince a smart, but non-technical skeptic that we're as amazing as we claim we are, what would you use as your evidence, specifically for NLP? What systems would you cite to convince this hard-nosed skeptic that we're actually deserving of all our self-praise?

Ellie Pavlick:Yeah. Yeah. It's an interesting question because I feel like I'm always on the opposite side. I try to convince people we're not as good – that the hype is over-hyped. It's mostly the argument that models can be passing all of our benchmarks, but under the hood they're using some tricks and yada, yada – that's the line of argument I use. Even though, actually, personally, I've been more and more impressed, and I think they're actually using fewer and fewer tricks. But I'm still of the mind that it's not solved. I think most people get more excited and I try to tell them it's the opposite.

So if I was talking to somebody who is more skeptical and trying to convince them it's really exciting? This is why I've had some discussions with journalists and I feel like I quickly lose their attention, because I don't have a nice splashy take. It's not like, "Oh, it's because GPT-3 is writing poetry and that's why I'm excited." Those are cool, and PaLM is explaining jokes, but it's not that. It's literally that when we go through and we do these probing analyses, we find some really toy problem and we pick out this thing, and we're like, "Oh, actually the internal structures do seem kind of nice. They're kind of being reused. They still pick up on some artifacts, they still fail in these catastrophic ways and stuff, but the bones look kind of good." It's not like we go in and pick around and it's all full of feathers and crap. It's like, "Oh no, actually we're finding the structures we'd hope to find. They're connected in the ways that we want." Yeah, it got kind of confused, but it can be forgiven for that because the data had that artifact. It didn't know. So I feel like it's these super small studies – in particular because my lab tends to be more skeptical, and we go through trying to find ways that the models are wrong. It's often like, "Okay, it can't be this good." And then I consistently have to roll that back and I'm like, "Oh, actually, it's better."
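As a concrete illustration of the kind of probing analysis mentioned here, the following is a minimal sketch, not the setup from any particular study: freeze a pre-trained encoder, extract representations, and fit a simple classifier to see whether some property is linearly decodable from the hidden states. It assumes the Hugging Face `transformers` and `scikit-learn` libraries; the toy "past tense" property and its labels are invented for the example.

```python
# Minimal linear-probe sketch on frozen BERT representations (illustrative).

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()  # representations are frozen; only the probe is trained

def embed(sentences):
    """Mean-pooled hidden states from the frozen encoder."""
    with torch.no_grad():
        batch = tokenizer(sentences, padding=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state       # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)     # (B, H)
    return pooled.numpy()

# Toy property: is the sentence past tense? (labels made up for illustration)
sentences = ["The dog barked.", "The dog barks.",
             "She wrote a letter.", "She writes a letter."]
labels = [1, 0, 1, 0]

probe = LogisticRegression(max_iter=1000).fit(embed(sentences), labels)
print(probe.score(embed(sentences), labels))  # in-sample accuracy of the probe
```

Real probing studies add controls (held-out data, random baselines, selectivity checks); this sketch only shows the basic pattern of asking what a frozen model's representations encode.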

I mean, maybe if I had to pick one study: my lab's just been excited about this one from Yonatan Belinkov's group. It's the Finding and Editing Knowledge in GPT-3. I don't know if you've seen this one. It's like GPT-3 says the Eiffel Tower's in Paris, and you can go and update a part of the parameters to make it think it's in Rome, and now it not only says Eiffel Tower's in Rome, but it answers other questions. If you're like, "I want to get lunch after visiting the Eiffel Tower. Where should I go?" And it recommends some restaurants in Rome and stuff like that. It's cool. Those kinds of things, I'm like, "Oh yeah, it's based on the right kind of stuff."
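For readers unfamiliar with the work being described here (likely the ROME line of model-editing research, "Locating and Editing Factual Associations in GPT"), below is a rough sketch of the before/after behavioral check she mentions. The `edit_fact` function is a hypothetical placeholder, not the actual editing method; only the evaluation pattern is illustrated, using the Hugging Face `transformers` pipeline with GPT-2.

```python
# Sketch of a before/after behavioral check for model editing (illustrative).
# A real editor (e.g., ROME-style editing) would locate and rewrite specific
# MLP weights; here edit_fact is a no-op stub, so only the test harness is shown.

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def ask(prompt):
    """Greedy continuation of a prompt, returning only the generated part."""
    out = generator(prompt, max_new_tokens=15, do_sample=False)
    return out[0]["generated_text"][len(prompt):].strip()

def edit_fact(model, subject, relation, new_object):
    """Hypothetical stub: a real method would modify the weights that store
    (subject, relation) and return the edited model."""
    return model  # no-op here

probes = [
    "The Eiffel Tower is located in the city of",
    "To get lunch near the Eiffel Tower, you should try a restaurant in",
]

print("Before edit:")
for p in probes:
    print(" ", p, "->", ask(p))

generator.model = edit_fact(generator.model, "Eiffel Tower", "located in", "Rome")

print("After edit (no-op stub, so answers should not change):")
for p in probes:
    print(" ", p, "->", ask(p))
```

The interesting result in the work she cites is that, with a real editing method, the second, indirect probe changes too, which is the sense in which the edit "generalizes."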

Chris Potts:This is amazing because – well, I'll put some words in your mouth, because in a way this is the most optimistic answer anyone has ever given to this question, in my experience. Because what I heard you saying to this stranger, you say, "Yeah, I grant you that we haven't done anything like curing COVID, so it is a little weird that we're constantly telling ourselves how amazing we are, but the reason everyone is so excited and energized is that we feel that at long last we're actually on the right path to achieving the really grand vision of artificial intelligence. We've seen glimpses of it, so we feel like it's no longer that we're building a really tall ladder to get to the moon. It's like now we know about rockets." Is that what you're saying?

Ellie Pavlick:I kind of want to say yes, but I'm afraid of being the one who then, in two years, people are like, "You were the idiot who declared mission accomplished and then this!" You know how this instantly happens, where people are like, "Oh, we've solved it," and it's always right before the whole thing is revealed to be a sham. Like we're going to find out GPT-3 is really just OpenAI having some people manically typing answers to all of your questions – there's just some humans back there answering all your queries. And we're going to be like, "Oh God, we're so stupid. That's why." But I feel like, when we look into it, it's like: they do some dumb things, but it doesn't feel like it's so far off to me. It's not like with Eliza or whatever, those old ones where you look and you're like, "I can see the algorithm it's using and it's the wrong one." Here, it's like, we get these interesting conceptual structures. There are things that we're definitely not doing, like this kind of complex higher-order combinatorial logical reasoning or something. But we also haven't seen anything that makes it really clear it will not be able to do that. So there's not anything that I've seen that's like, we're definitely going down the wrong path here. And actually a ton of the stuff it's doing, we used to think it wouldn't be able to do, so it makes me very optimistic, I think.

Chris Potts:Well, we'll worry together. I'm putting words in your mouth for a reason because I just enthusiastically posted on YouTube a talk I gave called "Could a purely self-supervised language model achieve grounded language understanding?" where I boldly said yes.

Ellie Pavlick:Oh, excellent.

Chris Potts:So obviously I'm looking for people who are also going to commit publicly!

Ellie Pavlick:Yeah, yeah, yeah. So we'll go down together! Yeah!

Chris Potts:Both of us under the mission accomplished banner.

Ellie Pavlick:Right. When I do my Language Understanding: Humans and Machines course, there's this quote from, like, Minsky or someone that's like, "We're doing a workshop on machine translation. We're getting together for a month. We plan to solve it in a month and then we're going to go out for drinks and do stuff for the second half of the summer," or something. And I feel like this will be quoted somewhere in 50 years – like, "Here's Ellie saying they solved it" – and I'll be one of those people.

Chris Potts:Oh, I hadn't made that connection before. That was John McCarthy saying that about the Dartmouth Summer School.

Ellie Pavlick:Oh yeah. That was it. Yeah, yeah.

Chris Potts:It was absolutely an all-star team.

Ellie Pavlick:Right.

Chris Potts:Claude Shannon was one of them. So if history looks back and says that we are analogous to them, I can live with that.

Ellie Pavlick:That's true. Yeah. So maybe that was my attempt to be modest and super self-congratulatory. So yes, I will probably not be quoted in any book, but I will have said a similar thing to the famous person who was quoted saying the embarrassing thing.

Chris Potts:Sterling, any student questions before I switch gears a little bit?

Sterling Alic:Yes, we do. I have one question that says, "Do you think the features – for example, the ability to reason, acquire knowledge, and deal with long-term dependencies – of super-large-scale models, for example PaLM and CLIP, would be drastically different from the pre-trained language models that we commonly study, such as BERT and RoBERTa?"

Ellie Pavlick:Sorry. They asked whether the features from PaLM and CLIP – what was the thing that unifies those?

Sterling Alic:Large-scale models versus commonly studied pre-trained language models like BERT or RoBERTa.

Chris Potts:Oh, no. So BERT isn't large-scale anymore, Sterling? Time flies by.

Ellie Pavlick:So the question is, does the scale really matter here?

Sterling Alic:Yeah. I don't think they were calling BERT large-scale – they're just contrasting the commonly studied pre-trained language models with the larger-scale models like PaLM and CLIP.

Ellie Pavlick:I see. I wasn't sure if they wanted the multi-modality, CLIP being a multimodal model.

Sterling Alic:I see.

Ellie Pavlick:Yeah. I feel like I'm willing to believe that scale would matter for some things. One thing is we've seen benefits of scale. There's also – again, the theory is out of my expertise, but – there are these arguments, like the lottery ticket hypothesis, for why the larger models might learn better and optimize better: you can kind of entertain multiple things and prune and do that stuff more than you can in the smaller models. So I guess I would feel like, based on what we're seeing, it would be naive to say scale doesn't matter. But that doesn't mean that, once we get the large-scale models and see them working, we couldn't replicate it in a smaller model. Objectively, though, the bigger models are performing better, and for certain types of things even these probing types of studies get better with larger-scale models, too. So I'm speculating, but my gut instinct is that scale will matter up to a point. If we start going from trillions to many trillions, I'm not sure, because one argument is the human one – I think I looked this up, and there are debates, but we're arguably around the size of GPT-3.

Chris Potts:Ninety billion parameters, in the sense that we talked about.

Ellie Pavlick:Yeah, so that order of magnitude. But trillions – we don't need that. You should be able to do it with billions. So that's one argument. And then, also just an intuition, there's the kind of bottleneck argument, which is: if you have that many parameters, you probably don't have any incentive to learn abstractions – although we're already over-parameterized, so take that as you will. I imagine there will be diminishing benefits to scale, but I don't think scale is a red herring here. I feel like I'd be a bad scientist to look at the current data and say scale doesn't matter.

Chris Potts:What about this, Ellie? Let's just call it a hundred billion is sufficient if you're getting rich multimodal data, but if all you're getting is text, might 500 billion do it?

Ellie Pavlick:Sure, that sounds good! I think it's going to be some function of what you're taking in and what you're trying to do. And there's always the question of what we're trying to learn from scratch versus what's built in, and things like that. So you could imagine some kinds of pre-training schemes where it builds up to it – you could pre-train some smaller parts of the network. Okay, now I'm doing multimodal again, but, for example, basic vision: you could pre-train the "visual cortex" of the model, which can be much smaller, to just do some basic edge detection and shape detection. You don't need a hundred billion parameters to do that. But then when you start synthesizing it with other things, you need those other parameters. So it's not like everything needs to be trained end to end from scratch, the whole thing at once. Maybe some text processing part doesn't need all of that, either.
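A minimal sketch of the modular idea in this answer: pre-train (or imagine pre-training) a small visual module separately, freeze it, and train only a larger head on top, so the whole system isn't trained end to end from scratch. The module names, sizes, and PyTorch setup here are illustrative assumptions, not a description of any existing system.

```python
# Illustrative composition of a small frozen "visual" module with a larger head.

import torch
import torch.nn as nn

class TinyVisionModule(nn.Module):
    """Small convolutional front end, imagined as pre-trained on a cheap task
    like edge/shape detection and then frozen."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, images):                 # (B, 3, H, W) -> (B, 32)
        return self.conv(images).flatten(1)

class Composed(nn.Module):
    """Larger downstream model that consumes the frozen module's features."""
    def __init__(self, vision):
        super().__init__()
        self.vision = vision
        for p in self.vision.parameters():     # freeze the pre-trained part
            p.requires_grad = False
        self.head = nn.Sequential(nn.Linear(32, 256), nn.ReLU(),
                                  nn.Linear(256, 10))

    def forward(self, images):
        return self.head(self.vision(images))

model = Composed(TinyVisionModule())
logits = model(torch.randn(2, 3, 64, 64))      # dummy forward pass
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(logits.shape, "trainable parameters (head only):", trainable)
```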

Sterling Alic:Awesome. And then we have one more question about career. So after completing your PhD in CS, what led you to continue in academia versus industry?

Ellie Pavlick:Yeah. People always ask that. I feel like I never really wanted to go into industry. I like my work at Google. I really like what I'm doing there. It allows me to work on some of these big models. It's exciting. I have great colleagues there. There's a gazillion NLP people, whereas at Brown there aren't many. But I just really view myself as an academic. For a lot of these questions, the kind of philosophical ones, I like to be able to slip between fields pretty easily.

When I was in my PhD, I did more linguistic semantics stuff. Actually, I really only started doing that in the last couple years of my PhD. I got into that one, too, for a couple years. Now I'm like, I feel like more CogSci/Psych stuff is relevant. I've been talking to some neuroscientists. It feels like that's the kind of stuff we need to do now. And when you're at a university, you can then just be like, "This is what I need to do now so I'm doing it." And there's very few barriers to that. Finding those people when you're in a company would require convincing the company to get those people or reaching outside the organization.

To me, I feel like you can just kind of follow what you know to be the right thing to be working on and no one's really checking up on you. I mean, you have to get tenure, but as long as you can crank out papers and do some stuff. There's some pressures, but no one's really like approving your choice of research. So if you're like, "I feel like this is the thing I need to work on right now and I need to go learn all this stuff that I don't currently know," there's just few barriers to that and it feels much more right.

So I don't think I was ever considering not going into academia. I like it. I like having all my students. I like working on a zillion things at once. When you're a faculty member, you kind of get a whole team of people all at once and you're like, "Oh, we're just going to do a zillion things." If you go into industry, you're going to work on your one project at first and then get involved with some other things. So I don't know. The fact that I'm also at Google was kind of this bonus – I'm like, "This is great. I'm also doing that" – but I wasn't seriously considering industry, ever.

Chris Potts:That's good to hear. We interviewed Adina Williams and I asked her about life at Meta/Facebook and she said, "You know, a lot of it is kind of like being a faculty member. For example, we have meetings that are kind of like faculty meetings where we get together and talk about big research ideas," and I cut in and said, "I'm not sure. It sounds like you haven't been to many faculty meetings."

Ellie Pavlick:No! Yeah!

Chris Potts:That sounds wonderful. If only my faculty meetings could be about ideas.

Ellie Pavlick:Yeah. Actually, we have had some faculty meetings where we've tried to. People were like, "We want to hear what each other is doing. Let's have faculty give these flash talks about their research." So give a five-minute talk about your research at the start of the faculty meeting. And then, like, 45 minutes later, it's, "Please stop talking. We have other things to do." It just monopolized the whole thing! Which is great! I would rather listen to something about cryptocurrency than vote on the flyer for the cybersecurity Masters or whatever we're doing. But I think that's the reason we don't talk about research in faculty meetings: everyone so much wants to do that that they're like, "No, we're going to keep it boring and dry and you're going to vote."

Chris Potts:I'm glad you put in a good word for academia, but I do feel like I should ask the other side. Is there something about being at Google AI that enables you to do things that would be harder in academia? Is that the attraction?

Ellie Pavlick:Yeah, definitely. Well, you guys are at Stanford, so maybe you're not as resource-impoverished as other universities, but you talk to most PhD students these days and they're like, "We can't work with the models." The models are all coming out of industry and everyone's dealing with that.

Chris Potts:We're in the same boat.

Ellie Pavlick:So yeah, it's true. Like, PaLM came out. I can play with PaLM. That's cool. We can work on the PaLM model. And I couldn't do that if I wasn't at Google. That's really a huge part of it: right now, where the research is, it's happening in industry. That deeply bothers me. So I'm working on the BigScience effort, and I think there's increasing awareness that we should change that. It's not fair – well, it's not even a fairness thing. It's not good for science to have work happen that way.

But it is the case that industry is a very exciting place to be right now. And I feel like it's the way people talk about Bell Labs, so there's maybe some FOMO. You're at these companies and there are so many cool people, so many smart people. There are hundreds of NLP people working on this, and vision and all these other things – as opposed to me at Brown, where I guess there's a couple of other NLP-adjacent people who I love working with, but it's not like there are 100 people who have all read the paper about the Rome thing that you can talk to about it. So there's a lot of excitement in industry right now, and I'm very glad I'm there and I get a ton out of it, but I also have this feeling that it won't always be that way. I feel like it'll be something where, at the end of my career, I'll be like, "Oh, I remember being at Google in this time period – what a cool place to be." Whereas I'm pretty confident that the work I want to do, I'll still be able to do 40 years from now, or whenever, in academia.

Chris Potts:Right. Cool.

Ellie Pavlick:But I think industry is a very exciting place to be right now. And most of my students want to go into industry and I do not blame them for that. I encourage them. Like, "Yeah, go do it. It's possible. Also, you'll make way more and you can come take me out to a nice dinner."

Chris Potts:It's so true, sadly!

Ellie Pavlick:Yes.

Chris Potts:Do you have time for one more question?

Ellie Pavlick:Yeah, sure.

Chris Potts:This is just about our course. The students are really embarking now on their final projects. They're doing their lit reviews, which are due on Wednesday and then they kind of do an experimental protocol document, and then a final paper in the usual way. Do you have any advice for them on choosing topics – or whatever comes to mind for you when you think about advising eager young students, maybe just starting out in the field of NLP?

Ellie Pavlick:Well, I don't have sweeping advice, but maybe something more immediate and practical, just having worked with other students, especially ones just starting out: pick something you're genuinely excited about and then unsubscribe from arXiv – just don't look at arXiv for a little bit. I see students who come in genuinely excited about a thing, and then they just doubt themselves and doubt themselves, and they keep switching and keep switching because somebody else is doing something cooler. You probably had good intuition about something that was actually fundamentally interesting. Just work on that. And you'll probably learn something working on it, even if it's on a slightly smaller model than whatever the latest one is, or something. But I would say don't spend too much time on arXiv. It'll be okay if you miss something – like if you didn't hear about a paper.

Chris Potts:Right. Sure. I mean, I have so much faith that, even if they happen to be doing something that is in an arXiv paper that appears tomorrow, they'll do something different that could lead in new directions.

Ellie Pavlick:Yeah. I've said that before – people always feel like they were scooped, and they often weren't actually scooped. There's something slightly different about what you were doing, and there's a different insight and stuff. But I think it's stressful to be a student these days, so I would say, "Have a little faith in yourself. You probably actually have good intuitions about what's interesting."

Chris Potts:Is there an area that you'd steer them toward, where you think they might be able to have an impact because it's new or vibrant in a particular way?

Ellie Pavlick:I feel like probably most areas. I mean, I would say: don't try to come up with a slight variant – a slight architecture variant or a learning-objective variant – that'll improve a pre-trained language model. It's not that they can't be improved upon, but I just don't feel like that fits a course project.

I think evaluation work is a good thing to do. I'm biased because I like to do that. But I feel like you learn from negative results and positive results, so the chance that you learn something, no matter how it turns out, is high. So I do think evaluation-based work is rarely time wasted.

I don't have a specific area. I mean, obviously, I love all the linguistics-based stuff. I think pragmatics is underexplored right now – that's a really exciting area that there's not a ton of work on. I think multimodal stuff is super exciting; for a final course project it could be hard because data is a limiting factor, but I think all of that is exciting. But I do think evaluation-based studies tend to lead to something.

Chris Potts:Especially, I mean, recalling our first theme, if they could combine, say, an adversarial task or some kind of behavioral evaluation with something that did allow them to look inside the model, I feel like that's a vibrant area, doesn't cost you a lot in terms of resources, and the landscape of ideas is hardly explored at this point.

Ellie Pavlick:Yeah. A buzz phrase I've been using more is "converging evidence." I feel like that's the thing we need – multiple ways of evaluating the same thing. That seems like a useful use of time.

Chris Potts:Absolutely. That should be really inspiring.

Well, thank you so much for doing this, Ellie! This was a wonderful conversation!

Ellie Pavlick:Yeah. It was great to be here. Thank you.

Chris Potts:Yeah. And thank you to Sterling and to the students who asked questions. All the best to everyone, yeah!

Ellie Pavlick:Yeah!