Podcast episode: Rishi Bommasani

April 11, 2022

With Chris Potts and Dhara Yu

Deriving static representations from contextual ones, interdisciplinary research, training large models, the "Foundation Models" paper and CRFM, being an academic on Twitter, progress in NLP.

Show notes

Transcript

Chris Potts:All right! I am excited about this! Welcome, everyone! This is our first special event of the quarter and I'm delighted to have Rishi kicking it off for us.

Rishi Bommasani was an undergrad doing NLP work and related things, like math and ML and stuff at Cornell. And now, he's a student in the NLP group here and a PhD student in computer science.

You know him from Bommasani et al. 2020, which is a paper that you probably were just working with before joining this call, maybe as part of your original system. And you might know him also as the lead author or the lead organizer of the massive paper that a lot of us contributed to on foundation models recently. And I think that's been a focus of Rishi's research before then and ever since. And that'll be one of the main things we talk about today.

So, welcome, Rishi! I have a bunch of questions for you. I thought I would kick it off in a way that's related to Bommasani et al 2020. So, as you might remember, because you're a distinguished former course assistant for this course, we just did bake-off 1. And it was word relatedness, so they get pairs of words and a score. And the idea is to learn to make good predictions about relatedness in that general sense for word pairs. So you Rishi, in 2022, what would you have done for your original system for this bake-off?

Rishi Bommasani:That's really interesting. I haven't thought about this since last year.

Chris Potts:That's why I asked first. I wanted to catch you off guard. I have easier questions for you.

Rishi Bommasani:Yes, absolutely. No, this is wonderful. Do I know anything about what words I'm trying to embed, rare words or frequent words? Because it might change things quite a bit.

Chris Potts:Good question. I think it's a good mix. There's a lot of frequent words, but some rare ones, let's say. I know that can be a little touchy for your data-intensive aggregated method.

Rishi Bommasani:Yeah, that's right. I think there's actually been a lot of follow-ups on my paper, which I think are much better in terms of doing this, in terms of the performance on word relatedness and word similarity datasets.

I think, just to start, I'd probably do something similar to what I did two or three years ago, when the work was being done: aggregate over many contexts. That definitely seemed to be very helpful.

In terms of what model I'd use, it's not super clear to me. Maybe I'd just start with RoBERTa or T5 as a baseline, but I think that's probably where I'd start.

And then it seems like there's actually quite a bit of action there: how you post-process the embeddings might be quite useful, or what similarity metric you choose. In the paper, I tried to be pretty non-committal, using cosine similarity, which was pretty standard, and not doing any kind of post-processing. But with the benefit of seeing a lot of what the students did last year, it seems like the things that worked better – adding post-processing on top, or learning a distance metric rather than using cosine similarity – can be quite helpful. So that's probably what I would do.
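
As a rough illustration of the aggregation approach described here, a minimal sketch might look like the following, assuming Hugging Face `transformers` and PyTorch; the model choice, the subword-matching heuristic, and mean pooling are illustrative rather than the exact Bommasani et al. 2020 recipe.

```python
# Sketch: derive a static vector for a word by mean-pooling its contextual
# representations over many sentences that contain it, then score word pairs
# with cosine similarity.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # baseline mentioned above
model = AutoModel.from_pretrained("roberta-base")
model.eval()

def static_embedding(word, contexts):
    """Mean-pool the contextual vectors of `word` across a list of sentences."""
    vecs = []
    # RoBERTa's BPE marks word-initial tokens with a leading space, so we
    # tokenize " word" and look for that subword sequence in each sentence.
    word_ids = tokenizer(" " + word, add_special_tokens=False)["input_ids"]
    for sent in contexts:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        ids = enc["input_ids"][0].tolist()
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i:i + len(word_ids)] == word_ids:
                vecs.append(hidden[i:i + len(word_ids)].mean(dim=0))
                break
    return torch.stack(vecs).mean(dim=0) if vecs else None

def relatedness(w1, w2, contexts1, contexts2):
    # Assumes each word actually occurs in its contexts.
    v1 = static_embedding(w1, contexts1)
    v2 = static_embedding(w2, contexts2)
    return torch.cosine_similarity(v1, v2, dim=0).item()
```

Post-processing (for example, mean-centering the vectors) or swapping cosine similarity for a learned metric would slot in after `static_embedding`.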

Chris Potts:Yeah. It can be tricky to get right, but I'm always optimistic that students will do something cool with their representations, like use your aggregated approach and then learn a distance function, because it just seems like there's extra juice you can get from the dataset by learning that function. As I said, it can be tricky to get right, but I like that. That's good. You're still fundamentally an adherent of Bommasani et al. 2020.

Rishi Bommasani:I guess so. I think there are a few papers I might cite preferentially to my own, but yeah, I think the core method is roughly derivative of that one.

Chris Potts:So I'm actually curious to know about the papers, but let me just step back a second. I'm curious to know the origins of the work.

So you were an undergrad at Cornell. And it's the sort of paper that someone very seasoned in the field might write, because it's a simple approach, the paper is oriented toward testing it robustly, it solves a bunch of problems. I guess you can hear that I'm a little bit envious of you having written this paper. Where did the idea come from?

Rishi Bommasani:Yeah. So I think this aesthetic of trying to do simple things and thinking about baselines is definitely one (A) I learned from my undergrad advisor, who's Claire Cardie, who's been in the field for a long time and definitely believes in doing things in this kind of way and (B) it's definitely an aesthetic I admired in Danqi Chen, who was actually a PhD student here and is now faculty at Princeton, who definitely embodied this in a lot of her work.

And I think there are two motivations. So this paper has some sentimental value to me, as my first conference paper, and it has two motivations that are both pretty interesting and pretty different from each other.

One is this idea that, as someone who did some amount of theory at Cornell, one idea in theory is this notion of a reduction – of taking some problem and reducing it to a problem that you already know how to solve.

And in NLP, we were just seeing contextualized representations emerge. They were clearly doing a lot of interesting things. We saw, in that time, the first wave of papers that tried to do analysis on them. BlackBoxNLP, which is a workshop thinking about analysis, was emerging right about then. And we were, by and large, building a new suite of techniques to analyze the representations we were producing, which is totally reasonable. These representations are very different from word embeddings, because they're contextual, but it seemed pretty natural to me, at least at the time, that we'd already built so much of the infrastructure of the field for the past four or five years around word embeddings as a way of thinking about things.

And in particular, we had these word similarity tasks, or word relatedness tasks, and it seemed natural to try to reuse that machinery and work that the community had done for almost five years. And it also made it more natural to compare the two kinds of representations, if we wanted some representational-level comparison between word embeddings and contextualized representations, given that their type signatures are otherwise different.

So that's one class of motivation. It's about reusing and not reinventing the wheel and reusing things that the field had. And I think I can definitely spiritually attribute that to people like Claire and Danqi, for doing that in a lot of their work.

And the other side of the coin is, through Twitter, which is interesting. I didn't know Ted Underwood, who's a faculty in Computer Science and English at UIUC, but Ted is a really interesting person who was on Twitter talking about ... He has a blog post, which I think is cited in my paper as a footnote, as a clear inspiration for the paper in the sense that digital humanists and computational social scientists have used word embeddings for a long time before word2vec and GloVe, but certainly after, as well.

And right around then, Sudeep Bhatia, who's a computational social scientist at UPenn, gave a talk at Cornell about using LSA and other things for his work in psychology. So I was seeing Ted tweeting about how it would be nice to use BERT, but he couldn't seem to get it to work and it wasn't useful for the types of things he was studying. I saw Sudeep's talk when he came to Cornell. And even though I had not really done any work in the digital humanities, I just observed that it seemed like something new was happening in our field of NLP, and it seemed like other people had made use of the things we had before in NLP. So it would be nice if they could also use BERT or ELMo or all the things around them.

And so that was the second motivation: can we take our representations and make them, basically, again for different reasons, but again, have the same type signatures as word embeddings so that all of these other people who might want to use them in entirely different scholarly communities could do so in, hopefully, a pretty easy-to-use way? So those are the two motivations. I think they're different, but nicely coalesced in this.

Chris Potts:Absolutely. Yeah. The second one really resonates with me, and I had a hunch that that might be true. And you mentioned all those connections, but I was thinking just being at Cornell as an NLPer might make you feel very connected to computational social science because they have so many people doing that.

Rishi Bommasani:Absolutely.

Chris Potts:And it's just this sort of thing where, yeah, as you said, we were all rushing, "Oh, it's exciting. We finally have contextual representations," whereas these adjacent fields had just come to terms with thinking of words as represented by vectors. We had just convinced them to move away from LIWC and unigrams and, now, we pulled this trick of saying, "Oh, we don't use those fixed vectors anymore." And they're like, "Yeah, but I was just figuring out how to state hypotheses in these terms, using my theories." And so this is potentially the best of both worlds and I think that's wonderful. Yeah.

Should we be doing more of that kind of thing? Maybe even in the era of BERT, they might want perplexity scores and we're not really telling them how to calculate those. We're leaving it to them, and things like that. Are there other areas where NLPers should be looking in that direction?

Rishi Bommasani:I think there definitely are. To be clear, I think one thing one has to be careful about is making sure that the tools you're building are actually useful for other people, and that we're not imagining use cases that don't exist, and that people in other fields would actually genuinely want the things we're doing. I don't think I had such a clear understanding back then, but besides that, I think there are definitely a lot of these use cases.

I'm advised by Dan Jurafsky, and a lot of his past work has thought about this in a variety of domains of applying the things we're seeing in NLP to ask computational versions of questions that have been asked in the social sciences for awhile.

And it seems like perplexity could be another thing that's useful. Even for models where it's straightforward for someone in NLP to compute the perplexity, providing that as an easy-to-use package would be a nice service. And further, for models where it's not as obvious or clear how you get a perplexity, I think that would also be pretty nice.

Chris Potts:A lot of those fields had used unigram-based or n-gram-based language models for awhile to estimate the probabilities of text. And then, with an autoregressive model, I feel like you can kind of say, "Well, you can continue on as you were," but then the wonder of BERT is that it pulls this trick of having bidirectional context and then everything is less clear than maybe you tricked yourself into thinking it was.

Rishi Bommasani:Yeah, I think that's right. That's right. Actually, one of the things that I had hoped to work on, I've never really gone after, but I think would be cool, is if you take all of these models, you take an autoregressive GPT-2 kind of language model, you could generate a bunch of texts using it and then fit some other language model, like a classic n-gram language model, to it. And I'd be curious how that compares to if you just train an n-gram language model on the original corpus, if there's some kind of smoothing effects that GPT-2 is doing that make it different from some of the n-gram LMs before in a useful or interesting way. It's not clear to me if that's at all useful for social scientists, but it might be for getting more reasonable estimates of certain things.
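
The experiment Rishi sketches might look roughly like the following, assuming NLTK's language-model tools and the Hugging Face `transformers` generation pipeline; the file paths, prompt, sample size, and smoothing choice are placeholders, since the experiment was never actually run in the episode.

```python
# Hypothetical experiment: fit the same n-gram LM once on an original corpus
# and once on text sampled from GPT-2, then compare held-out perplexities to
# see whether sampling from GPT-2 acts like an interesting smoother.
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams
from transformers import pipeline

N = 3  # trigram LM

def fit_ngram_lm(sentences):
    """`sentences` is a list of token lists."""
    train, vocab = padded_everygram_pipeline(N, sentences)
    lm = Laplace(N)
    lm.fit(train, vocab)
    return lm

def perplexity(lm, sentences):
    test_ngrams = [ng for sent in sentences
                   for ng in ngrams(pad_both_ends(sent, n=N), N)]
    return lm.perplexity(test_ngrams)

# 1. n-gram LM fit on the original corpus (illustrative file paths,
#    whitespace tokenization for simplicity).
original = [line.split() for line in open("corpus.txt")]
held_out = [line.split() for line in open("held_out.txt")]

# 2. n-gram LM fit on text sampled from GPT-2.
generator = pipeline("text-generation", model="gpt2")
samples = generator("The", do_sample=True, max_new_tokens=100,
                    num_return_sequences=50)
generated = [s["generated_text"].split() for s in samples]

print("original-corpus LM :", perplexity(fit_ngram_lm(original), held_out))
print("GPT-2-sample LM    :", perplexity(fit_ngram_lm(generated), held_out))
```

Comparing the two held-out perplexities would give a first read on whether the GPT-2-derived counts behave differently from counts estimated directly on the corpus.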

Chris Potts:Yeah. Very cool. And so you mentioned some follow-on work to the 2020 paper that you found inspiring. Did you want to say a bit more about that?

Rishi Bommasani:Yeah. There's actually been a lot of different kinds of work. So I think the paper gives a nice, simple method that's been used in a few different ways.

One of the ways is exactly as you presented it in the course. In the paper, one might notice that there's no invocation of the language of distributional semantics or vector space models to any great extent, (A) partially because I wasn't super familiar with that kind of writing. But (B), what some people have asked afterwards is a very natural question: if you think of this as a model for distributional semantics, or compare it against other vector space models, how is it different or similar? And they've done some kinds of further error analyses.

So I think those have partially clarified where this method is maybe good and, in some ways, why word similarity tests might overestimate the benefits of this method over other methods for building vector space models. So I think those are nice. I guess, in some sense, they were negative about my paper, but they were nice in clarifying how things are going.

And then the other line of work ... I think there's a paper from Jaggi et al. and a few others that I remember that simply improve the methods and think about them more extensively. In a lot of places where it wasn't clear what decision to make – the decision I made in the original paper was just a non-committal one of using cosine similarity, or only averaging with no post-processing – they actually ask: can we make other decisions that commit to doing something, and does that improve performance? And it turns out you can actually improve the method quite a bit for certain things if you make different decisions than I did.

Chris Potts:Interesting. But just for the record and for students who joined late: Rishi, what he would've done for the bake-off is his aggregated approach, I assume mean pooling, but he would've used a slightly larger model, RoBERTa. And I'll be curious ... If anyone out there tried that, you could let us know in the chat whether that was, in fact, a blazing success. Oh, and a learned distance function, because word relatedness is kind of idiosyncratic, as you might think, so let's just specialize to that with some of the parameters. Maybe we'll do better. Well, I'll let you know what happened this year. I'll open the leaderboard right after this meeting and check.
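
One hedged way to implement the learned distance function mentioned here (not any particular student's entry) is to replace cosine similarity with a small bilinear scorer fit to the labeled pairs; the class name and hyperparameters below are purely illustrative.

```python
# Sketch: fit a learnable generalization of the dot product on word-pair
# embeddings and their relatedness scores, instead of fixed cosine similarity.
import torch
import torch.nn as nn

class BilinearRelatedness(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.eye(dim))  # start at a plain dot product

    def forward(self, v1, v2):
        # score(v1, v2) = v1^T W v2, computed batch-wise
        return (v1 @ self.W * v2).sum(dim=-1)

def train_scorer(pairs, scores, dim, epochs=100, lr=1e-3):
    """`pairs`: tensor of shape (n, 2, dim) of word embeddings;
    `scores`: tensor of shape (n,) of gold relatedness scores."""
    model = BilinearRelatedness(dim)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        preds = model(pairs[:, 0], pairs[:, 1])
        loss = loss_fn(preds, scores)
        loss.backward()
        opt.step()
    return model
```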

Rishi Bommasani:Sure! Sounds good!

Chris Potts:Let's shift gears a little bit. I want to move into the realm of foundation models, I suppose, and talk about everything that happened last summer with the paper and the workshop and everything.

Rishi Bommasani:Sure.

Chris Potts:But let's go a step back from that because the Mistral effort, the internal effort, that predates that. So do you want to say a little bit about the origins of the Mistral project?

Rishi Bommasani:Yeah. That's actually a very interesting story. So it originated at the tail end of 2020, or in the winter between 2020 and 2021, so quite a bit before the foundation models paper came out. I was not involved at the onset, but this crazy idea came from Chris Ré, who had some of these exciting ideas: why don't we try to build GPT-3 in-house at Stanford?

Chris Potts:Yeah.

Rishi Bommasani:Which is an idea, for sure. Somehow, he looped Percy into it and the two of them got excited about it. And they had students who were also interested in thinking about ...

I think I actually remember, Chris, you mentioned this at some point, that with this whole foundation model and language model trend and so on, it's been like industry has taken the field by storm and academia has, in some ways, played second fiddle. And I think this was part of an internal question of (A) can we understand how these models are built, because we don't actually understand it, because we're not the ones doing it, and (B) can we maybe reclaim some of the ground we had lost?

Perhaps it's unnecessary competition, but at any rate, this is where it began: "Let's first replicate GPT-2," which we did not think was going to be very hard at all. It was meant to be a soft goal and we would move on.

Chris Potts:I have to ask, wait, were you at that workshop with HAI and the GPT-3 folks?

Rishi Bommasani:That was the first month I was at Stanford, so I was actually not at that workshop.

Chris Potts:I lost track of when that was, but it just reminded me because I was there. And I'm not supposed to reveal who said what, but let's say the OpenAI perspective at the time was, "Why can't you guys just do this? We did it. What's the problem?"

Rishi Bommasani:That's right.

Chris Potts:And maybe that was Chris Ré's thought, too, like, "Well, surely, if they can do it, I can do it."

Rishi Bommasani:Exactly.

Chris Potts:So the original vision was GPT-3 – "Let's just, more or less just directly replicate that."

Rishi Bommasani:That's right.

Chris Potts:And then GPT-2 was a warmup step?

Rishi Bommasani:Yes, that's precisely right. And it clearly took us a long time. It took us maybe six, seven months to replicate GPT-2. Even still, we're trying to replicate the largest of them.

Chris Potts:What's the parameters, do you remember? What's GPT-2, parameter-wise?

Rishi Bommasani:The smallest model in the GPT-2 suite is 124 million. The largest one is 1.5 billion. So we started with the smallest one.

Chris Potts:And GPT-3 is 175 billion.

Rishi Bommasani:Yes. So it's a thousandfold larger than the smallest one and a hundredfold larger than the biggest in the GPT-2 suite. So exactly. As we were striving for that, we were like, "Oh, of course, we'll figure out GPT-2 along the way." So Sidd Karamcheti in Percy's group (he's also advised by Dorsa Sadigh) and Laurel Orr, a post-doc with Chris Ré, championed the effort of, "Let's do this. Let's build an infrastructure," and I joined in somewhere along the way.

But I think what we learned is that, unlike almost all other things we see in NLP and ML and other fields, where reproducibility is increasing, a lot of the code to train these models was not out at the time. Many people are familiar with Hugging Face, but Hugging Face is really good for fine-tuning models, not really for training them. And at the time, its code had never been tested even at that scale, which is a thousand times smaller than GPT-3.

And so we learned pretty quickly that, even though we understood all of the ideas of a Transformer and an autoregressive LM and how to implement them, all of those things were not really the bottleneck. We had to build a code base, but the bottleneck was really a lot of stuff dealing with the stability of training, things we did not realize. So we were doing things wrong, so to speak, but if you're at a small enough scale, you don't realize it.

So let me give you an example. I'll talk about the two scales we operated with for a while in trying to build up the code base, which are 124 million, the smallest GPT-2, and 355 million, GPT-2 medium, the next size that they looked at. So we were just trying to reproduce the paper. We had very similar data and so on, and we had a lot of people who are really good at ML engineering, but what we learned is you could train a bunch of models with different random seeds that would produce similar perplexities to GPT-2 small, and on all the evaluations we had, they looked identical.

As soon as you scale to the medium model, some of the models will just crash. So what does crash mean? It means that, at some point during training, the activations will blow up. The model will become very quickly unstable. It'll perform very poorly very rapidly. It'll crash in any sense that a model can crash. And this is while already importing a lot of ideas from NVIDIA and Microsoft to make model training stable.

We were very fortunate to have Deepak Narayanan, who was a student of Matei Zaharia, who does systems and who had trained a trillion-parameter model – for a very, very short time, not a long run, but as a proof of concept, as an intern at NVIDIA, showing that you could momentarily train this massive model. So he really knew what he was doing and we learned a lot from him.

And they started crashing just at the medium size, which is still 500 times smaller than GPT-3. And what we learned is, actually, they would've crashed at the small size, too, and even at smaller sizes, if we had run them for long enough. We were just lucky that we didn't run them long enough to see any issues. And so we thought we could move on.

And this became, in some sense, the entire focus: how do we understand training stability, which is not well documented in any of the papers? It's not even a topic mentioned in almost all of these papers. And I think, certainly, most people in NLP, despite the however many thousands of papers that use these models, probably have no clue that they are this unstable, because, certainly, if you fine-tune BERT or something, it's not unstable. It's only about pretraining.

And that was the story. It's like we had to learn all of these tricks. We were looking at all kinds of code bases. It turns out one very important trick that people don't have a principled explanation for yet is this: In self-attention, there's this matrix computation between the key and query matrices where you multiply them. You want to do that in full precision, but everything else in half precision. Generally, you want to do everything in mixed precision, 16 bits, if that makes sense, because that'll let you scale your model and have a smaller memory footprint. But if you do that for the attention computation – the same one that's in Vaswani et al., the Transformers paper – training will be very unstable. So you have to figure out some way to make that one computation stable, and a bunch of different people have a bunch of different hacks, still with no real principle.

It's been five years since that paper. I don't think anyone has a principled approach. It could be principled. I don't think there's any reason it has to not be principled, but there's still very little work that even acknowledges that training is so unstable, and very little work that's principled in figuring out how to make it stable. So that was what we learned. We talked to a bunch of people at all of the different places training things, and learned that everyone had their own hacks but wasn't writing them in their papers for some reason. It's not really clear why that needs to be the case, but this is at least the story of how things went.
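
A minimal sketch of the precision trick described above, assuming PyTorch (an illustration of the idea, not Mistral's actual code): the query–key product and the softmax over it are upcast to fp32 while the rest of the computation stays in fp16.

```python
import math
import torch

def attention_scores(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim), typically in torch.float16."""
    # Upcast just this product: large dot products can overflow or lose
    # precision in fp16, one of the places training has been seen to blow up.
    scores = torch.matmul(q.float(), k.float().transpose(-2, -1))
    scores = scores / math.sqrt(q.size(-1))
    probs = torch.softmax(scores, dim=-1)        # softmax also in fp32
    return torch.matmul(probs.to(v.dtype), v)    # back to fp16 for the rest
```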

Chris Potts:That's interesting because we've, by now, seen really massive models from a number of different groups, essentially all from large tech companies. But also Eleuther has trained ... I mean, it's very impressive what Eleuther has achieved, a 6-billion-parameter model, GPT-J. So is everyone just painfully learning all these tricks? Mistral was an effort to share some of these lessons in a systematic way. Right? And that's great, but still, this must be just one layer of the many things you need to have figured out already.

Rishi Bommasani:That's right. I think we've definitely talked to some folks who've trained some of the largest models. And I think, at each factor of 10, roughly, there's a new challenge or a new trick. I mean, different people do different things to resolve the issue, but you run into the same fundamental problem that training becomes unstable yet again, and you do something. Eleuther's work, I think, is great because their code base is open. That's actually one of the ways we discovered our issue: their code base is open and we noticed some very small difference between the two. And that's how we found the issue. So I think this idea of making the training process and the training code and everything very open and transparent is pretty good for avoiding new actors in the space having to rediscover all the tricks and so on.

There are also some things that are not just about the code, but about what you do. So the example is in Google's most recent paper, called PaLM, which is this 540-billion-parameter model. They mention – I think it's one of the first times it's been mentioned, though it's been known silently in the community – that for these very large models, 100 billion parameters plus, even with this whole potpourri of fixes, at some point they still become unstable. This is one of the issues at such a large scale: we can't do things systematically, and so we don't actually have a very good understanding of many things, which some might argue is a reason not to do things at that scale. We can talk about that.

But regardless, what they've learned is, at some point, the model would crash. So what they would do is they would save all the checkpoints and then just, literally, rewind it to some point before it crashed, change the order of the data, and then run it again. And usually, it won't crash again.

Chris Potts:Oh, change the order of the data?

Rishi Bommasani:Yeah. And then you'd just repeat this. And this is necessary. As far as I'm aware, even in private knowledge I have, I don't believe anyone knows how to train a model of that scale that doesn't require this hack of just let it crash, rewind it, change the order of the data, and hope that it'll improve. It's a strange affair, especially for some of these largest models, to train them stably.
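
Schematically, the recovery loop described here might look like the following sketch; the callbacks (`make_loader`, `training_step`, `save_ckpt`, `load_ckpt`) and the blow-up check are placeholders rather than any group's actual pipeline.

```python
# Sketch of the "rewind and reshuffle" recovery hack: checkpoint regularly,
# and if the loss blows up, reload an earlier checkpoint, reseed the data
# order, and continue.
import math

def train_with_rewind(model, make_loader, training_step, save_ckpt, load_ckpt,
                      total_steps, ckpt_every=1000):
    """All callables are stand-ins for whatever the training stack provides:
    make_loader(seed) -> iterator of batches, training_step(model, batch) -> loss,
    save_ckpt(model, step), load_ckpt(model, step)."""
    step, seed = 0, 0
    loader = make_loader(seed)              # data order depends on the seed
    save_ckpt(model, step)
    last_ckpt = 0
    while step < total_steps:
        loss = training_step(model, next(loader))
        step += 1
        if step % ckpt_every == 0:
            save_ckpt(model, step)
            last_ckpt = step
        if math.isnan(loss) or math.isinf(loss):   # crude "the run blew up" check
            # Rewind to the last good checkpoint, reshuffle the data order,
            # and continue; reportedly this usually avoids a second crash.
            load_ckpt(model, last_ckpt)
            step = last_ckpt
            seed += 1
            loader = make_loader(seed)
    return model
```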

Chris Potts:Fascinating. So reflecting on this, you all actually did ... You were modest about this, but you trained lots of GPT-2 models, dozens and dozens of them, to study variance.

Rishi Bommasani:That's right.

Chris Potts:And that might be something we return to, but that was a major effort from a non-tech giant.

Rishi Bommasani:Sure.

Chris Potts:And GPT-J, is that the biggest private model that exists right now, as far as we know?

Rishi Bommasani:Or public, you mean?

Chris Potts:What I mean is ... Oh, yes. I mean, non-tech giants.

Rishi Bommasani:Oh, sorry. Got it. Yeah. I think the largest model from a non-tech giant that exists is ... I think there are two from Eleuther. One is GPT-J and one is GPT-NeoX. The newest one, NeoX, has 20 billion parameters, so it's even a little bit bigger.

Chris Potts:That's right. Okay.

Rishi Bommasani:And then on the horizon is this model being trained by Big Science, led by Hugging Face, which is the GPT-3 scale, but that is still being trained.

Chris Potts:And they are having to do some of what you described, right?

Rishi Bommasani:Exactly.

Chris Potts:Like recover from catastrophic failure?

Rishi Bommasani:Yes.

Chris Potts:But how big is the target for that?

Rishi Bommasani:It's playfully one billion parameters more than GPT-3, so 176 billion parameters.

Chris Potts:I love it. That's good.

Rishi Bommasani:But I think that one is going to be exciting because it's multilingual and it's the most intentional process, I think, that exists so far for data selection.

Chris Potts:Oh, yeah, but that's another fascinating thing. Okay, so I'll just express my skepticism that, as we go ... I don't know what the magic number is, but 100 billion, 200 billion, all the way up to this 540, it'll be diminishing returns unless something really interesting happens about the incoming data. And if everyone was using fixed data for that, it would, at a certain point, almost by definition, be pointless, and we might have long ago passed that when we're talking about 540 billion.

Rishi Bommasani:That's right.

Chris Potts:Is that your feeling, as well? So larger might, indeed, be better, but not without real creativity about the data.

Rishi Bommasani:I think that's right. It definitely feels like data, and being intentional in how data is selected, matters, because even if scaling helps, especially at these scales, scaling is really hard. We're really pushing the systems and spending a tremendous amount of resources to do it. And thinking about data certainly feels very under-explored. And certainly, my belief is pretty optimistic that we could do things that are a lot more clever about using data, or other ways of thinking about data.

For example, DeepMind – this isn't exactly what I mean – but DeepMind has a paper called Chinchilla (I guess we have some very interesting model names in the space, too), which is 70 billion parameters, so very big, but much smaller than other things, including their own 280-billion-parameter model, but you train for four times as much data, and that turns out to be a better allocation of compute. Of course, you need four times as much data.
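
A rough way to see why that is about the same compute budget, using the common approximation (not from the episode) that training compute scales with parameter count N and training tokens D as C ≈ 6ND:

```latex
C \approx 6ND, \qquad
6 \cdot (280\,\mathrm{B}) \cdot D \;\approx\; 6 \cdot (70\,\mathrm{B}) \cdot (4D)
```

So a model a quarter of the size, trained on four times as many tokens, costs roughly the same compute, and Chinchilla's result is that this trade is the better allocation.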

Chris Potts:That's a new paper. Right? I'm going to keep track of these links. That's a new paper that just came out, right?

Rishi Bommasani:Yes, yes.

Chris Potts:I was really glad. That answered a question that had been on my mind for a long time. That's great. Good evidence.

Rishi Bommasani:I think there's a question a lot of people have that hasn't been answered, which is almost all of these very large, 100-billion-plus-parameter models are trained so that they never look at the same data point twice, or at most twice sometimes, but compared to what we have done in a lot of other areas where we train for many epochs, that has not been the norm here. Usually, we train for one epoch.

Chris Potts:That's something I'd overlooked. That's interesting.

Rishi Bommasani:Yeah.

Chris Potts:It's true, even when you reflect on the RoBERTa paper. Right.

Rishi Bommasani:Yeah, exactly. So I think it seems hard to believe that there are not benefits just from multiple epochs on the same data, but I don't think anyone has tried that at the large scale yet.

Chris Potts:So data size, model size, and just the sheer amount of time you train are all independent axes to explore.

Rishi Bommasani:Yeah. Mm-hmm.

Chris Potts:I want to talk about foundation models, and Dhara put a nice reminder in the chat about what that term means, and I'll ask you to talk a bit about it, but you made an aside about the environmental cost. I want to put you on the spot a little bit. I formulated a very careful question here:

Some industries and disciplines are on a path to having a smaller negative environmental impact and some are on a path to having a larger one. Where is NLP along that continuum?

See, I'm being careful, because I'm not saying that it's bad or good.

Rishi Bommasani:Sure.

Chris Potts:Everything we do has costs, but I do have an optimistic feel that some industries are getting better, that is, reducing their footprint relative to what they achieve, and some are getting worse. So, where does NLP fit, and foundation models is going to be some kind of inflection point, I'm guessing, here.

Rishi Bommasani:Yes. I think we might be getting worse, but I think, first of all, the reason I think we're getting worse might be different than what other people think because, certainly, the largest models we train require a lot. Now, I guess we could be a little bit precise here about whether we mean carbon emissions or just energy expenditure, because a lot of them try to run on pretty clean energy resources.

Chris Potts:That could be part of it. We can do that, and then that implies centralization, I think, otherwise, you won't get the savings you could have. And that relates to foundation models.

Rishi Bommasani:Right. And I think we can change this quite a bit. Let's focus on academia or on the research community, because we have more visibility into it, perhaps. I think, there, probably, if I were to guess (I don't exactly have the numbers to back it), more of the energy usage is from all of the different labs fine-tuning things a ton of times, fine-tuning all of these models. It's probably not in the one-time cost of training one model. BERT is trained one time, if we talk about an artifact that has been used in thousands of papers, but has been fine-tuned so many times – tens of thousands of times, at least. Right? I would certainly believe that that repeated fine-tuning is where a lot of the energy expenditure is going.

And the point is, actually, just like at pre-training time, where not everyone needs to pre-train – we can use an artifact produced by someone else, hopefully – at fine-tuning time, in many academic papers, if you want to study how something changes on SST-2, or SQuAD, or whatever the dataset is, you could probably use the fine-tuned models someone else used in their prior work, as long as we store them. And we could totally do that. I think it's just not the norm yet, but I think that is one place where the reason there is more carbon expenditure is that we're just repeating work that doesn't need to be repeated. It'd be faster and better for the environment if we just didn't repeat that work and used the fine-tuned model for whatever analysis, whatever comparison you want.

So I think that, to me, seems like potential if we can get the field to buy in and do that reliably. And the other is ... I'm positive because the field has ... I think I actually learned this from you, Chris, or I think you made this point, that I think the NLP field is very good at being introspective and reflective. And I think there's a lot of people that are championing green NLP, green AI, kinds of movements, from a few different places. And I think that will drive change.

I am pretty confident that, if our field takes this seriously, it does have the will to change its practices if we can agree, "This is how we should bring more attention to efficiency instead of accuracy," perhaps. And that maybe is something that needs more priority, and then you have the people that would be willing to lead the charge to do that.

So I think, in that sense, I'm a bit optimistic that our field will ... I don't think, at the moment, it's in the category of decreasing its energy usage, but I think it can get to that category in a believable trajectory.

Chris Potts:Yeah, totally. I don't know whether my general feeling about the planet and humanity is optimistic, but within NLP, what you just described, there's so much potential there because what you described is like the last-mile problem for transporting goods and things, where you could blame Walmart because it has a lot of trucks, but in fact, the trucks are hyper-efficient and centralize a lot of the movement of the goods. And then the real pain point is the last-mile problem. And this is kind of like, in my analogy, these singly-trained BERT models or GPT-3 are like the Walmart trucks. We're paying a huge cost by all of us fine-tuning for the last mile.

Rishi Bommasani:Right.

Chris Potts:But then I was thinking, "Well, increasingly, we actually distribute those artifacts and we do it with the help of Hugging Face," and then maybe I could do my part by having more problems in a class like this, where you don't have to fine tune.

Rishi Bommasani:Sure.

Chris Potts:You can download a pre-trained model. And then interesting part is what you do after that. We have a great question here from Nicholas in the chat. I just want to ask this now. This is really nice because this might come up again. Nicholas is saying, "Well, the writing is on the wall that these BERT models are going to be important and foundation models, in general. Where can we, as people who aren't at the big tech giants, make contributions, given that we don't have infinite wealth and infinite people?" Are you looking for areas like that actively? Is that kind of question on your mind?

Rishi Bommasani:I think so. I think I'm definitely interested in this question. First, if we want to do research on the same things – if that's what's interesting, and it's certainly interesting to some people – I think it is important for us to figure out how to do that research. If scale is an important part of the research agenda, we should figure out how to do that.

But putting that aside, I think it's definitely the case that a lot of the work for making these models work well for specific tasks, or understanding how they work, or the social concerns, or whatever, all seem like problems where we're not competing. I think, of the questions that are at least presently in the field of NLP, only a few of them require, in order to answer the question, that you train some massive thing. If you look at the vast majority of papers at ACL, they're written by people who aren't training some massive model. There's one BERT paper. There's not 500 BERT papers.

So I think the vast majority of the questions in the field are not tethered to scale in an important way and, I think, can be pursued in all of these ways. I think, also, these models sometimes allow us to answer or ask some really interesting questions that are new. For example, when I came into the field in around 2016, it was really hard to get really good word-level representations. This is something word2vec and GloVe helped with, but there was still a lot to do. If you looked at the number of papers submitted to the semantics track of ACL, I'm sure it was quite high, with papers trying to do this – build better word-level and sentence-level representations – which is cool. And I think it's still important and I don't think it's solved, by any means, whatever "solved" means.

But one thing these new models let us do is ask more questions at the pragmatic level, or the discourse level, or other levels of linguistic abstraction, which I think is pretty exciting. To be fair, there were people working on it in the past and you could ask questions, but if the models are just abysmal, maybe the answers are not that interesting. So I think that's a pretty cool thing, is that I feel like syntax and semantics get a lot of the hype in NLP for work on linguistics, but I think we are seeing other parts of linguistics get more attention for work of this type. And I think that's cool.

Chris Potts:That's what I say to my linguist colleagues. This is the most exciting moment ever! Previous moments were boring for you in NLP. You hear this cynical message that this is the worst moment for this because, I don't know, it's all engineering, but no. You can ask more sophisticated questions than you could ever ask before about language in terms of these models. And all these questions of poverty of stimulus, all these things newly on the table, newly controversial and exciting. Yeah, absolutely.

Rishi Bommasani:That's right.

Chris Potts:That's so cool.

Rishi Bommasani:We see work these days on garden-pathing effects or other pretty complicated linguistic effects. I guess you could ask some of this of a trigram language model, but I don't think there would be too much to understand there. So I think that's pretty cool.

Chris Potts:And that's giving rise to one of the most exciting sub-communities of ACL, which is computational psycholinguists who are figuring out, "Okay, I can ask a human question, but now using this model, and the advantage of the model is I can do brain surgery on it 100,000 times or I can change the environment in which it grew up and see what happens to its bias for producing some modifiers," or whatever, things you could never do with humans. As you said before, with an n-gram language model, this would've been apples and bananas, but now it's like, yeah, this seems like a reasonable thing to be making hypotheses about. I just love that.

Let's talk a little bit about the foundation model stuff. So that came after Mistral and it's along the lines of Nicholas's question, which is what could we, as academics, do that might be hard at one of the tech companies? That's my read on at least one of the initial impetuses for the effort. So first, maybe just because – Dhara was right to prompt about this – maybe say what you mean by a foundation model and then we can talk about the paper and the response to the paper and stuff.

Rishi Bommasani:I think, in the paper, we provide a definition for foundation models. Roughly, the definition we provide is that these are models trained on large amounts of data that are quite broad, for some sense of "broad", and that are intended for, or used for, a wide range of downstream applications by adapting them in some way. And at the moment, we see that most of these models are constructed through deep learning with self-supervision as a class of objectives.

That could change, but I think the way I think of it, at the very least, is that that is maybe a reasonable technical characterization of these models. I think they also are maybe equally or, perhaps, better categorized by a conceptual shift in how we think about things or a sociological shift in how work in AI has been done. We have this general infrastructure that is useful and, now, let's think about how we can adapt it for different things and do different tasks or achieve different purposes using this model as the artifact that we're adapting.

Chris Potts:And a new name was needed because ... I'm a big believer in this, but you fill it in. Why couldn't we just call them language models?

Rishi Bommasani:Well, language model – it's not only about language. The analogy I like is with computer vision: I think we often think of deep learning as having been popularized in vision, with ImageNet and AlexNet. Chris Manning often points out it first happened in speech, but I think, in the popular narrative, we associate deep learning's recent resurgence as beginning in vision. And in the same sense, I think we associate these models with language at the present time, but in due time, I think we'll see them as belonging to machine learning. It's not just about NLP and not just about language, but language will be the place where it was pioneered, and we'll see it brought into other modalities. And I think we're already seeing this.

Chris Potts:Yeah. So DALL-E is a "language model". Codex for code is a "language model". That's helpful. Got it.

Dhara Yu:Yeah. We actually have a question from a student, Annelle. The question is about multilingual NLP/NLU broadly. Maybe in the vein of these giant foundation models, how are researchers making efforts to make sure that NLU/NLP advances are occurring across many languages as opposed to just English? And maybe that was actually a topic that was covered in the foundation model report, if I remember correctly. But yeah, that's the question.

Chris Potts:Yeah, great.

Rishi Bommasani:Yes. I think that's a great question. So actually, yes, first of all, in the report, Isabel Papadimitriou and Chris Manning have a nice section on language that specifically thinks about multilinguality and language variation. I think that's a very important topic here.

I think there's a few things to say. First, I think we have seen this interesting trend towards multilingual models. There's the multilingual BERT model, and there are many others that have since come out from different places, where you train on the union of data – of text – from a variety of different languages. This seems to be pretty reliably useful, in the sense that this is often better for low-resource languages, for the performance on downstream tasks for those languages, than if you trained in a monolingual fashion only on those languages. The jury's still out on how this compares to training only on a typologically similar class of languages to the language you're trying to study.

There's sometimes some benefits from languages that, even from a linguistic perspective, might be quite far away. The reality is we have very different amounts of data for different languages, and other linguistic resources are also clearly not evenly distributed. So I think that is a good thing, that we're seeing this positive transfer, and that this might help allow us to build better NLP systems for languages, where we, at present, don't have as many resources. I think that's great.

On the other hand, I do think a fair criticism is that a lot of the attention is placed on English. I think this is not really unique to foundation models. I think this is just a trend that we have seen in NLP for a while and thought about and would like to change.

One thing that we are seeing is that there are a lot of things that, for some languages, when you juxtapose them with English, make it much harder to do NLP for that language. One is the volume of data. In particular, supervised data for many languages is, at least in academia, far less available than for English. Relaxing the requirements to only require unlabeled text might help. Another thing is that standard evaluations aren't there for all languages yet, especially at the level of nuance that we have for various fine-grained NLP tasks in English or in other higher-resource languages. One thing I do like about foundation models is that they allow us to reallocate attention in designing evaluations, perhaps away from building really giant evaluation datasets to building interesting evaluation datasets, where we pose the problem in a few-shot manner.

If you compare to datasets like SQuAD in NLP, there are around a hundred thousand examples in SQuAD, most of which are allocated for training. Then there are test sets that have a few thousand examples. You could imagine a future where we design evaluations that are much more multifarious in what we evaluate for, and maybe much more fine-grained, because we don't have to take as much effort to build a massive training dataset for each downstream task we'd want.

So I think that has important ramifications for the multilingual setting, in the sense that it might make it easier to study all of the things we are interested in for many languages. I don't think we're realistically going to find the resources to get massive training datasets for every task in every language we want, but we might be able to build just the evaluation portion and pose the task in a few-shot way. That might let us make the kind of goals we'd have for multilingual NLP more realizable, by having the evaluations you want. There have been a lot of cool efforts in the past year to build evaluation datasets for more typologically diverse languages and pursue a larger collection of languages in this multilingual way.

Chris Potts:Yeah, Rishi, since the title of the paper is something like "On the opportunities and risks of foundation models", I just want to single out one of the opportunities you just highlighted there. This is relevant for student projects in this class. Ten years ago, if you came to me and had some interesting, creative new problem, I almost had to stop you in your tracks when I asked, "Well, how are you going to get 10,000 training instances, and then another thousand for assessment or whatever?" That would be the table stakes for you even getting to ask the question in the context of this course.

Now, I feel like I can perfectly well say, you need a hundred, or ideally a thousand, assessment examples. Divide that up into dev and test, be responsible as a scientist, but you don't need training data anymore. In fact, part of your research question could be seeing whether you can survive without training data. That should open the door to doing much more than just SQuAD or NLI or all these usual tests. Because, if you speak Turkish, come up with a few hundred Turkish questions and now you can develop a few-shot question answerer for Turkish, for example.

Thank you for that question. That was a great direction to head in, especially for people thinking about projects.
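
A minimal sketch of that kind of evaluation-only project setup, assuming the Hugging Face `transformers` generation pipeline; the model, prompt format, and exact-match metric are placeholders (a real project would swap in a much larger model and its own dev/test split of hand-written questions).

```python
# Sketch: a few hundred labeled questions, no training set, scored by
# few-shot prompting plus exact match.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in for a larger model

FEW_SHOT_PROMPT = (
    "Q: What is the capital of France?\nA: Paris\n\n"
    "Q: Who wrote Don Quixote?\nA: Cervantes\n\n"
)

def few_shot_answer(question):
    prompt = FEW_SHOT_PROMPT + f"Q: {question}\nA:"
    out = generator(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"]
    # Keep only the first line of the continuation after the prompt.
    return out[len(prompt):].strip().split("\n")[0]

def exact_match_accuracy(eval_set):
    """`eval_set`: list of (question, gold_answer) pairs, e.g. a few hundred
    hand-written questions split into dev and test."""
    correct = sum(few_shot_answer(q).lower() == a.lower() for q, a in eval_set)
    return correct / len(eval_set)
```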

Let's return to that report, "On the opportunities and risks of foundation models". How did you get the job as the lead author on that massive tome? Did you volunteer or was it thrust upon you?

Rishi Bommasani:It's a story, again. Maybe I'll keep this one a bit shorter. If you recall, there was this Mistral thing where we tried training models. That happened starting in winter and continued on for a while. In parallel, what we learned as we were training models is that there are a bunch of people at Stanford interested in these models, from a lot of different places. I think this is one other thing I'm very proud of, even though it wasn't very intentional, is that we built this community, which is now called CRFM, that brings together all these people from different disciplines, thinking about the same underlying models for very different reasons, but they will still think about the same model.

We had this community, per se, or at least we had this giant list of people – there's a mailing list or something. We internally knew we had a lot of expertise. If I recall, the very first suggestion was from Percy: for each of the different areas – there are a lot of different areas; if you read the report, there are 26 sections – to just write a so-called one-pager. It did not turn out to be one page.

Chris Potts:No one did one page. That was the first bit of feedback everyone had, I'm guessing.

Rishi Bommasani:Yes, absolutely. He tried to keep it going anyways. Each section wrote its one-pager, thinking about the problems, how they're connected to foundation models, what it is they were thinking about. It was partly a survey, but more so, I think, one of the goals at that point was just to understand what everyone was thinking, what we understood, what we didn't understand. I think at that point, because we were writing all of this, it seemed like we would eventually write it into a paper, but it wasn't clear in what form yet. And then slowly, over time, as people put more effort in, it became clear that this should be something that merited a paper. I think from the onset, I was involved in thinking about different parts and seeing how the different sections evolved.

It became clear that it should be a paper, that we should take it seriously and think about what it means for it to be a paper and how we can have a uniform standard for what we're going to talk about in each of these sections and link them all together. I think that was a key thing that I did, with a lot of help from a lot of people, is really try to find all of these interdisciplinary links. Because, I think that's the point, if you have such a multidisciplinary community, is to actually capitalize on that, because the links are definitely there. Clearly, all of these things are interlinked. It's more so a question of, can we pull them out and make them salient? I think that was a very cool part of the paper. I'm not sure in the history of science, there's that many papers that have such a range of disciplinary representation in the paper. That is cool, to see all of these different people coming together and talking about something.

Chris Potts:Oh, totally. We should talk about the Twitter thing, because that's where it first landed. Many aspects of this on Twitter were frustrating. One of them was the repeated line that "the whole Stanford CS department" had gotten together and written a paper, which wasn't true in either direction.

Rishi Bommasani:That's not true at all.

Chris Potts:Importantly, huge numbers of people with no CS affiliation.

Rishi Bommasani:Exactly. I think a fair critique we see in computing – not a lot, but repeatedly – is that computer scientists take on some problem in a different discipline, don't really know what matters in that discipline, and just do something. That totally feels like a fair critique, but that's exactly not what this was. We have an economics section written by economists. We have a medical section written by multiple medicine faculty. There are students in medicine and students in AI all across the board. We had the HCI section with colleagues in HCI. I think that was pretty cool. Whether Twitter appreciated it or not was a different question.

Chris Potts:I was the lead author on the philosophy section, so you're making me really glad that we had some actual philosophers elsewhere on that!

Rishi Bommasani:I remember when Thomas Icard, who's a philosopher, joined the section. I was like, "Wow." I think that was a repeated observation to me. It was really cool that all of these people were interested in these things and wanted to be involved. I don't know, maybe I have some pessimism sometimes. It seems like AI people do whatever they are doing. It's cool that other people wanted to be involved and that we could actually do it together and work on it together. The philosophy section is a perfect example.

Chris Potts:I think I begged Thomas to join. And John Etchemendy, at a certain point, was like, "I'm not really contributing enough." I was like, "John, please don't leave us. Please don't leave us. We need philosophers."

Rishi Bommasani:Excellent. Yeah. I think it was very exciting to see it all evolve.

Chris Potts:Sure. You spent a whole summer, in part, I guess, wrangling all these professors and students to produce this thing, which, miracle of miracles, they did. I would've predicted that thing would be years delayed. Already, this is some magic that you worked to get everyone to actually do this on schedule. Then it hit Twitter. I was there with you on Twitter, watching and trying to respond. It was a tough experience for me. I don't know what it was like for you, because I think on Twitter you hear mostly the negative voices and the positive ones are, possibly, just downloading the article. You could talk about that, or you could just switch gears and talk about this blog post that you wrote, which is partly a retrospective and a reflection on the process. What's on your mind?

Rishi Bommasani:I think it's the only paper I've ever written that has had overt negative criticism of that kind, which is interesting. I think I took it decently well, but it was definitely interesting in the moment. I didn't totally understand some of it. I certainly appreciated you and other people pointing out all of the ways it was a great thing that we were doing. That was cool to see. I think on Twitter, there were a variety of different critiques. I think some of them – and this is part of why we wrote this reflections piece – I think they were maybe instinctive to the concept of what we were doing, rather than being substantive engagements with the actual text that we had produced.

I'm not really sure. I think that there are a lot of reasons. I think Twitter is a hard place to navigate, because a lot of people are thinking things that they're not exactly saying. It's at times a strange way to have scientific discourse. I think it was largely good and well received, even if a lot of the positive sentiment, as you said, was not openly displayed.

We wrote the reflections blog post, thinking about we saw a bunch of different responses. How can we organize it and then think about, what do we actually think? Where, perhaps, are we being misrepresented? How can we describe what the agenda for the paper was and how we're thinking about things going forward?

I think that was very useful in clarifying a few things. For example, I think there are concerns that the paper doesn't discuss, say causality or symbolic methods or some other things. I think that was by intention, because the paper is not a survey of all of AI or all of some larger, super set of things. It's very specific to these models. I think that was one of the things I really tried to push for throughout the different sections, is a connectedness to these models rather than just expansively talking about all of AI.

I think that was just a misalignment, in terms of what maybe people thought the paper should describe and what it actually was setting out to do. One of the things that frustrated me, to be honest, was, I think the society section of the paper, thinking about the harms or societal implications, is actually very thorough and very incisive in criticizing the paper, in parts.... Not the paper, just criticizing these models, in part. I don't know if that was really appreciated. I think the blog post also provided us this opportunity to describe what the harms are, what also maybe some of the framing that we think is important.

For example, I think there's language sometimes used suggesting that the paper, or CRFM in general, projects these models as inevitable, which I think is an interesting choice of word for things that already exist. I think what we think is not inevitable is the trajectory of how they're developed. For example, we might object to the way the datasets that are used to train these models are constructed, and the lack of carefulness and attention in the construction of data, and the selection of data. I think that is something about the trajectory and how future models will be built that we can object to and take as not inevitable.

We made some of these points, if lightly: part of the reason for writing the paper is that we want to change, in certain ways, how the space is unfolding – and, most importantly, to let other people see how we're thinking about it, respond to it, and make whatever decisions they want. I think that was actually very important.

Dan Jurafsky had a very nice point, which is that the entire premise of what we did is very unusual, in the sense that, across many scientific paradigms, most of them are not described in a paper. If they are, they're described at the end of the paradigm. There are some, like deep learning, that swept over AI. Now, maybe, or around 2018 or so, people will write about deep learning and the impact it had, or something like that.

We're trying to do the opposite. Some change is happening. We're in the middle of it, or in the early stages perhaps. We're trying to describe what is happening and think about it. I think that's a pretty strange position to be in. I think we want to really think about the paper as helping to orient people to the space and think about all the different considerations and help them navigate how they want to think about different things.

I think it makes the paper harder to read, because it's not replicating an established template of how to think about things. I think that's also what makes the paper quite interesting: it does what we thought was right.

Chris Potts:Let me share one thought with you to get your own reflection, perhaps. One thing that was surprising and illuminating for me is the systemic aspect of this. Which is, when we were doing the paper, I was thinking of us all the time as the counterbalance to the large tech companies. We're a small group. Even though we're well funded, our funding is a drop in the bucket compared to what groups at Google and places like that have.

So we're using our institutional force to balance out what's happening with those big tech companies. You could see on Twitter that that is not the public perception of Stanford. Stanford is lumped together with those big tech companies, as an institution that benefits from those companies. And we've benefited, in turn, from foundation models, and so this was just power cheering on power. That might be partly why the critical aspects, even in the title, got overlooked, because of this institutional thing that we are just like Google and Facebook. What do you think of that? Did you feel that at all?

Rishi Bommasani:I think that's definitely right. I definitely felt that sentiment underpinning a lot of the criticism, or specific types of criticism, we got. I don't exactly know what I think about it. First of all, I think it's certainly true that, from a variety of perspectives, we might be much closer to Google than to perhaps other less resourced academic institutions. I can totally buy that point. Certainly there are ways in which we're connected to Google, and not just Google, but many of these big tech companies. There are people that have been at both places and so on. I think that's fair. I think it is, structurally, a criticism or a concern you might have.

I think one thing we could maybe have done better is to clarify or acknowledge that the paper is not really a byproduct of any funding or anything like that – maybe people were concerned about that and we could have said so.

Chris Potts:Yeah, definitely.

Rishi Bommasani:Yeah, we could say that. Beyond that, I don't know. I think the paper is a collaboration of a lot of students and faculty. I don't think the students are any different from students anywhere else. I think we're just thinking about interesting problems in our fields and writing about them. I think the faculty are no different really from faculty at other places. I think I can get the perception and I can understand it. I don't think it's appropriate or really is criticism we have any way of engaging with.

Chris Potts:I guess we're in agreement that there was something to this. Because, even though I think of myself as independent – I don't feel especially influenced by anything that Google would want of me or anything like that – it is true that we have all these ties with students, students getting jobs, money. So I feel like the best I can do is still meaningful, which is to just try to act in a way that is consistent with not being encumbered.

I guess the frustration was that I kind of thought that the paper already did that, because it was not a glowing picture. As you say, the picture was more like, this is inevitable because they exist. We've seen the good, and that alone might mean that we don't want to abandon these models, even if there are some harms. We were just very forthright. I guess we just have to keep signaling that, because of the systemic aspect of this.

Rishi Bommasani:That's right. I do think we have a responsibility to do that. We have a colleague from political science, Rob Reich, who talks vividly about this – about keeping it separate from industry and having unadulterated critique of these models and so on. So the actual text we wrote, I would definitely stand by.

So, to an extent, people have read that text... I think one of the frustrations is that there's very little critique that seems to actually invoke any specific language of the paper. That would be the standard scientific discourse, right? There's some concern or issue with the text, and then we talk about it. Any of that would have been great. But a lot of the critiques seemed to have other agendas and be on other topics, so it's hard to engage with. It's also hard to engage on Twitter. It's definitely not the medium for certain types of discourse that require a lot of contextual knowledge and so on, and I don't think we can just explain all of that on Twitter.

Chris Potts:That's wonderful. Here, so for the whole group, students out there, I have one more research question I could ask Rishi and then I have a bunch of fun questions, but we would love to hear questions from you. So, if you've got questions on your mind, get them into chat to Dhara and we'll try to get them in this queue, because I'd love to hear from you, and I'll fill in the gaps.

We have one great question. Dhara, is it this one about the Twitter? Yeah, this is perfect.

Dhara Yu:Yeah. Yeah, so we actually, thematically, have a question about Twitter from Sylvia. So, the question is to Rishi. "As a researcher, how do you deal with these negative comments with respect to your work, especially those that are streaming in by the dozen on Twitter?"

Rishi Bommasani:I think the first option is to just not encounter them – most work doesn't encounter criticism. But when it does, I certainly learned from Chris and Dan and many others that one should always be gracious with the criticism.

Chris Potts:Rishi, I was going to say that I've learned... I was impressed by how measured you were throughout all of that, so I think you're a great example here. Yeah.

Rishi Bommasani:Yeah, I think the other thing is, as Chris was just saying, the criticisms are often strange in the sense that the text doesn't clarify why the person is stating the criticism, so you often have to infer certain things, and sometimes you infer them incorrectly. But to some extent it's just: you try to ignore some of the criticism, especially if it doesn't feel productive. I don't necessarily have any great wisdom there other than trying to overlook it or ignore it, because some of the criticism just felt below the belt, or whatever, and didn't really seem very appropriate, but I guess that's par for the course on Twitter.

This is not a strictly scientific platform, and it's very strange that we conduct science on Twitter. It's definitely not a platform designed for science exclusively. But, yeah, there were a couple of instances – at least in my case, often because I know the person who was criticizing us – where there was some way to move the conversation to something more productive, or to try to understand why they were criticizing us.

So, I think it's hard to do that on Twitter. It's much harder, in my opinion, than if you were talking to them at a conference or in person, where I think people are generally more civil, perhaps, in certain ways. But that's actually the one piece of advice I maybe have: try to find those opportunities to actually make something productive out of the criticism. Most criticism on Twitter just goes unanswered, which usually makes sense, but we tried to find a couple of places to engage with the person providing the criticism, and it seemed like we made some progress. I think those opportunities are rare, but they're things to try to make use of.

Chris Potts:But, Rishi, my question for you on this theme was going to be, is it smart for a junior scholar such as yourself to be on Twitter at all? And, I guess you could think as a reader/lurker, as a broadcaster, or as a replier.

Rishi Bommasani:Yeah.

Chris Potts:Those could be very different roles!

Rishi Bommasani:Okay. I think it's very dependent on the person, and more so on one's personality and so on. I think it is possible to cultivate a set of people you follow on Twitter such that most of what you see is positive and reflects good academic decorum, and I think that can be great. It can be, in that sense, a great way to engage with people, or at least see what people outside of Stanford, let's say, are thinking. I think that part is nice. Another part, and the part I like the most about Twitter, is that it's a way for me to see things happening in adjacent disciplines that I wouldn't see here. Right?

If I sit here in Gates, maybe I'll talk to my colleagues in other fields, but in general I won't know that, "Oh, this is happening in computational neuroscience." Or, "Oh, there's this cool work that uses GPT-3 to predict neural activity in the brain from Eve Fedorenko and colleagues, and it's something interesting at the intersection of NLP and neuroscience." These are things you generally don't see if you're only thinking about NLP, or in my case computer science, but it's fun to see. And then, finally, the other main thing I do on Twitter is just signal boosting friends and colleagues. If someone is doing cool work and has a new blog post or something, I can retweet it, and that seems like a nice thing to do, and maybe it helps them get their work noticed.

So, yeah, maybe I would suggest not being a highly invested Twitter user. I think that can be both hard to sustain and emotionally taxing, perhaps. But I feel like Chris Manning is someone who sometimes has some really interesting takes on his account, and the Stanford NLP account posts all kinds of things. So it's cool to see him do that, but I don't know if everyone can do that. It's impressive.

Chris Potts:I always feel conflicted, because as a reader of Twitter, I love when people do those tweet threads on a new paper where they summarize it in 10 tweets with pictures, because I learn about the paper and can remind myself about it later. But I feel too self-promotional doing it myself. I want it from other people. I really feel like I should get over this, and I just end up encouraging people: "If you have a new paper, do a little tweet thread about it." I don't know that it actually increases your citations, but it is just illuminating to see what people are doing, and these little threads often have their own creative aspects.

Rishi Bommasani:And I think it's fun too, because you can do two things that are hard to do in the standard paper. One is you can mention things that you wouldn't mention in a paper, about the research process or things you found exciting – sometimes they'll have some little insight that wouldn't really go in the paper but says something about what they learned, and that's fun. The other is, as you were saying, it's just a different content format, and I think it's nice for having your work get seen – even if it doesn't manifest in citations or whatever, just seen by people from outside your main scholarly community.

Chris Potts:I mentioned the citation thing just because I don't think people should feel pressure. It's not like it's going to impact your work's uptake – I don't think we have much evidence for that – but it is fun and energizing to see it.

And, that reminded me too, all the cute phrases that I make my students cut from our papers because they don't sound scholarly, we can put those into these threads.

Rishi Bommasani:Right, exactly.

Chris Potts:Hey, Dhara, you have a student question to ask?

Dhara Yu:Yeah, so I have a question from Gabe. So, the question is, "You talked about some of your work making it easier to train foundational models for actors new to the field, but what if some of those actors are bad? Especially considering the future risks of these models, do you have thoughts about how or to what degree we should control access to this technology?"

Rishi Bommasani:Yeah. I think this is a pretty important question that I, at least, don't feel like I have a clear answer to, but I'll tell you what I've thought about on this topic. So, first, for misuse: there could be a variety of forms of misuse. With language models, people often talk about misinformation or spam or other types of things that make sense for text at least – it'll be interesting what people do for speech and so on – but for text, at least, it seems like we have some reasonable taxonomy of the ways in which a language model can be misused.

But then the next set of questions – our colleague Shelby Grossman at the Internet Observatory here has done a lot of excellent work on this – is thinking about: is a language model actually useful for misinformation actors relative to whatever other options they're considering? That requires some knowledge of how misinformation actors or other malicious actors work. Is it favorable, from their perspective, to use this over human-generated misinformation and so on?

So, I think that's important if the question you want to answer is: what will be the realized misinformation harms of a language model being released openly? I think we're still learning that; it's the very frontier of research on misinformation. What I do think is clear is that access is not binary at all. Certainly there are two extremes – no one can access a model, or anyone can access a model in any way they like – but my opinion, at least, is that the policy, if you were going to choose to release a model, should be somewhere in the middle, and there are some useful strategies for that, largely adopting methodologies that are current in other fields.

So, for example, in software engineering we see this staged release methodology, where you incrementally expand the circle of who can access something and you prioritize people you have some other means of trusting, right? Certainly you can imagine releasing a model to researchers who are known in the field and then expanding from there, and I think this can be a good way to offset some of the risk and at least understand certain types of harms or identify risks. The consequences of different early strategies on what happens longitudinally are not understood at all, but I think we can at least make reasonable decisions or policies that involve non-zero release while still trying to be safer than just arbitrarily giving it out to anyone. At least, that's the policy I would favor. So, yeah, what I'm thinking about is things that are in the middle: what does it mean to be in the middle, and what trade-off are you trying to achieve?
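As a minimal sketch of what a staged release policy like the one described above could look like in code: everything here – the stage names, the schedule dates, and the trust signals – is a hypothetical illustration for this transcript, not an actual policy from the paper, CRFM, or any model provider.

```python
from dataclasses import dataclass
from datetime import date
from enum import IntEnum


class ReleaseStage(IntEnum):
    # Hypothetical access tiers; each stage widens the circle of who can use the model.
    INTERNAL = 0            # model developers only
    VETTED_RESEARCHERS = 1  # known researchers with an approved proposal
    GATED_API = 2           # broader access behind identity checks and rate limits
    PUBLIC_WEIGHTS = 3      # open release of the weights


@dataclass
class AccessRequest:
    requester: str
    affiliation_verified: bool
    proposal_approved: bool


def current_stage(today: date, schedule: dict) -> ReleaseStage:
    """Return the widest stage whose scheduled start date has already passed."""
    reached = [stage for stage, start in schedule.items() if today >= start]
    return max(reached) if reached else ReleaseStage.INTERNAL


def grant_access(request: AccessRequest, stage: ReleaseStage) -> bool:
    """Earlier stages demand more trust signals before granting access."""
    if stage >= ReleaseStage.PUBLIC_WEIGHTS:
        return True
    if stage >= ReleaseStage.GATED_API:
        return request.affiliation_verified
    if stage >= ReleaseStage.VETTED_RESEARCHERS:
        return request.affiliation_verified and request.proposal_approved
    return False  # INTERNAL: no external access at all


if __name__ == "__main__":
    # Hypothetical schedule: the circle expands roughly every six months.
    schedule = {
        ReleaseStage.VETTED_RESEARCHERS: date(2022, 1, 1),
        ReleaseStage.GATED_API: date(2022, 7, 1),
        ReleaseStage.PUBLIC_WEIGHTS: date(2023, 1, 1),
    }
    stage = current_stage(date(2022, 4, 11), schedule)
    request = AccessRequest("external_lab", affiliation_verified=True, proposal_approved=True)
    print(stage.name, grant_access(request, stage))  # VETTED_RESEARCHERS True
```

The point of the sketch is only that "somewhere in the middle" can be made operational: access decisions can be conditioned on time and on trust signals rather than being all-or-nothing.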

Chris Potts:Dhara, do you have another student question or should I dive in?

Dhara Yu:Yeah. Yeah, we have another student question, so this is switching gears a little bit, but nonetheless interesting. So, another question from Sylvia, "How should researchers better explain their work to people outside of their fields? What are some challenges you faced working with people not from your domain?"

Rishi Bommasani:Yeah. I think this is a huge problem and a thing we don't really educate anyone on: how to do scientific communication well, both within the research community but beyond your own field, and even further out, helping lay people understand scientific endeavors.

I have learned a few things. You definitely want to build a shared lexicon quickly with whoever you're talking to and simplify the language wherever possible. One of the key ideas here, at least in my mind, is that you should tolerate some imprecision in characterizing whatever you're describing so that it's clear to the other person, even if it's a simplified model or not exactly right. I think that's important, and once it seems like they've built the substrate of how to think about it, then you can sharpen it and add in the details if you have time, if you're talking to them for an extended period.

I think that's one piece. The other piece I would maybe add is that, for example, if we're talking about foundation models, you can view them in so many different ways. Right? You should pragmatically prioritize what the other person is interested in – I feel like sometimes people don't do this too well – and orient your way of describing things so it makes sense for them. The way I talk about a foundation model to an economist is fundamentally quite different from when I talk to someone in HCI, right? What matters and what I want to prioritize is different. I think that's pretty straightforward – it's something we do in life anyway – but it's also something we should do in science.

Chris Potts:Okay. So, we're almost out of time but I have two questions, one of which I want to ask all the participants. Dhara, are there any student questions, though, ahead of me in the queue?

Dhara Yu:No, go for it.

Chris Potts:Okay. The one I want to ask of everyone that I'm doing this series with is: have you seen this wager that Sasha Rush made, "Is attention all you need?" Here, I'll put the link in the chat, people can check it out. The bet is on the question, "On January 1, 2027, a Transformer-like model will continue to hold the state-of-the-art position in most benchmark tasks in natural language processing." And, they've fleshed out some of the vagueness around that claim, but you get the gist of it. So, Rishi, are you yes or no on this bet? Is attention all you need?

Rishi Bommasani:Yeah, I've been thinking about this since you told me you might ask me this.

Chris Potts:You can only give so nuanced an answer because you've got to put your money on one side.

Rishi Bommasani:Yeah, yeah, yeah.

Chris Potts:But, give me one.

Rishi Bommasani:So, I'm going to side with Sasha – I guess we share Cornell in common.

Chris Potts:All right, that's one for no. Sasha, who is at Hugging Face, the company that is centered around Transformers.

Rishi Bommasani:That is absolutely right!

Chris Potts:He's also an academic, though. Maybe that explains his position.

Rishi Bommasani:Yes, and maybe I'll say a little bit there.

Chris Potts:Yeah. Now that your money is on the table, you can add nuance if you want.

Rishi Bommasani:I'm going to say I'm against it – and I asked Sasha about this on Twitter too – not for the reasons he's against it, but for a different class of reasons. I think we're going to see, beyond the whole architecture question about Transformers, that what it means to be a benchmark in the field is going to change a lot.

Chris Potts:Good, good!

Rishi Bommasani:And what we choose to elevate and think of as the central benchmarks in the field will be different, and I think that, among other things, benchmarking for more than accuracy will already mean that Transformers are maybe not so monolithically dominant.

Right? If we start thinking about benchmarking as some mixture of robustness, accuracy, fairness, and efficiency or whatever, it's probably already true that Transformers don't hold that state-of-the-art position so monolithically, and that's one reason I think things will change. I also think our distribution of tasks will naturally shift toward things that Transformers work poorly on, because that's often a motivator for how benchmarks get designed, and, independently, I think we'll elevate more dimensions, so it'll just be way harder for anything, including Transformers, to have such dominance over leaderboards. So, for both of those reasons, I'm against the motion, I guess.
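To make the "benchmarking for more than accuracy" idea above concrete, here is a minimal, purely illustrative sketch of a multi-metric leaderboard score. The metric names, weights, and numbers are all hypothetical and are not taken from any actual benchmark or from the paper.

```python
# Hypothetical weights over per-metric scores, each assumed to be normalized to [0, 1].
METRIC_WEIGHTS = {
    "accuracy": 0.4,
    "robustness": 0.25,   # e.g., accuracy under perturbations, higher is better
    "fairness": 0.25,     # e.g., 1 minus the worst-group performance gap
    "efficiency": 0.10,   # e.g., normalized inverse of inference cost
}


def composite_score(metrics: dict) -> float:
    """Weighted average across the metrics the leaderboard chooses to value."""
    return sum(METRIC_WEIGHTS[name] * metrics[name] for name in METRIC_WEIGHTS)


# Made-up systems and made-up numbers, just to show the mechanics.
systems = {
    "transformer_lm": {"accuracy": 0.90, "robustness": 0.60, "fairness": 0.65, "efficiency": 0.30},
    "small_baseline": {"accuracy": 0.78, "robustness": 0.70, "fairness": 0.75, "efficiency": 0.95},
}

# Once the non-accuracy axes carry weight, the leaderboard ordering can flip.
for name, metrics in sorted(systems.items(), key=lambda kv: composite_score(kv[1]), reverse=True):
    print(f"{name}: {composite_score(metrics):.3f}")
```

With these invented numbers, the smaller system ends up ahead of the higher-accuracy one, which is the mechanical version of the point being made: dominance depends on what the benchmark chooses to measure.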

Chris Potts:Nice. Okay, and my other one – I think this is an optimistic way to end, but it could be negative – it's a question I warned you about – which is just: throughout AI, we're constantly congratulating ourselves now about how much progress we're making. You hear it all the time, "Progress is incredible."

But, if you encountered a hard-nosed skeptic from outside the field who was like, "Yeah, okay. I've seen that your models can blather semi-coherently about the world. I saw the cool picture of the astronaut riding a giant space turtle that was produced by DALL-E, but what have you actually accomplished? And, I've also heard", let's say, to make it harder, "about all the negative things, the deep fakes and the synthesized text spreading disinformation." So, Rishi, why are we constantly congratulating ourselves?

Rishi Bommasani:I definitely think we congratulate ourselves too much, so that's something. But I will say I am positive, because I think there are a lot of real-world use cases, commercial or not, that don't really require the kind of unbelievable sophistication we sometimes criticize AI for lacking. There are a lot of practical things that would be beneficial, or that we should at least experiment with, where current technology should be able to help in a useful way. That's one thing: there's practical utility for tasks. What tasks do people actually use GPT-3 for? It's not like classic NLP – no one's using GPT-3 to do NLI for something, right? These are not super hard tasks. You want to generate new names for something, or all these kinds of things. So that's one thing: the tasks are simple, and they might benefit from some of the creativity these models sometimes display.

The other thing I'm pretty optimistic about is that the entire mode of interaction for people building applications will become lighter weight, and we'll see more of a push toward using language as the medium rather than programming languages. I think that's cool, and it opens the door for a lot of other people to do creative things, because language is so ubiquitous and programming languages can be a barrier for some people who want to develop exciting things. So that's another thing that excites me: thinking about language as an interface and seeing how that goes.

Chris Potts:Yeah. Great, I like that as a way to close. Thank you so much, Rishi. This was really informative and fun, and thanks to all the students for participating and asking questions. I think this was great. We can all sign off. I'll stop the recording here.