Podcast episode: Percy Liang

June 13, 2022

With Chris Potts

Realizing that Foundation Models are a big deal, scaling, why Percy founded CRFM, Stanford's position in the field, benchmarking, privacy, and CRFM's first and next 30 years.

Transcript

Chris Potts:I'm delighted to welcome Percy Liang to the CS224U Podcast. Percy is Associate Professor of computer science at Stanford, and I'm proud to say that he's my colleague in the Stanford NLP Group as well. Percy is also a professor by courtesy in statistics at Stanford, and he's a member of the Stanford AI Lab, the Machine Learning group, HAI, and maybe some other groups that I've failed to mention there. Most importantly, for our conversation, he's the director of the Center for Research on Foundation Models, which we're going to talk a lot about.

Percy's contributions to AI are incredibly wide ranging. For me, they're notable for the way they bring together technical ideas from machine learning, innovative uses of data, and really creative approaches to analysis. For his thesis work, he showed that we can learn semantic parsers from denotations alone, which was completely eye-opening and inspiring to me. And he's continued to make striking contributions like this along with his incredible group of students and other researchers at Stanford.

Percy, thanks so much for doing this podcast with me. I thought we'd dive in with some stuff about CRFM. We're just past the one year anniversary for the Center for Research on Foundation Models, CRFM. What are some headlines about CRFM over the last year?

Percy Liang:Yeah. Thank you, Chris, first of all, for having me on this podcast. It's a pleasure to talk about some of these things.

Before answering your question, maybe I should say a little bit more about what CRFM is. So CRFM is the Center for Research on Foundation Models, a center that I started with an amazing group of colleagues, students, postdocs, and faculty about a year ago. The genesis of the center had to do with the recent events of these large models – now we call them foundation models, such as GPT-3, DALL-E and so on – really taking off in the AI space. They were ultimately all driven by industry, and there was a feeling at HAI, the Human-Centered AI Institute at Stanford: What should we be doing about this? What can we offer as academia?

The Center was formed around this idea that we had a lot to offer, that academia is full of deep interdisciplinary expertise across not just people in AI, but also from the social sciences, economics, political science, the Medical School. And together, we could study these models much more holistically and also with a kind of attunement to think about how these models could benefit society rather than being driven by commercial objectives.

So that was the start of the Center, and there are many things happening at the Center. It's hard for me to even keep track of them all, since there are over 200 students, faculty, and postdocs involved. The most notable thing I would say was, when we launched, we put out this massive report – a 200-page report talking about foundation models, their opportunities and risks, from all of these different perspectives.

In addition, we are developing our own toolkits for training models, with an eye towards pedagogy and reproducibility and transparency. And then we have a number of research projects that are happening, which will probably come up a little bit later in this podcast.

Chris Potts:Hey, can we rewind a little bit? I'm curious: do you remember the moment when you realized these large language models were going to change everything?

Percy Liang:I remember when GPT-3 first came out. It was during the pandemic. And I just remember being kind of blown away by the things that it could do, for being just a language model. There's this idea of emergence that caught me and also, I think, many researchers, by surprise – that you can just train a language model to predict the next token on tons of raw text, and then it can answer questions, it can summarize documents, have dialogue, translate, classify text, learn all sorts of different kinds of pattern manipulation, format dates, and so on. It was just really eye-opening, I think, and made me much more open-minded about the possibilities here. And then at that point, I really knew that, well, something is up. We cannot ignore this.

Chris Potts:There are three moments for me. I remember being at the best paper session at NAACL in 2018 when ELMo won. And there was a real energy in the room, and I turned to the person next to me, who was my collaborator, and we were like, "Everything is different now." You could tell. And then I just happened to be giving a talk at NYU when the BERT paper appeared, and Sam Bowman was like, "This BERT thing has happened." And I was like, "Is that just like ELMo?" And he was like, "No, this is different, Chris, you should check it out."

And then the third moment is a GPT-3 moment. I'm a skeptical person, I guess, or at least fashion myself that way. And when I first got access to the GPT-3 API and started playing around with it, my goal was to break it. And what actually happened was, it seemed to me to be a B+ student at doing formal logic. And that really weirded me out, because I was really trying to break it, and nonetheless, it seemed to be kind of doing the pattern matching that we need of our budding logicians. And I guess that changed my mind.

Percy Liang:Yeah, definitely. I'm glad you brought up ELMo and BERT – also, let's not forget, GPT-2. Those were, I think, important precursors to this. And like you, I think I was also very much a skeptic, having worked on semantic parsing, where things are much more formal and principled in some way. And then working on adversarial evaluation, showing how models were really kind of broken in a deep way. It was with reluctance that I kind of got dragged into this, almost, in a way.

BERT and ELMo were impressive, definitely, in terms of, across a wide range of benchmarks, numbers just going up. But I think what really did it for me with GPT-3 was the in-context learning and how it was much more general-purpose than I had imagined. GPT-2 was great. It was showing how you could generate long documents about unicorns and valleys, and it was fluent. But it was really the general-purpose nature of GPT-3 that I think did it for me. I would not have started a center based on BERT.

Chris Potts:I'm glad you brought up your semantic parsing work, because I remember you used to have slides, which I found quite helpful, where the Y axis would be kind of "structure to the system", with logic at the top, and the X axis would be sort of like "how much coverage you can get?" And you'd say like, "logic has hardly got any coverage, it's way to the left on the X axis, but it's high on the Y axis. And semantic parsing nudges us over into the coverage space while still being highly structured." Is that still a framework for you to think about the world or have foundation models put a twist into it?

Percy Liang:It's definitely put a twist into it. I mean, deep down, I feel like there's something fundamentally dissatisfying and missing about foundation models – logic and mathematical grounding. I think about theory as providing a certain basis where you know things are correct by construction. Foundation models, at least the current generation, will never be correct by construction because they're not constructed – they emerge. You train them on data. It's very data oriented. So that part of me is a little bit sad. Of course, foundation models have huge implications for generation and help alleviate the sort of exponential curse of dimensionality that you face. And I think they can help semantic parsers improve.

You can think of foundation models as serving a bit like a System 1. They're really good at quickly arriving at some sort of gut-reaction answer, but sometimes they're prone to spurious correlations and false reasoning. And you really would like a System 2 that could do the rational thing. But increasingly, more and more, all the action is in the foundation model space, because there's a lot more uncertainty, there's a lot more progress to be made, and there's also a lot of urgency around harnessing this kind of raw power that we have, which is why all my attention is placed on it right now.

Chris Potts:It's interesting, because the landscape of ideas that you just presented is kind of around explainability, trustworthiness, safety. The dominant narrative, say, on Twitter, about all these models is just, how big can we scale them? Do you think that most of the technical innovations before us are about scaling or are they actually in those other areas that you mentioned?

Percy Liang:I mean, I think scale is something that you can always pull out. No matter where your technology is, as long as it's a machine learning-based system, more data and more compute will just amplify whatever you have. The whole scale thing isn't new. For ages before deep learning, Google was proclaiming itself as like, all you need is scale. At that time, scale kind of meant, well, scaling up nearest neighbors or something like that. So the scale part hasn't really been new. What's really changed is the methods. Now we have deep learning instead of nearest neighbors.

I would guess that, in 10 years, there would be something else. I think you would probably still be taking gradients and doing various things, but maybe the way that we would construct these models would be different – hopefully more modular, hopefully more trustworthy and understandable. But yeah, we would probably want to scale those too.

Chris Potts:The scaling thing is interesting to me just because that's another point where I feel my mind changing. I was sort of skeptical that scaling mattered, and now I'm starting to think that it might actually be qualitatively different to have a model that's truly massive versus one that's just large, say, by today's standards. But there are at least three axes to scaling, and I think they could involve different expertise and different challenges. One would be: more and different data. Another would be training for longer. And another would be just having a really large parameter count. How do you think about those three factors when you think about scaling?

Percy Liang:Well, first parameter count is, I think, a fairly poor approximation of the ability of a model on its own. We have these mixture of experts models now that have many more parameters, but are not necessarily more capable.

The amount of data is definitely a key ingredient, and it's all coupled to the amount of compute. For example, DeepMind had a paper about a model called Chinchilla, which was actually smaller than their earlier Gopher model and GPT-3, but it was more capable because it was trained for longer, on more data. So compute definitely has a role, and also more data. You obviously need enough scale to be able to learn the appropriate representations, but I think the three things are all kind of mixed up, in a way. When you read these papers on scaling laws, you see that when you scale any one of these, things go up. So they almost have a weird kind of symmetry, even though they're incredibly different objects.
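A rough back-of-the-envelope sketch of this coupling, using the commonly cited approximation that training compute is about 6 × parameters × tokens and the Chinchilla-style heuristic of roughly 20 training tokens per parameter; the model sizes and token counts below are the publicly reported ballpark figures for GPT-3 and Chinchilla, and the arithmetic is only meant as an illustration.

```python
# Back-of-the-envelope scaling arithmetic (illustrative approximations only).
# Common rule of thumb: training FLOPs ~ 6 * parameters * training tokens.
# Chinchilla-style heuristic: compute-optimal training uses ~20 tokens per parameter.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return 6 * n_params * n_tokens

def compute_optimal_tokens(n_params: float) -> float:
    """Rough compute-optimal token budget for a given parameter count."""
    return 20 * n_params

runs = [
    ("GPT-3-like: 175B params, 300B tokens", 175e9, 300e9),
    ("Chinchilla-like: 70B params, 1.4T tokens", 70e9, 1.4e12),
]

for name, n_params, n_tokens in runs:
    print(name)
    print(f"  approx. training compute: {train_flops(n_params, n_tokens):.2e} FLOPs")
    print(f"  compute-optimal tokens at this size: ~{compute_optimal_tokens(n_params):.2e}")
```

Under these assumptions, the two runs use compute of the same order of magnitude even though one model has 2.5 times the parameters of the other, which is one concrete sense in which parameters, data, and compute trade off against each other rather than acting as independent dials.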

Chris Potts:Oh, but couldn't we think about it differently? Because one of my takeaways is: for a fixed budget of data – for a fixed amount of data that you have – and, say, a fixed number of epochs, there's no point in scaling your model past a certain point, whereas you might want the model to be much, much larger to the extent that you've got more, and more interesting, data to pour into it.

Percy Liang:Right. I think it's complicated, because sometimes you want a larger model. Larger models can converge faster. So given a fixed compute budget, maybe you actually want a much more over-parameterized model, because that would allow you to get to a better point. Of course there are limits if you keep scaling. There's a point at which scaling doesn't help. But I think we're probably all in this regime where we're not hitting the extrema of any of these three things.

Chris Potts:Right. And this might just be my position in the landscape, but I was with you that we can always scale on the model size and the amount of training, because that does seem like an engineering effort where progress has been very rapid and continues to be rapid. I'm less clear on that for the data part. I think it's going to take some real creativity, and certainly different kinds of expertise, to break us out of where we are now when it comes to the data these models consume. What do you think of that?

Percy Liang:Yeah. It's interesting. On one hand, you could say that the amount of high-quality data – there's just not that much. There's only so much of Shakespeare. There's only so much of Wikipedia. We're basically training on all of it. What else is there? And we know that data quality does matter. On the other hand, the amount of data we're training on is an insignificant portion of the entire Internet, even just looking at text.

One other thing I'll mention is that the amount of private data that companies have is much larger than the amount of data that we could conceivably train public models on. And this is not to mention the multi-modal aspect, where the amount of images, videos, and audio files is orders of magnitude larger just by virtue of being a different, non-text modality, where it's not as compressed as text. So I think the opportunities for leveraging that data for foundation models – because foundation models aren't just language models, they're multimodal – I think we've hardly even started there.

Chris Potts:Totally agree. Yeah. I've been thinking that, for my money, instead of pouring in more text, I would pour in more video and sensor readings and other things like that, but the problems remain the same. We should return to this later. A lot of that data is, by definition, private, and it's certainly difficult to wrangle it all into a data set that we could use to train a system.

Percy Liang:Yeah. Not just technically difficult – it's also problematic from an ethical standpoint.

Chris Potts:Yeah. That's the issue I definitely want to return to. Yeah. But before we do that, so what about technical innovations around things like how to train the models or how to make them safer, more trustworthy? Are you seeing some exciting things in that space?

Percy Liang:Yeah, there are some efforts here. So it's now commonplace to not just think of language models as "predict the next word"; they need to be fine-tuned in some way – for example, based on human preferences. OpenAI has been doing this with their newer GPT-3 versions, which are much better. Anthropic has been doing similar things. This can lead to models which are perhaps much better, but it's more costly because now you require crowdsourced annotations. This is also related to a number of efforts trying to do, basically, instruction tuning, where you start with one of these large language models and then you train it to do various tasks in the hope that it will generalize to other tasks. And then you can kind of tune these.
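To make the "fine-tune on human preferences" ingredient concrete, here is a minimal sketch of the standard preference-modeling step, assuming a Bradley–Terry-style reward model; this is my illustration, not OpenAI's or Anthropic's actual pipeline, and the keyword featurizer and tiny preference set are made-up placeholders (real systems use the language model itself as the reward model and many thousands of human comparisons).

```python
import numpy as np

# Toy preference data: (response annotators preferred, response they rejected).
pairs = [
    ("Here is a careful, sourced answer to your question ...", "idk lol"),
    ("I can't help with that request.", "Sure, here is how to do something dangerous ..."),
]

KEYWORDS = ["careful", "sourced", "dangerous"]

def featurize(text: str) -> np.ndarray:
    """Placeholder keyword features; a real reward model reuses the language model."""
    return np.array([text.count(k) for k in KEYWORDS], dtype=float)

# Linear reward model r(x) = w . phi(x), fit with the Bradley-Terry objective:
# maximize log sigmoid(r(chosen) - r(rejected)) over all preference pairs.
w = np.zeros(len(KEYWORDS))
learning_rate = 0.1
for _ in range(200):
    for chosen, rejected in pairs:
        diff = featurize(chosen) - featurize(rejected)
        p = 1.0 / (1.0 + np.exp(-w @ diff))   # model's probability that `chosen` wins
        w += learning_rate * (1.0 - p) * diff  # gradient ascent on the log-likelihood

# The trained reward model should now agree with the human rankings;
# a second stage would fine-tune the language model to score highly under it.
for chosen, rejected in pairs:
    print(w @ featurize(chosen) > w @ featurize(rejected))
```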

So I feel like these are definitely allowing us to make progress. Models are getting better. I wonder how far you can get with these methods, especially when it comes to things like truthfulness or safety, because it feels a little bit reactive. The model does something bad, you say, "Don't do that." And now the model doesn't do that anymore, but it does something else bad. And how many examples? What's the sample complexity of taming this model? Because the surface area with these models is so vast, and even if you could hit everything, maybe there would still be adversarial examples or exploits of these models. Having spent a few years working on adversarial examples, I think I'm very pessimistic about kind of having completely robust guarantees on large models, just because they're kind of a leaky abstraction.

Chris Potts:Oh, and "guarantees" feels very strong! I mean, even people are of course susceptible to adversarial attacks. They're just kind of different from the ones that we encounter with machine learning artifacts, but everything seems susceptible to being tricked in some fashion. Right?

Percy Liang:Yeah. I mean, I would like to hold our machines to a higher standard in some ways. I think too often the argument goes, "Oh, people do that, so it must be okay." Well, there was a time when human performance was the pedestal we were aiming for, but we know that these systems are already superhuman on many axes. So on the other axes that matter for safety, why would we lower our standard there?

Chris Potts:Well, that's so interesting because I was going to say maybe it would be reassuring if their adversarial susceptibilities were similar to humans, because then at least we would understand them at a human level. But what I hear you saying is, "No, no, no, we can do much better. And it would be very dangerous to aim for mere human level!"

Percy Liang:Yeah. I think these are very alien objects. So I think good luck in trying to get them completely to align with humans, but also we should do better.

Chris Potts:To be superhuman in this respect. Yeah, that's certainly going to be important if we're going to deploy these systems in all the ways we expect to deploy them.

Percy Liang:Yeah. I mean, in some sense, we already have this with computers. They're superhuman. They're robust and reliable along so many dimensions that humans couldn't even dream of. And so with AI, I feel like there's a kind of a lowering of expectations here because of the framing around intelligence and how it's trying to mimic humans, but we should hold them to the bar of computers.

Chris Potts:That's lovely. That's a nice transition to my next question, because like, yes, my computer is superhuman when it comes to doing addition, but GPT-3 is not superhuman when it comes to doing addition. GPT-3 is some weird kind of human with its own sort of clear lack of understanding of what this operation is like. So that kind of leads me into the question of like, what are the breakaway successes that you've seen for foundation models in general? Where are they being used productively?

Percy Liang:Yeah. So I think foundation models are being deployed in many places. One of the challenges is it's not clear where exactly they're being deployed. If you count BERT as a foundation model, which I would, then they're everywhere. How would you do NLP today in a kind of scaled-up realistic setting on real text without something like BERT? So, in that sense, those are deployed.

Now, maybe there's an important distinction between BERT, which provides embeddings that help language understanding, versus generation. So now you can ask, "Okay, what about generation? Where are those models deployed?" And that is a tougher question and very, very interesting. There are cases where generative models can be safely deployed, even in a generative capacity, provided that there is a restricted state space.

So imagine a generation system where you have a set of templates that defines a grammar. You could still decode with the language model essentially as a kind of re-ranker, and you can do this as you decode from the grammar. This is a good way of maintaining safety while still gaining the benefits. Of course, if you want to have it generate freely – because it's much more powerful than decoding from your grammar, which is the whole point of these learned models – the natural place to do it is in various kinds of human-assisted settings.
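A rough sketch of the template-grammar-plus-re-ranker pattern being described, under my own illustrative assumptions rather than any specific system: every candidate comes from a fixed set of templates, so the output space is restricted by construction, and the language model only chooses among those candidates. The `lm_score` function is a toy stand-in for a real model's log-probability.

```python
from itertools import product

# A tiny template "grammar": every output the system can produce comes from this safe set.
templates = [
    "Your {item} will arrive on {day}.",
    "We have rescheduled your {item} to {day}.",
]
slots = {"item": ["package", "order"], "day": ["Monday", "Tuesday"]}

def expand(template: str) -> list[str]:
    """All fillings of a template's slots from the allowed vocabularies."""
    keys = [k for k in slots if "{" + k + "}" in template]
    return [
        template.format(**dict(zip(keys, values)))
        for values in product(*(slots[k] for k in keys))
    ]

def lm_score(context: str, candidate: str) -> float:
    """Toy stand-in for a language model's log-probability of `candidate` given `context`."""
    overlap = len(set(context.lower().split()) & set(candidate.lower().split()))
    return overlap - 0.01 * len(candidate)

context = "When will my package arrive?"
candidates = [c for t in templates for c in expand(t)]
best = max(candidates, key=lambda c: lm_score(context, c))
print(best)  # a grammar-guaranteed output, chosen by the (stand-in) language model
```

The language model never emits free text here; it only ranks strings the grammar can already generate, which is what keeps the output space restricted while still capturing some of the model's benefit.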

The primary example, I think, is GitHub Copilot, which is based on OpenAI's Codex model. So this is a language model that generates code. And this is actually very useful for suggesting code and completing lines – call this function, use this variable. And the nice thing about it is that there's a human in the loop who's hopefully paying attention and making sure that the code is correct. And you can also think about this in writing contexts, which is something we've been studying – how do humans interact with these language models?

So the first category was BERT. You just use them in your existing applications and it just is a better representation. The second application area, I guess I would say, is human in the loop, which feels relatively safe. You have to still be careful, because GitHub Copilot can generate potential security vulnerabilities and, if the person is asleep at the wheel, I guess, so to speak, then you could get some pretty bad code. And also with text, you can obviously get bad text. So it's not completely solving the problem, but it's mitigating some of the harms.

Now, in the third category are things like open-ended chat bots where you're just talking to a bot. That, I think, is quite risky. So Google has put out this LaMDA model, which they recently released under the AI Test Kitchen, which they're inviting people to experiment with, but this is labeled as "Danger, danger; use at your own risk!" And these are definitely not ready for deployment at scale. This third category is, I guess, what you could call full autonomy, if you want.

Chris Potts:Let's concentrate on the generation cases because I think that really is tied up with why we're talking about all this stuff at all – this surprising capacity of these new models to generate, whereas before we had models that mainly could just represent and classify. And a lot of the dangers come from generation, I think. And I'm glad you pointed out the Copilot – the Codex model – because that seems like a case where people are getting real benefit from generation. The other areas that come to mind for me are all about creative self-expression, like people generating visual art with models like DALL-E 2, and people doing creative writing, as you mentioned. For the more hard-nosed things like code generation, do you know of other success stories that are outside the creative arts?

Percy Liang:Yeah. So in the category of generation, but not creative. I guess you could argue that code can be creative, but-

Chris Potts:Sure, sure.

Percy Liang:Yeah. It's a good question. I mean, I feel like writing in general is so pervasive, everywhere. Soon maybe we'll see these models being used to complete things in your email or Google Docs. And so any place where you're typing text, I imagine these models potentially having a say in what you type, which is interesting because, on one hand, it's like, "Oh, wow. It saves me a lot of time. I'm so efficient now." But it's also really important to think about the flip side, which is, basically, someone's controlling what you're doing. And there are certain biases that can arise as a result of that. And are you really crafting this message, or are you just more passively approving something that someone else wanted you to say? So this really, I think, gets into a really interesting question about ownership and expression.

Chris Potts:Absolutely. We actually had this discussion for my spring course, because Richard Socher visited and he showed us a bit of how you could use you.com to essentially write part of a lit review. And so then the question is, "Is Chris Potts, as an instructor, okay with his students writing their lit reviews using these AI generation tools?" And my first answer was, "Sure, why not? We can just require students to review 30 papers as opposed to seven, because they'll get all the benefit from automation." But then I started to worry – I'm not sure how to articulate this – that my students might not be in the counterfactual state you need to be in to verify that it's a good lit review, or correct it if it's not, because having had it generated, they won't have the expertise required. But I kind of felt like, for myself, if I could generate one and then review it and approve it, then that would be totally fine with me, I guess?

Percy Liang:I mean, in the context of the course, the question is whether you can actually tell the difference. Maybe with GPT-3 you'd think it's a great lit review, whereas it's just generating nonsense.

Chris Potts:Well, that's very pragmatic though. All that matters is whether we can tell whether it was generated by a student or a machine, not the mental state the student is in? I guess that's fair. All we've got is behavioral evidence.

Percy Liang:Right. I mean, this was a core issue in math classes when calculators came along. Do you allow students to use calculators for math tests or not? Do you allow open-book or closed-book exams? All of these have their pros and cons in terms of trade-offs.

Chris Potts:But for the most part, at the university level, my response to all that has been: definitely we have exams that can be totally open book, and certainly allow people to use calculators, because that's a cognitive aid that I want students to benefit from. But the lit review thing puts me back into an ambiguous state because I don't know that they'll acquire any expertise from using GPT-1000 to write part of their paper for them.

Percy Liang:Well, maybe they should be allowed to use it, but then the test should be different. Maybe the assignment can't be "write a lit review," it should be like "publish a paper."

Chris Potts:That's right! So expectations go up and up and up!

Percy Liang:And with calculators now you're not spending so much time crunching numbers, but now you're focused on higher-level cognitive tasks. Maybe that's what happens with these technologies in the optimistic case.

Chris Potts:There's one remark you made earlier that I want to return to, which is: we were reminiscing about the moment when we thought things had changed. And it was, for both of us, around GPT-3. My response was to just think about it and try to do research in this new context. Your response was to found an entire center. Can you say a little bit about the impetus behind that and why you decided to do it?

Percy Liang:Yeah. I never thought I would... I wasn't looking to start a center, to be honest. I think where we started was, I had a quick discussion with Chris Ré about, "Hey, why don't we train a GPT-3 at Stanford? Why not? What's stopping us?" And it was almost kind of a whimsical thought experiment, which then led to looping in Fei-Fei and Chris Manning and others from HAI on this kind of crazy idea. And then what happened was that it became clear that, while there was a lot of energy and enthusiasm for this area, merely replicating the model, although it would be impressive and interesting, isn't really the full story. And then we started talking to other folks at HAI and thinking about what else academia could bring.

The center kind of grew organically, bottom up. I went around – this was in the spring of... Wait, 2021. I'm getting my years mixed up. Just talking to people. And it seemed like there was just a lot of interest. People were already working on various forms of foundation models or were really excited or interested in them. And it was just kind of opportunistically capturing the momentum and deciding, even internally at Stanford, to allow all those folks to come together and discuss ideas, especially for application areas – letting people in the Med School share their problems and connect with folks in AI who would partner up and collaborate. So I felt the center initially was really a way for people to share experiences and find collaborations. I think many arose out of that process.

Then, on the other hand, well, we've been thinking about what makes sense for the Center to take on as big projects. And one major thing is, I think we are uniquely in a good position to kind of evaluate and benchmark and document what's happening – the state of the world regarding foundation models. Really critique. And if there are problems with them, talk about what they are, but keep it very grounded to the artifacts that exist and applications that exist. So there's a number of projects that are kind of in the works along those lines.

We are still working on training and scaling up models. I don't think we'll ever compete with OpenAI or Google on this front. But I think getting to a kind of moderate level of scale is almost necessary to do some of the research in this space because of emergent behavior. And I think that you can actually get a lot of value by just scaling, even to, let's say, 20 billion parameters, without going to 175 or 500 billion. And so the value there would be to help Stanford, but also the community, have these tools and models that could enable the research, because it's more the means to an end. The scaling up is a way for us to kind of study and improve these models, because we're not interested in commercializing these models that we train.

Chris Potts:I'm glad you mentioned the diversity of voices associated with CRFM, because that's really what I think about when I think about the Center, and not so much these technological artifacts. And this is why it was so painful for me to see on Twitter that people were describing the white paper as written by "the entire CS Department," which wasn't true in either direction, given all of the contributions from people elsewhere on campus. Can you say a little bit about who else is involved and what kind of perspectives they bring?

Percy Liang:Yeah. There are a number of people outside Computer Science who have made large contributions to the thinking and framing around the whole center. Just to name a few that I've interacted with most closely: Rob Reich, who is a political philosopher, who is really interested in the political economy of these models. Now that they're out there in the world, what are the ethical problems when you have centralization or scale and so on? Or Erik Brynjolfsson, who is an economist, who I've been collaborating with to understand the economic impact of these models. Dan Ho, who's at the Law School, who thinks about both law for foundation models, which involves things like copyright or the different privacy risks that come about with these new technologies, but also foundation models for law – using these technologies to enable building new kinds of legal applications for legal reasoning and so on. Or Russ Altman, who's in biomedicine – we've been brainstorming a bit about how you can train biomedical models that can be useful for a number of applications, maybe scientific discovery or more in the clinical space as well. Those are just some examples of people who, I think, have contributed a lot to the Center.

Chris Potts:I've been meaning to ask you: is Jeff Hancock from Communication involved at all?

Percy Liang:His name has come up a bunch, and he would definitely be a natural person to talk to.

Chris Potts:Absolutely. The fresh angle he would bring is that he and his students have been working a lot on how regular people understand these technologies – their mental model of what a large language model is like, or even what a recommendation algorithm is like. And I feel like that needs to be part of any policy discussion we have. What is the public conception likely to be? And where is it at odds with what we know about the technologies themselves?

Percy Liang:Yeah. That's a great point. Yeah. I mean, I think there's a lot of confusion out there in terms of what these models are, even among people who are closer to the technology, because, on one hand, you see that these models are biased, they're toxic, and have all these problems. But then there's some mysterious process by which these models are actually deployed in industry. And there's kind of a disconnect there. How do these problems relate to the actual harms that real people are experiencing? One of the things that we're trying to do is to close the gap and provide more transparency around these, so that the academic community, who looks at these models as kind of research artifacts, can actually think about the downstream consequences more as well.

Chris Potts:Oh, that's another question I've been meaning to ask you: is it special and important that CRFM is an academic institution as opposed to one that was founded inside the tech industry itself? What does academia bring that's special here?

Percy Liang:Yeah, definitely. Besides the variety of expertise – which certainly shapes the way that we think about things – I think the point that can't really be negotiated is that we are an independent institution. We're situated in academia. The incentive structure is different. Our goal is to do research, to talk about the state of the technology just for the benefit of the public good, and not to commercialize.

There is some overlap between the things that we would want to do research on and what a commercial entity like Google would do, but fundamentally we're coming from different places.

Chris Potts:Yeah. So I love that answer. And certainly, I personally want that answer to be true, but I worry a little bit, because I know that I'm a person subject to biases, and I also know that my students often get jobs at places, like these big tech companies, and we often get funding from them, and it feels like our worlds are pretty interconnected at this point. And so, although I like to think of myself as independent, I worry that I'm being influenced by all those interconnections, especially the one around my trainees getting jobs at these companies and how that might shape the kinds of things that I'm willing to say or not. So essentially my own independence. Do you worry about that for the Center?

Percy Liang:Yeah. This is definitely on my mind, and it's a really good and tough question. Of course we have students who go off into industry, and we are connected – we have many friends in industry and talk to them. I do think that there is a separation. I guess one question, to ground things out, is: what is something that we would do differently if we were completely independent or completely attached?

So, for example, if we discover there are some problems with these models that we want to get fixed, we have the ability to go out and basically make a huge ruckus – basically, be very critical of these models. There are some organizations who would probably do that. I think our approach would probably be a bit more constructive: saying that there are these problems, highlighting them, and giving companies opportunities to fix them.

Now, if they don't fix them, then I think we have a right to go and challenge them and say that these are things that are problematic. It's analogous to the security world. When a security vulnerability comes up, it would be sort of reckless to go and publicize it so that everyone can exploit it. You want to notify the organization and give them an opportunity to patch it. If they won't patch it, then I guess at some point it has to be reckoned with. So I think our stance is independence, but there is a certain amount of collaboration and constructiveness there.

Chris Potts:That is a wonderful analogy, I think. And that is the test case. Given the set of experts that you've assembled, it seems very possible that in the next few years, that group would come to some consensus about how you would evaluate a foundation model for its safety and deployment to an unknown population, say, or a population in the U.S. And then suppose, further on, that in good faith, one of these large tech companies has provided CRFM access to the model to audit it because they want to know, but they've also invested heavily in that model and are counting on deploying it, and CRFM discovers that it's not safe for deployment and maybe in a way that's not salvageable. And then the security thing kicks in – so you'll say, you have a few months to withdraw the model or fix this, but for whatever reason, the company can't, or won't do that. That's the test case for CRFM's true independence, right?

Percy Liang:Yeah. And I think it will be interesting when that comes up. I think the thing that's difficult about these things is that companies obviously have way more context about what's going on on the ground. CRFM is operating sort of under a veil when it comes to actual harms in the world. And one of the things that we've been trying to do is to get companies to be more transparent about these things. And I think one of the benefits of that is that you can actually make reasonable and well-informed decisions. At some level, a company will know better about the actual harms on the ground, because we only have partial information. So transparency, I think, is one thing that we want to promote.

Another is thinking about the process. For example, just take the issue of model release. When can, or should, people produce models and give them to researchers for academic use? There's a lot of disagreement about when this is allowed or not. So, for example, Google and DeepMind are pretty tight-lipped about giving access. OpenAI is somewhere in the middle. Meta just released their model fully openly. Obviously different circumstances, different models, and so on. But we wrote a blog post recently about how the problem here isn't necessarily the individual decision, but the lack of legitimacy when each organization just comes up with its own decision by itself. And what we really need is a much more coordinated view and community norms around how to make these decisions jointly, because it's something that affects everyone. You can't have people making unilateral decisions about climate change, because we live in a shared space.

Chris Potts:So that's a nice transition into this event that you just did on Twitter with representatives of OpenAI, Cohere, and AI21. They did get together as a group and publish a set of guidelines – some rules to live by for developing, deploying, and studying large language models. And this was sort of your chance to interrogate them a little bit. And I certainly applaud the effort. I think it is noteworthy that these are important, but still small-scale, players. And maybe it's good that they're setting some guidelines up now while they still are small. But yeah. Can we talk a little bit about the event? Does that sound good?

Percy Liang:Yeah. Sure.

Chris Potts:Actually, first I wanted to ask you, though, about the Twitter Spaces thing – that seemed kind of interesting to me. Should I be doing this podcast on Twitter Spaces?

Percy Liang:If you want to operate everything on your tiny mobile device, then Twitter Spaces is a great way to do that. If you, like me, prefer a larger screen, then maybe it's not the time yet!

Chris Potts:And what about the interactive component? It sounded like you regretted not being able to involve more of the community because of technological stuff.

Percy Liang:Yeah, it was my first time using Twitter Spaces. And I think, for many of the people, there were some rough edges. The general idea seems appealing: you can reach a larger crowd, and many people would join. But the norms around when people can chime in and discuss – it wasn't very much a community-based discussion, because, in the small amount of time, we got one audience question, whereas I feel like normal Twitter is actually much better for involving a large set of voices. So that's something to be figured out.

But going back to the statement and the Twitter Spaces event – so a few weeks ago, OpenAI, Cohere, and AI21 Labs put out the statement on best practices for deploying large language models. It was a short piece on how we – meaning they, I guess – should monitor usage and prohibit misuse. They should take steps to ensure that these models are safe and also involve the broader community. There weren't too many details. I think it was well-intentioned. They agreed that it was a first step in saying, "Okay, we need to do something about that." So it's really just an opening move, in a sense. So I think the real question is, what happens next? Since then, not much has happened. But I'm hoping that we can use this as a way to actually make some progress and do some work to develop community norms and set targets for deployment.

For example, unless your model passes certain safety standards, you just cannot deploy it. These are things that are standard in other industries, but we don't have anything like that for large language models. And it would be helpful if companies could commit to targets, just as with carbon emissions, where countries pledge to hit a certain target by a certain date. All this stuff still needs to be figured out. How we're going to do it is still kind of up in the air, to be determined, and there's still a lot of work to be done, but hopefully this sets a tone where maybe they're willing to listen and incorporate feedback from the community.

Chris Potts:Right. I agree. But this might be an example where we need more, not only stakeholders, but also people with expertise to get involved in making these statements. Let's take one example: responsibility. You described some measurements we could take to determine whether a model was safe, and that would presumably be some combination of auditing and benchmarking. So suppose we have those standards in place, and then I'm a customer of one of these companies and I use the model and it misbehaves, it sends out spam, or it says something hateful in ways that I didn't intend. Whose responsibility is that? The company could say, "We satisfied all the requirements of deployment. You, the customer, are at fault, because you must have been misusing it, given that we offered you this guarantee before." Whereas my response might be, "You have clearly sold me a faulty device and whatever testing you ran was insufficient. And therefore you are at fault, not me." And it's just unclear to me how those notions of responsibility will be resolved. And I'm worried a little bit about the legal system taking over and resolving them for us, because that could be unpredictable from the perspective of us as technologists.

Percy Liang:Yeah. This is a really good question. In some ways, places like Google might have it easier, because if you encapsulate the end-to-end product, where the foundation model is just something inside, you're ultimately responsible for whatever happens. But if we are moving to a world where there are going to be foundation models offered up as services, as APIs – which seems like a likely future – then, yeah, the question is, what is the contract? When someone buys the service, what can it do and not do? That's why benchmarking is so important. We need a nutrition label or a spec sheet for these objects that we're selling or people are buying and using.

At least initially, it would just be completely impossible to capture everything. And there's also gaming that happens, where we say, "Okay, on these 10 benchmarks, we do well." Well, naturally, the companies would say, "Okay, let's work on making these benchmarks go down." The problem with this is that there's just a vast surface area of other risks that aren't being taken into account. And like you said, what happens if someone encounters one of those other things? Well, then I guess we would have to amend the benchmark and proceed from there. But yeah, there's no great answer for this.

I do think that we can look to other industries who have gone down this path and have much more mature ways of thinking about this to inform us, because I think for AI, and foundation models in particular, this is just sort of unknown territory.

Chris Potts:I think that's exactly right. And I'm so glad you mentioned the contracts because my initial instinct there is to say, that is what matters. Guidelines are unenforceable, but these contracts, and the question of what can be in them and what's enforceable from them, is really the issue. And that's where all of these things will be decided. And so I wonder whether our efforts should be oriented toward having some norms around what can and can't be in those contracts, those terms of service.

Percy Liang:Yeah. I think the difficulty is being specific enough to actually have some sort of falsifiable or verifiable property.

Chris Potts:Well, when you use software, you've got a very specific document, the terms of service, which you agreed to – that is a current response to all this indeterminacy, which is just more pages. And then the question arises, what parts of it are actually enforceable?

Percy Liang:Yeah. Software, yeah. That's a good point. I mean, in terms of complexity and ease of reasoning about things, traditional software is much easier. You can write down a few principles and you kind of capture what there is to know about the piece of software. With these models, there are just unintended things you can do with them. That's part of the power: you can come up with a new task that no one on this planet has ever done before, and chances are GPT-3 will probably do something with it. So there's no way of saying something about the infinite number of tasks that you could possibly try and the applications you could build – that's a challenge. But one of the things that we're trying to think through is, how do you even conceptualize this space? Because I think we lack a framework for thinking about the capabilities and risks.

Chris Potts:Let me give you one more scenario in this area. The first part of this statement, as you said, is focused on what they call prohibiting misuse. And that made me nervous, because I'm not sure who's doing the prohibiting or what's going to count as misuse. And they give clear examples like, no spam, no fraud, no astroturfing, but actually at least two of those cases are already interesting gray areas for me, because suppose I'm using the model to share what I think is vital information about world events and someone comes in and says, "That's astroturfing" or "That's spam, shut it down." Who is going to resolve this issue that I have between my product or service and this person who's saying it's violating the terms of service?

Percy Liang:Yeah. That's a tricky question. I mean, it has to do a lot with the challenges of content moderation.

Chris Potts:Exactly. Right there. Just stumbling into the same area that we know is so fraught. Yeah.

Percy Liang:Yeah, it's going to be imperfect. So I don't know. For better or for worse, we can lean on all the great practices in content moderation that we've seen – or at least we've seen that movie before! So maybe we can do a better job, at least for foundation models!

Chris Potts:How would you feel in a hypothetical future where someone brought in CRFM as an entity to resolve some of these things – just called on its expertise as a kind of independent auditor?

Percy Liang:I think we would have to build up that expertise and capacity. It is definitely possible. So one of the things that is unique about HAI, and also by kind of transitivity CRFM, is that, normally in an academic institution, there's research that happens, there's teaching, but there's no reason that those have to be the only things that happen. And HAI has sort of a policy branch. One could imagine that part of CRFM's effort is doing these audits and trying to do things which are not necessarily about publishing papers or research, but the practice. We don't have a staff or a team doing that right now, but it's certainly something that would be possible. And I think CRFM would be uniquely positioned to do that. But again, it's just a matter of prioritization and timing.

Chris Potts:Interesting, though. So, for another hypothetical, setting aside constraints on time and resources, just think about the expertise that CRFM represents across all those different disciplines: suppose one of these startups, with the best intentions, came and said, here, we have trained this massive model. We want to write a really bulletproof model card for it. And maybe a data sheet for the associated data set that it was trained on. We want you to be the auditors. Find out everything you can, so we can really disclose everything that's possible. Do you have a sense for what you would do or would you just have to refuse the job on the grounds that it's totally unclear what success looks like?

Percy Liang:I mean, first of all, I think that the data sheets and model cards are only a very small fraction of what is necessary, because there's the data and the model, but it's really about how this model is actually used downstream and the potential harms that can arise there.

Chris Potts:Well, that's covered. That's one of the questions. What are the uses that you know about and what are the approved uses? I think that's even covered by the data sheet.

Percy Liang:Right. But I think the point is that there's a raw model, and then there's how that model is being deployed. And I imagine that, in the future, it won't just be these raw language models that people get exposed to. They have too many splinters. You want to sand them down. You want to put polish on them. There'd be other components that allow the system to behave more properly. In that world, what we should really be doing is looking at the whole pipeline. We have the model – that's important to know – but it's also important how you're using this model downstream. And I think it would be really useful and productive for, again, hypothetically, CRFM to say, "Well, let's see. How are people using it? What are the use cases? Where are the logs?"

Now, of course, this interacts with privacy, which is, I think, another thing that needs to be sorted out. Because if you're perfectly private, you don't collect any data, but then you have no idea of what people are doing with the tool. And in order to also solve some of these problems, to detect misuse, you have to log something. So there's a tension there between privacy and misuse.

Chris Potts:That's already really substantive to me though. Because what I hear you saying is you would have to refuse the job unless you were actually given access to the full system it was embedded in. It's sort of like saying, I can't tell you whether this vehicle – like a truck – is safe, unless you tell me exactly how it's going to be used. There's no absolute test of its safety, because I just don't know what you're trying to accomplish.

Percy Liang:Yeah. I think that's right. Because you don't know if they're driving in snow or up a big hill or through a puddle. And so the context matters a lot here.

Chris Potts:But to return to CRFM's current projects a little bit, I know you all are deeply embedded in this large diverse benchmarking effort, which I take it is oriented towards some of the themes of the guidelines around safety and auditing and setting up some standards. Do you want to say a bit about the goals of that project and what's been happening?

Percy Liang:Yeah, so the project is still underway, so hopefully there'll be more information about it soon, but I can say briefly that there's maybe two noteworthy things to point out.

One is that we're trying to make benchmarking more systematic and modular. What I mean by that is, often, benchmarks are a basket of tasks. Each task has inputs and outputs, you train or you prompt a model, and then you read out some evaluation. But I think what this misses is the fact that there are a lot of crosscutting concerns. Accuracy, bias, robustness, toxicity, efficiency – all of these are metrics that need to be assessed for every, what we call, scenario. So, for example, if you're doing question answering, you shouldn't just be measuring accuracy and then have a separate task where you check balance on gender pronouns. You should be thinking about bias in the context of the tasks. So that's one of the main thrusts of this benchmark.
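A minimal sketch of the "every metric for every scenario" layout being described, under illustrative assumptions of my own; the scenario names, the two toy metrics, and the placeholder model below are not CRFM's actual benchmark, just a way to show the crosscutting structure.

```python
# Sketch of crosscutting benchmarking: every metric is computed for every scenario,
# rather than attaching one metric to one task. All names here are illustrative.

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def toxicity(prediction: str, reference: str) -> float:
    """Placeholder: a real benchmark would call a trained toxicity classifier."""
    return float(any(word in prediction.lower() for word in ["hate", "stupid"]))

scenarios = {
    "question_answering": [("What is the capital of France?", "Paris")],
    "summarization": [("Long article about climate policy ...", "A short neutral summary.")],
}

metrics = {"accuracy": exact_match, "toxicity": toxicity}

def dummy_model(prompt: str) -> str:
    """Stand-in for a foundation model behind an API."""
    return "Paris" if "France" in prompt else "A short neutral summary."

for scenario_name, examples in scenarios.items():
    for metric_name, metric_fn in metrics.items():
        scores = [metric_fn(dummy_model(x), y) for x, y in examples]
        print(f"{scenario_name:20s} {metric_name:10s} {sum(scores) / len(scores):.2f}")
```

The point of the layout is that bias, toxicity, robustness, and efficiency get a column for every scenario, instead of living in their own separate tasks.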

The other thing is that we're experimenting quite a lot with interactive evaluation. By that I mean: mostly we think about evaluation of NLP systems as automatic – you show up with a model or a system, we ask it to do task X, and we see how well it does. But one of the exciting things about these models is that they're good enough to warrant interaction. So with GPT-3, you can use it to collaboratively write a story, you can have a dialogue with it, you can use it to brainstorm ideas. And so what we'd like to do is measure the human plus machine together as a unit, rather than thinking about the LM as just a kind of isolated entity. And that, I think, really is a reflection of our deep commitment to these models as augmenting humans, rather than replacing them.

Chris Potts:It sounds wonderful. And so what will the work products be? Are they going to be reports, data sets, protocols, software?

Percy Liang:Everything.

Chris Potts:Everything!

Percy Liang:So there will be papers. There will be a code base for people to build on. One thing that I would like to have is a website where one can actually browse the benchmark results – look at the actual predictions of models on the various tasks, including the interactive ones. Because I think all too often, we read papers, we see charts and graphs and everything's going up, but it's not really clear what's going on inside them. What are these tasks? How are models actually failing? When a number goes up from 35 to 40%, we declare success, but what about the other 60%? I think having much more transparency about the detailed interactions and predictions can alleviate some of this mystery.

Chris Potts:And so is this all going to culminate in a proposal for how, if I have a model that I want to release into the world – whether as part of a system or not – I would first release it into this ecosystem and learn a lot about its limits and things like that? So it would be a kind of auditing of the system?

Percy Liang:This would be a benchmark that anyone producing a large language model could, and probably should, run. How they use the results of it to inform their release or deployment decisions – we don't have much to say about that, because it depends on exactly how you are deploying it. I think it's a follow-up project to try to connect those two, though. And we have some other thoughts on how to relate the properties of the upstream model to the ramifications and social consequences of downstream applications. Those two things are distinct. Often people conflate them, and we need to think about both simultaneously.

Chris Potts:It's interesting. You're toeing the line of being an auditor, if I understand correctly, because you're going to have this system, and people are going to get information from it that they might be quite happy about and in fact brag about, and maybe even use it as the basis for a claim that they were behaving responsibly in the world, but you definitely don't want to commit to that. You're not indemnifying them or anything.

Percy Liang:Yeah. There's no CRFM certificate that you get when you do well on this benchmark. I mean, honestly, we really want to move away from this idea, typical in AI, that there's a leaderboard you're trying to climb. Really it's multiple metrics, and there are just tensions. You can't have everything, and you have to make trade-offs depending on what you want to do – efficiency, accuracy, bias, and so on. So what we want to do is at least provide the transparency that these are concerns and facets that are important to report. And then it's a very contextual decision about what you do with this. If you are using it just for ranking, maybe none of the generation stuff matters, but in all cases thinking about the social bias aspect is probably important.

I don't claim that we've figured out the right metrics for that either, because I don't think there are "the right" metrics for bias on an upstream model. You can measure various things, and the more things we measure, the better the chances that some of these will be relevant for the downstream applications.

I think it's important to realize that these benchmarks are so, so limited – such a small drop in the bucket compared to the vast space of capabilities out there. The amount of data and tasks in the 500 gigabytes of training text is way larger than in these benchmarks, which are just thousands of examples. You can't possibly cover everything with benchmarks.

Chris Potts:So much of our discussion is covered by what I believe should be called Strathern's law, which says when a measure becomes a target, it ceases to be a good measure. And that's kind of in the background of all of my questions about concerns, and maybe roles that CRFM might unwittingly play in a measure becoming a target.

Percy Liang:Yeah. I think this measure has to evolve over time.

Chris Potts:Yeah, exactly.

Percy Liang:Certainly, it has to be dynamic. And one of the things that this benchmark seeks to do is, by providing a framework for thinking about what the benchmarks are, we can not only keep track of what we have measured, but also what we would like to measure but can't, due to limitations.

For example, we want coverage of different, let's say, ethnic groups for all the data sets, but we don't have a data set of elderly people asking questions, or non-native English speakers, or Black speakers, asking questions or summarizing these types of documents. But noting that, I think, is important. So then when you come across an application where that becomes important, you can say, "Well, maybe we need to fill out this entry in the benchmark."
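[Editor's note: a hedged sketch of how one might track both what a benchmark covers and what it would like to cover but cannot yet, in the spirit of the gaps Percy mentions. The task names and speaker groups are illustrative assumptions, not the benchmark's actual taxonomy.]

```python
# Coverage matrix: which (task, speaker group) cells have a dataset,
# and which are explicitly recorded as missing. Entries are illustrative.
coverage = {
    ("question_answering", "general_web"): "covered",
    ("question_answering", "non_native_english_speakers"): "missing",
    ("summarization", "elderly_speakers"): "missing",
}

def report_gaps(coverage):
    """List the cells we would like to measure but currently cannot."""
    return [cell for cell, status in coverage.items() if status == "missing"]

for task, group in report_gaps(coverage):
    print(f"Gap: no {task} dataset for {group}")
```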

Chris Potts:Right. Let me ask you one more question along these lines and then let's talk about the future of CRFM.

Percy Liang:Yeah.

Chris Potts:My one more question is about privacy, because you've mentioned this a few times. And the way this comes to mind for me is I've been trying to fill out data sheets for all the data sets we release now. And one of the questions on the data sheet is, "Did the individuals in question in the data set consent to the collection and use of their data?" And for the data sets that I've been releasing, I can fairly confidently say yes, because of the ways we've been collecting them. But for anyone who's going to release training data for a large language model – and I think this should also apply to the model artifacts themselves – the answer is going to have to be, no. Those individuals did not consent. At least most of them didn't. So are we as a community just going to quietly tolerate a lot of "no" answers to this question, or what's going to give here?

Percy Liang:Yeah. I think you put your finger on something that often gets brushed aside. There's no way that any of these large language models are kosher from that perspective. There's not even a mechanism – even if you wanted to, how would you go and get consent for every single webpage out there? How do you even contact the person to get consent? So we're kind of in this sort of pickle by construction. Now, depending on your appetite, I mean, I think there's a spectrum. You can try to get consent for some things. And the idea of model cards and data sheets is that you're at least honest about what you're reporting.

On the other hand, yeah, I mean, I guess in principle, someone could challenge you on that and it's like, "Look, this model didn't offer consent." I don't think there's any legal action one can take because some people argue this is fair use, because it's certainly a transformative application, but of course, ethics and law are sort of different. So I think that a lot of this is just murky territory.

Now, there's things on the technology side which are perhaps interesting. You could imagine – we've been exploring some of these with synthetic data. We show that you can sometimes get some of the gains of these foundation models just by pre-training on synthetic data. And you might not even need real data in some settings. So this is not a panacea by any means, but it can mitigate some of the concerns.

You could imagine, let's say, retrieval-augmented models, which you've worked on a bunch, Chris, where you have a smaller model in which you don't try to pack in all the knowledge, but you can, at inference time, swap in the knowledge base – which maybe you do have consent for, or which you restrict appropriately to solve whatever task you want. So that modularity can help you control what data you do and do not use.

So these are just two examples, but I think we need many more examples to help us maybe dig ourselves out of this hole where we're just grabbing all these data points and training on them. But I think we're not going to fully get out of this hole.
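[Editor's note: as a hedged illustration of the second example – a smaller model with a swappable, consented knowledge base at inference time – here is a toy retrieval step. The corpus, the word-overlap scoring, and the consent flag are assumptions made up for the sketch; a real system would use a trained retriever and condition a language model on the retrieved text.]

```python
# Toy retrieval-augmented setup: the "knowledge base" is swapped in at
# inference time and filtered to documents with recorded consent.
knowledge_base = [
    {"text": "Foundation models are trained on broad data.", "consented": True},
    {"text": "Some scraped webpage with unknown provenance.", "consented": False},
]

def retrieve(query, kb, k=1):
    """Return the top-k consented documents by a naive word-overlap score."""
    consented = [doc for doc in kb if doc["consented"]]
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc["text"].lower().split()))
    return sorted(consented, key=overlap, reverse=True)[:k]

query = "What are foundation models trained on?"
context = retrieve(query, knowledge_base)
# The retrieved, consent-checked context would then be passed to the model
# alongside the query, instead of baking all the data into its parameters.
print(context)
```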

Chris Potts:So I'm completely in favor of pragmatic approaches to this, and I'm not an extremist. But to the description you gave before, you can imagine someone replying something like, "Well, hey, we want to build a database of genetic information about people, but what are we going to do? Ask everyone's permission when all this genetic material is all around us to be harvested? Let's collect it and not worry, because why should we have to approach everyone who's going to be in the database?" There is a legal response to that already. And you can imagine our society evolving to a point where every digital trace that we leave online is protected in the same way that our genetic material is. And therefore the answer is no, you can't train and deploy that model. Not without permission.

Percy Liang:Yeah. I mean, I think there's a really interesting future here. It's not clear that not sharing any data by default is necessarily the best outcome, because we do get a lot of value from pooling data, and we get better services as a result of it. One thing that has been discussed in what are known as data dignity circles – Microsoft has a bunch of folks thinking about this – is that, if you take someone's data, you should at least pay them for it. And one of the problems is that all this data is being harvested, and a few folks are getting a lot of value out of this basically "free data". But the people who actually spent a lot of labor producing it – the artists, the musicians, the writers – are not gaining anything from it.

So one thing you could try to do is rectify this. And to do that, you would need much better ways of thinking about provenance – where data comes from. I mean, I can imagine a future where, for every artifact that's created, there are some sort of terms on it – what people can do with it – which can change over time. And the organizations that use this data have to obey the terms. If they derive value, some of the value goes back to the original creators, and things can change. I say, "Okay, I don't want you to use this data," and they have to be able to remove it. I hope we can move to a much healthier ecosystem where there's far greater transparency and flow around data and predictive power.
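[Editor's note: a minimal sketch of what a provenance record with revocable terms might look like, in the spirit of the ecosystem Percy describes. The fields, the compensation share, and the revocation mechanism are all assumptions made up for illustration, not an existing standard.]

```python
from datetime import date

# Illustrative provenance record attached to a piece of creator data.
artifact = {
    "creator": "example-artist",
    "created": date(2022, 6, 1).isoformat(),
    "terms": {"training_allowed": True, "compensation_share": 0.001},
}

def revoke_training_permission(record):
    """The creator changes the terms; downstream users must re-check before use."""
    record["terms"]["training_allowed"] = False

def may_train_on(record):
    """Check the *current* terms at the time of use, not at collection time."""
    return record["terms"]["training_allowed"]

print(may_train_on(artifact))   # True
revoke_training_permission(artifact)
print(may_train_on(artifact))   # False: the data must be removed from future training
```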

Chris Potts:Great. Do you have time for two more questions?

Percy Liang:Sure.

Chris Potts:So the easy question is, what's up for CRFM in year 2? I also want to ask you the hard question of where you would like CRFM to end up in say, 30 years, when you're finally retiring from your position as director and about to enjoy a long retirement of just doing research? What kind of state do you hope the center is in? You can do the easy question first.

Percy Liang:Yeah. I don't know if CRFM will be around in 30 years.

Chris Potts:Oh, no. Plan for the future. Many centers on campus have lasted that long, and there's no reason to assume that CRFM wouldn't, so.

Percy Liang:Let's see. So the easy question, in the next year... So as I mentioned, the benchmarking effort is underway and that's going to occupy us for a little bit more time. We have some plans to create better infrastructure for people to do meaningful research on language models and also, more generally, foundation models going forward, scaling to maybe 20 billion parameters.

So I guess maybe just to kind of summarize this: you can think about three pillars. There's social responsibility, where we're trying to document, evaluate and increase the transparency of all these models to build community norms. So our plan is to kind of keep on pushing on that.

The second is technical advances. We haven't talked about this as much, but there are new ways of doing pre-training, there are different architectures on the more technical side, and there's a lot of work there. And what CRFM can hope to do is provide the infrastructure that students and researchers can leverage to make technical advances in those areas.

And the third is applications, where we are collaborating with members of other departments and other schools on campus to look at the different data sets and the challenges associated with them and, even using the existing technology that we have, to try to derive some value from that.

Then for your hard question, in 30 years. When I say it's not clear that CRFM will exist – it's not even clear that foundation models themselves will be a thing that is on people's mind, not because they will disappear, but because once things become mature enough, they fade, and all good technology kind of fades. If we're still having the same arguments about foundation models that we're having today, I'll be very disappointed in what happened in these 30 years.

Chris Potts:I think it's the sort of term that's pretty flexible and will survive a lot of technical innovation, frankly. So my answer would've been much more like – well here, let me get your reaction. My prediction would be that, to have an impact, it's going to need a large Washington lobbying angle, like in DC. Or you could say, actually, all you hope is that it's got 500 researchers doing interdisciplinary research you never would dream of in 2022, nothing to do with Washington DC. Or it could have founded startups all over the globe and empowered people on every continent to develop trillion parameter models. Yeah. We might as well imagine a future in which there are still foundation models, radically different from the current ones, but...

Percy Liang:All I meant to say is that the research questions, the way you think about things, are going to be different. So today we don't think about deep learning in the same way that we thought about it 10 years ago, when it was just first getting started. In 30 years, I think the abstractions and the terms that are of interest will be quite different. Of course foundation models, I think, will still be around in some form, just as machine learning as a term is still around, but it'll take on different significance and we'll think about it – what it can do and what it is – in a very different way.

Building off of your examples, I see what you're trying to get at with this question. I mean, hopefully this comes before 30 years, but I hope that CRFM will be a respectable player in terms of shaping the norms around how these technologies are built. And not only that – that there will be other institutions like CRFM at other universities, outside the normal commercial sphere, that have a similar goal.

I mean, to do all those things you mentioned requires much more than just one center at a university. And I also think that it doesn't make sense for a mission of that scale to be restricted only to Stanford. It should be something that's really decentralized and governed in a much more decentralized way.

But I hope that there will not be such a big gap between the folks that have large foundation models and the ones that don't, which is where I think the world is tending – these models are getting more and more expensive, to the point where maybe it'll just be a handful of players, that you can count on one hand, that can actually build the most cutting-edge ones. I hope we can avoid that future, through both technologies for opening this up and other types of action.

Chris Potts:But what do you think about my idea that, to achieve all of that, in 30 years, you're going to have to have lobbyists in Washington, DC, Beijing, and, say, Kinshasa – I'll just make a prediction about how the global situation will change. And that's the way you'll empower all those people and make sure the norms get set in the way that you want. I'm not volunteering for this work. I too will be retired, I hope. But what do you think about that?

Percy Liang:I think it's a possibility. I mean, this is definitely outside the scope of what I signed up to do, to be honest. I'm at heart still much more grounded and at home in what I can do personally. But for CRFM... Oh, okay. If you ask about CRFM, I think that's totally fair, because I won't be director at that point – it'll be someone else who sends the lobbyists to Washington – but I'm not sure that's exactly my cup of tea, to be honest.

Chris Potts:I totally agree!

Thank you so much for doing this, Percy! This was a wonderful conversation! I really appreciate it!

Percy Liang:Yeah! Thank you, Chris, so much! Thanks for the excellent questions and the discussion!