Podcast episode: Sasha Rush

October 4, 2022

With Chris Potts

Coding puzzles, practices, and education; structured prediction; the culture of Hugging Face; large models; and the energy of New York.

Show notes

Transcript

Chris Potts:All right! Welcome, dear listeners, to the CS224U podcast. My guest today is Sasha Rush. Sasha is an Associate Professor at Cornell Tech and a researcher at Hugging Face. He has made wide-ranging and highly-influential contributions to structured prediction, controllable text generation, explainability, and many other areas, and he's also done formative work on the practice of programming machine learning systems, from the low-level code to the design of our core libraries, and all of this work has led him to, among many other things, a very impressive collection of best paper awards at our top conferences.

Sasha, welcome to the podcast. I'm delighted that you could do this with me. I thought I'd start in a place that's really near and dear to me personally, and that's the wonderful ways that you've blended research and education in your work. For me, the first thing that comes to mind there is "The Annotated Transformer". Tell us, what is the Annotated Transformer, and what reactions have you gotten to it?

Sasha Rush:Yeah! First off, thanks for having me. It's great to be here.

The Annotated Transformer was a blog post that I wrote about Transformers and how they work, and it's in a style known as literate code, where it alternates between text and PyTorch code describing how the model works. The name "The Annotated Transformer" is inspired by this book, The Annotated Alice, which is this lovely edition of Alice in Wonderland that goes between the text of Alice in Wonderland itself and comments describing various jokes or songs underlying the text.

The blog post was a culmination of several different projects that I was working on at the time. From an NLP side, I was working on a library called OpenNMT, which was an open source neural machine translation library. We had been supporting this for about four years, all the various aspects of translation. I think, as someone working on practical machine translation, when Transformers came out, it was obviously useful right away, because it improved the state of the art on translation by a sizable amount. I don't think at the time I could have foreseen that it would become so important in all different aspects of machine learning, but it was pretty interesting to me as a translation researcher.

From a machine learning side, around that period of time, PyTorch had just come out and been popularized. As a person programming deep learning for NLP, PyTorch was revelatory. It was just very much an improvement over the tools that had existed up to that time, and it allowed us as researchers to build systems in a way that felt closer to the mathematical underpinnings, particularly in an area like natural language processing.

Then, on top of that, I was reading some PL work, particularly work from the Haskell community on literate code, some blog posts by people like Ken Shan and Edward Z. Yang, and how they so nicely described very hard technical problems in the Haskell community in a really fun way to read. Combining those influences, like NLP, ML, programming languages, and my own curiosity, I was like, "I'm going to try to write a really long blog post about a model that I'm thinking a lot about these days."

Chris Potts:I want to ask more about that, but I can't help just interjecting there. Ken Shan -- remarkable person. Do you know Ken personally?

Sasha Rush:I had the chance to meet him, actually, when I was an undergrad, because he was, I guess, working with Stuart Shieber. I forget if he was a grad student or a postdoc, but I had a chance to meet him then, yeah. He's an incredible person, yeah.

Chris Potts:He was an undergrad and a grad student there. Yeah, he worked with Stuart Shieber. He really was and is an incredible person. He swept into linguistics, brought in all of these ideas from programming languages, in my mind, revolutionized our theory of how quantifiers take scope, and then ventured even further out -- probabilistic programming and things like that. I remember my first meeting with him was at a conference at MIT and he came up behind me in line for something, like waiting for coffee. He just started expressing all of these ideas and I had no idea who he was, and I couldn't make heads or tails of what he was saying. I remember being kind of dismissive and now I think back on that -- when people like that come up to you with lots of energy and lots of ideas, you should listen even if they're not making sense, because it might be the next Ken Shan!

Sasha Rush:Yeah. No, it certainly jibes with my experience as a professor with others throughout the field. Yeah. No, I think in some sense actually, I feel that way about all of these different areas: one of the amazing parts about the popularization of deep learning is that taking really deep ideas from other areas and simply translating them to the deep learning world has been really useful and fun to do. When I read some of these blog posts that some of these other people write, they're so much more impressive. But if you can do half as good a job, but do it on a topic that a lot of people are trying to learn right now, that can be a valuable service, I think.

Chris Potts:Oh yeah. That's what brought me to the Annotated Transformer. It was just a better way to read the paper. How did it begin for you? Did you set out to write a blog post or was it initially just notes you were taking, and then code you were sketching, and then pretty soon this vision?

Sasha Rush:Yeah. I often find actually that's the way I write a blog post is I'll work on something really hard and spend a lot of time really working on it, and then no one will really care, and then I'll just take out the core part that I think is the most interesting or most simple and try to polish that, and that's I think an easier way to get people to try it out or dive in.

In this particular case, it was literally code that I was writing for OpenNMT, but there was so much cruft about managing models or training or all those things. I really just wanted to take out the core part that was the paper, and the difficult part of implementing that, and solidify it.

Chris Potts:In the spirit of literate programming, did you adjust the code at all to tell the story of the paper? Or, did all of that just fall together naturally, because the network is so modular or something?

Sasha Rush:Yeah. I certainly did have to adjust the code to be in the order of the presentation of the paper itself. But certainly, the advent of PyTorch as a way of structuring systems and pulling out, say, parameterization from the actual functional structure of the network really helped out in that process. I think, in particular, I read a lot of the code that was released with the paper itself and it was unfortunately written in a style that was quite connected to the Google infrastructure and very difficult to read. So, while it was quite powerful and allowed them to scale, I think a lot of people struggled to pull out the details from that code base.

The paper itself actually, I think, is quite clear, but maybe assumes a lot of NLP knowledge that was contextually relevant at that time, and so I think, particularly for people who didn't come in really understanding those ideas, it was maybe a bit more difficult to read. I generally like the paper and people criticize it, but I think for a paper that's been read by almost everyone in ML these days, it does a relatively good job getting across the main ideas.

Chris Potts:Oh yeah. Actually, I do think it's clear as a paper. I actually think there are parts of it that are quite strikingly clear. The hard part for me is that they don't really knit together all the pieces, so you get this flow of things. And then there's a diagram, and actually, the diagram initially left me quite confused, because I think I didn't know what level of abstraction it was presented at, and so it was actually probably only in going through your Annotated Transformer and seeing the code that I actually saw what the units were and how they fit together and stuff. And then you read the paper and you're like, "Oh yeah. No, this is wonderfully expressed," and you have a mental model of all the pieces.

Sasha Rush:Yeah, one reaction I've gotten from the blog post is people should just write papers like this.

Chris Potts:Yes!

Sasha Rush:I partially agree, but I partially disagree. I've seen some papers that interleave, I would say, imperative code with math, and I'm not sure I totally buy it. Code goes out of date relatively quickly, so certain things change, and it also pulls library decisions into the main structure. I actually maybe take a relatively conservative view that math is pretty good and that maybe in an appendix or something, you can have code, but it's not necessary to have it in the paper itself.

Chris Potts:That's a wonderful point. I had a note to ask you about that, though, because there are clearly tradeoffs. I don't know. I feel like we, the people I work with, spend a lot of time working on taking things that are already expressed in code and figuring out how to express them in math, so to speak, and the notational part of that is really hard. Then, for the reader, there's a burden of having to then figure out through this pipe of math what our code would look like or what we meant. I sometimes feel, "Hey, wouldn't it be better to just present people with the code that's really what we're all dealing with anyway?" But your point about it falling out of date is well-taken.

Sasha Rush:Yeah. I guess, probably, as it was clear from my previous comment, I do like a declarative way of specifying things. I guess I come from a background of working on like parsing or graphical models or areas where there is a non-ambiguous structure that describes how things work, but is maybe not directly mappable to imperative code. I do find that to be an elegant way of describing things, although, in a world of neural networks, obviously less important or less relevant.

Chris Potts:Another thing you might have tipped me over on -- I'm trying to remember what the timeline for this is, but certainly for me, the experience of using TensorFlow -- which is the first deep learning library that I used a lot -- and using PyTorch, those are fundamentally different things, where with TensorFlow, at the time I was using it, I felt quite constrained, like I was being limited in what I could say, whereas with PyTorch, it fluidly mixed the Python and the rest, and that made it much easier for me to think of this code as a direct expression of my ideas, which might indicate a bias that I think in terms of code and not the underlying math. Did you feel that as well as you moved into PyTorch?

Sasha Rush:Absolutely. Honestly, one of the reasons why I'm so interested in this blogging or code writing is really the experience of first writing code in PyTorch and feeling like it was the first time I could really communicate ideas directly, in a way that felt as comfortable as writing the math.

My background was that I was a grad student at Columbia, I worked a bit at Google, I was writing very large-scale translation and parsing systems in low-level C++ code. Then, in 2014 I was a postdoc at Facebook and I actually sat next to Soumith Chintala, who was the lead developer of Lua Torch, which was the foundational library before PyTorch came out. It was a wake-up experience, I think, when you go for the first time from a lab you feel very comfortable in to one where you feel like an outsider, and everyone's doing things an entirely different way -- everyone had moved to this deep learning framework universe. For a year, I struggled to figure out what the heck was going on and rewire my brain to think about things in terms of auto-differentiation and tensors and things like that. Lua Torch wasn't really there yet. It hadn't really hit that feeling of, I could actually express mathematically what I'm trying to do, but PyTorch really did it for me, yeah.

Chris Potts:I teach a large natural language understanding course and there was a period where we talked a lot about tree-structured neural networks. The TensorFlow expression of those networks at that time was just really difficult. It basically had to be a black box that students would use and they would do optimization things but not rewire it. Then, when we switched to PyTorch, you could just express those things in exactly the way that you would think about any tree or any parsing algorithm and you saw the recursion and everything, you saw the data structures, and that just really opened doors for students and for me in terms of thinking about those networks.

Sasha Rush:Yeah, yeah, absolutely. I think, for me, it was writing beam search, and particularly trying to back propagate through beam search was something we always wanted to do and it was extraordinarily hard to do before libraries like PyTorch. But I should also give credit to related libraries like, I'm blanking on the name right now, but Graham Neubig and Chris Dyer's library predated PyTorch and had a lot of these graph structure elements as well.

Chris Potts:That's right. Yeah, you could do the same things that you could do in PyTorch in that library. I confess that I also forget the name, but I remember seeing their implementation of the tree-structured tensor networks and it, again, just looked exactly like you'd want it to look if you were thinking about implementing it with no guardrails on.

The flip side was, and I think that this remains true to this day, you might disagree, but TensorFlow was more trustworthy if you wanted to deploy something, because it did handle a lot of the gotchas around the optimization loop that are still there in PyTorch as far as I can tell. The fiddly details about how you update various data structures were all behind the scenes for TensorFlow.

Sasha Rush:Interesting. Yeah, actually I never really did too much programming in TensorFlow, although I do use JAX these days more and more. I'm not sure if that solves some of the problems you're describing, but it does have some nice benefits as a different system.

I should just say, the library we were referring to is called DyNet, and I think that a lot of people really love that library as well.

Chris Potts:What's pushing you to use JAX these days?

Sasha Rush:Being honest, I really love the team. Matt Johnson has always been someone who's really helped me learn a lot about these topics and interact with a bunch of the other folks who work on it. I think they are introducing a lot of really new ideas that are pushing the field forward. Things like vmap are extraordinarily influential, and being able to do higher-order differentiation and things seamlessly is quite nice. I've also been interested in neural networks that require some more efficient computation, and the fact that it has a JIT step has been very helpful for some of the recent projects I've worked on. That's nice as well. That being said though, I'm quite happy with PyTorch and I'm glad that it's really caught on, at least outside of Google.
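For readers following along in code, here is a minimal sketch of the JAX features Sasha mentions -- vmap, composable differentiation, and JIT compilation. The toy loss function and the shapes are made up for illustration; only jax.vmap, jax.grad, and jax.jit come from the library itself.

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    # a toy per-example scalar loss (illustrative only)
    return jnp.sum((x @ w) ** 2)

w = jnp.ones((3, 2))
xs = jnp.ones((5, 3))  # a batch of five examples

# vmap: write the single-example function, get the batched version for free
batched_loss = jax.vmap(loss, in_axes=(None, 0))
print(batched_loss(w, xs).shape)  # (5,)

# higher-order differentiation by composing grad with itself
second_deriv = jax.grad(jax.grad(lambda s: s ** 3.0))
print(second_deriv(2.0))  # 12.0

# jit: trace the batched function once and compile it with XLA
fast_loss = jax.jit(batched_loss)
print(fast_loss(w, xs))
```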

Chris Potts:The other pedagogical thing that I wanted to thank you for and get your views on is the Tensor Puzzles that you created and also the GPU Puzzles. For the Tensor Puzzles, I guess I want to thank you, because I had a long plane flight and I just set myself the task of doing these things on the plane flight and it ate up the entire time. It was like I looked up and suddenly we were landing. I found them so absorbing and, in places, quite challenging.

For the Tensor Puzzles in particular -- what's the origin of that? In particular, I'm interested to know what the guiding intellectual insight is that lets you create this set of puzzles, and figure out what the space of puzzles could be.

Sasha Rush:Yeah. The boring answer is that I taught a large undergraduate machine learning class and I graded a lot of homeworks and it was disturbing how many of the mistakes that students were making were simply permuting dimensions of tensors. It made me realize, watching people actually program in these libraries, that people didn't really understand broadcasting at all and would oftentimes just brute-force all the different combinations and hope that things matched up. That's really bad. There are many degrees of symmetry that can lead to mistakes on these problems, so I wanted to convey in a fun way how you might learn about this process and try it out.

Based on my previous answer that sometimes you just do the hard work and pull out the fun part, this is a part of a textbook I've been writing called "MiniTorch" that has people implement PyTorch from scratch at each step, and I just took out one of the homeworks and made it a standalone puzzle.

The more interesting answer is that, I mentioned the team behind JAX. They have another programming language they work on called Dex, which is led by Adam Paszke and Dougal Maclaurin. Dex is a Haskell-like language that basically has tensors as its only data structure. It blew my mind writing Dex code for the first time -- everything you could do just by broadcasting or just by these tensor things. I wanted to communicate that insight that came to me from trying out this niche avant garde programming language to people just doing PyTorch and just trying to convey that intellectual, "Oh! Wow! That's amazing," in a language people might be more used to.
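To give a flavor of that broadcasting-only style, here is a small sketch in PyTorch (these functions are illustrative, not the puzzles themselves): arange, comparison, and matrix multiply over broadcast shapes are enough to rebuild several familiar operations.

```python
import torch

def outer(a, b):
    # outer product via broadcasting: (n, 1) * (1, m) -> (n, m)
    return a[:, None] * b[None, :]

def eye(n):
    # identity matrix from a comparison of two broadcast aranges
    r = torch.arange(n)
    return (r[:, None] == r[None, :]).float()

def cumsum(a):
    # cumulative sum as a matmul with a lower-triangular mask
    n = a.shape[0]
    r = torch.arange(n)
    mask = (r[None, :] <= r[:, None]).float()  # (n, n) lower triangle
    return mask @ a

print(outer(torch.tensor([1., 2.]), torch.tensor([3., 4., 5.])))
print(eye(3))
print(cumsum(torch.tensor([1., 2., 3.])))  # tensor([1., 3., 6.])
```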

Chris Potts:As you do the puzzles, you go through this intellectual journey of like, "Okay, I can see some of these things are linear functions and I have an intuition that this should be doable," and then, there are some that ask you to change the size of some data structures and at first you're like, "Well, this is probably just impossible. Is this a mistake?", then it dawns on you that it can be done. It's just really eye opening how much can be expressed with so few primitives.

Sasha Rush:Yeah, it's really neat.

Chris Potts:I think, in addition to being a good mental exercise, like doing crosswords or something, it genuinely did teach me a bunch of things about the power of those broadcasting operators.

Sasha Rush:That's awesome to hear. The other thing I'll mention is another side project that I've been interested in over the last couple of years, related to this challenge of teaching people to program in tensor languages for the first time: this project on Named Tensors, of trying to get programming languages or even mathematical notation to give names to the dimensions, with the goal of turning these things that look like they're puzzles and look like they're crosswords into just something a little bit more intuitive.

My goal, at the end of the day, is not to have everyone in the world know how to write cryptic one-line tensor code, but to motivate the type of thinking that makes it maybe a little less concise but gets across the same idea. One project actually that came out of the PyTorch team recently is called Torch Dim and it's a new named-tensor-type library, and the person who wrote the library actually re-implemented a bunch of these puzzles in a way that's a little bit more -- how do I put it? -- readable.

Chris Potts:Oh, fascinating.

Sasha Rush:It's feeding back on itself as a way of trying to make this a little easier in the future.

Chris Potts:The named tensor thing, that's a vision that you had and that has matured into an aspect of PyTorch now. Did that already begin with the students making all these mistakes and stuff?

Sasha Rush:Yeah. That came out of the same motivation of just trying to get people to write code in a less error-prone way, break some of the symmetries that come up in deep learning. I think also trying to get at some of the issues you described of trying to translate papers to code.

Going back to the Transformer paper, there's a very famous formula in that paper for multi-head attention that's expressed in linear algebra form. It has many degrees of symmetry and I've seen every mistake made of permuting tensors in the wrong direction, and so I tend to use that as an example of a really famous formula that, when translating from math to code, is extremely important to get right, and any error checking or help you could get would be better, I would say.
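As a concrete reference point, here is a minimal sketch of that formula, softmax(QK^T / sqrt(d_k))V, with the tensor shapes spelled out in comments -- the shapes are exactly where the permutation mistakes happen. The batch, head, and sequence sizes below are arbitrary choices for the example.

```python
import math
import torch

def attention(Q, K, V):
    # Q: (batch, heads, queries, d_k)
    # K: (batch, heads, keys,    d_k)
    # V: (batch, heads, keys,    d_v)
    d_k = Q.shape[-1]
    # scores: (batch, heads, queries, keys); transpose only the last two dims of K
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = scores.softmax(dim=-1)  # normalize over the keys dimension
    return weights @ V                # (batch, heads, queries, d_v)

Q = torch.randn(2, 8, 5, 64)
K = torch.randn(2, 8, 7, 64)
V = torch.randn(2, 8, 7, 64)
print(attention(Q, K, V).shape)       # torch.Size([2, 8, 5, 64])
```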

Chris Potts:Right. Yeah. The named tensor thing is something that I, and I think many people, have done in their mathematical notation for a long time, and then, in retrospect, it's puzzling that people didn't just build that into the code itself, yeah.

That's funny about student code. I think you see traces of students being confused and being on this weird journey when they have done four or five extra transposes of the data structures and that's what they submit and you think, "Oh, I see what happened. Yes, they iterated through until it finally clicked," and they probably submitted the one that just ran.

Sasha Rush:What's funny is half the time they get it right and then they're like, "Oh, I'm so smart, I did it." It doesn't fail under the test, and so they're not getting any feedback there, that this was all repetitive or redundant.

Chris Potts:I have to look back at my own Tensor Puzzle solutions to make sure I don't have lots of this now. That would not surprise me at all, because certainly that was part of my own journey there.

Sasha Rush:Yeah!

Chris Potts:The GPU Puzzles, also wonderful. I'm almost through those. I couldn't do them on the plane because I don't have a laptop with a GPU. They have been really instructive for me in terms of finally pushing me to think about what's happening on the GPU. What's the origin story there? Is it also in the classroom that you felt like you needed to teach people what was happening?

Sasha Rush:Yeah, it also comes from the same underlying class on machine learning engineering behind the scenes. I think what always struck me about GPU programming is that it's actually not so challenging in terms of, there aren't so many operations, it doesn't look so different than standard code writing, but it is a little bit mind warping, particularly when you're dealing with shared memory.

The particular insight -- one of these things that I think everyone who does machine learning should know -- is just why matrix multiplication is so fast. I feel like that's one of those things that is just ... I don't know, it's one of the most important facts about the world right now: this particular operation can be done extremely efficiently and continues to become more and more efficient as the world goes forward.

Just as an intellectual, I don't know, rite of passage, anyone using these models should just understand how matrix multiply actually works. It's not a particularly hard thing, but it does require just having some intuition of shared memory and how that algorithm is actually implemented.
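Here is a rough sketch of the idea in plain PyTorch: a blocked (tiled) matrix multiply. On a GPU, each tile would be staged in shared memory and reused by a whole block of threads; the loop structure below just shows where that reuse comes from. The tile size and matrix shapes are arbitrary choices for the example.

```python
import torch

def blocked_matmul(A, B, tile=4):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = torch.zeros(n, m)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            # accumulate one (tile x tile) output block
            for p in range(0, k, tile):
                # each loaded tile of A and B is reused across a whole block of
                # outputs -- this reuse is what shared memory buys on a GPU
                A_tile = A[i:i + tile, p:p + tile]
                B_tile = B[p:p + tile, j:j + tile]
                C[i:i + tile, j:j + tile] += A_tile @ B_tile
    return C

A, B = torch.randn(8, 8), torch.randn(8, 8)
print(torch.allclose(blocked_matmul(A, B), A @ B, atol=1e-5))  # True
```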

I think it's particularly relevant in terms of Transformers because, again, I often go to talks now and people anthropomorphize Transformers. They talk about how smart they are and how they use attention and attention mimics all these properties of human intuition and things like that. Part of me is like, "Maybe that's true. I can't falsify it," but it would be a real coincidence if the one architecture that is the fastest to run on hardware is also the smartest architecture. I think it's an important thing for people to understand the context: this is the thing we can run really fast and scales really, really well, and so we shouldn't assume that it also has some other magical properties. It might just be extremely convenient, if that makes sense.

Chris Potts:I see. So your view is that it's a historical accident that we ended up in a world with GPUs with these particular properties and therefore, the Transformer has found its niche in that ecosystem. It's not an inevitable thing about computation or technology.

Sasha Rush:Oh, well, it might be an inevitable thing. I don't actually know which way the causality runs. I guess, like Sarah Hooker's paper about the hardware lottery. They're quite honest in the Transformer paper that they chose this particular form of attention because it's the most efficient to run, and other people have noted how easy it is to scale this model. It seems like a totally fine intellectual conclusion that this is the convergence of GPUs and a model that fits really nicely for scaling, without making any claim about the particular cleverness of this architecture for, say, natural language processing or even human thought more broadly.

Chris Potts:I guess what I was pushing you on is just this idea that maybe, if you just describe it as capacity to do a whole lot of simple parallel operations and very flexibly pool the output of those operations into coherent pieces, that actually might be wonderful for survival in our universe, and therefore, this is just a reflection of the fact that pervasively that's a good way to get along, and so it's not an accident.

Sasha Rush:Oh, yeah. That's an interesting hypothesis. I don't totally know how you would falsify that but it sounds interesting.

Chris Potts:One question I was going to ask you later, but I'll ask it now: is the world we've ended up in, in terms of ML toolkits, biased against structured prediction? Is there a counterfactual where the fastest thing to implement is something with a lot of rich structure -- not a bunch of matrix multiplication but something else -- and therefore the Transformer has no hope of getting started because it's just too slow?

Sasha Rush:Just to provide some context: My dissertation and some of the work in my group has been focused on structured prediction, particularly models like Hidden Markov Models or context-free grammars. It's an area that I still am quite interested in, almost as a hobby, and it's another reason I'm interested in low-level GPU programming, because making other sorts of models besides neural networks efficient on low-level hardware is a challenging problem.

The term "bias" is interesting, because I don't think hardware providers -- they're not biased against certain types of research. Let's see. How do I think about this? I think sparse models, or models that really just are far away from dense matrix multiplication, are very hard to utilize these days, to the point that I tend not to even think about how to make them work on modern hardware.

Structured prediction is an interesting middle case where it actually can be written in a very dense way. You can map it to a lot of matrix multiplies. And so, therefore, you might imagine that it could run efficiently on modern hardware. It still is not as efficient, but that might just be a matter of working through various challenges.
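As a small illustration of that dense formulation (a sketch, not code from any particular library), here is the HMM forward algorithm written as a chain of matrix multiplies in PyTorch; the sizes are arbitrary, and in practice you would work in log space for numerical stability.

```python
import torch

def hmm_forward(init, trans, emit, obs):
    # init:  (S,)    initial state distribution
    # trans: (S, S)  transition matrix, trans[i, j] = p(next=j | cur=i)
    # emit:  (S, V)  emission matrix,   emit[s, o]  = p(obs=o | state=s)
    # obs:   list of T observed symbol ids
    alpha = init * emit[:, obs[0]]     # (S,)
    for o in obs[1:]:
        # one step of the forward recursion is just a vector-matrix multiply
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()                  # probability of the observation sequence

S, V = 3, 5
init = torch.softmax(torch.randn(S), dim=0)
trans = torch.softmax(torch.randn(S, S), dim=1)
emit = torch.softmax(torch.randn(S, V), dim=1)
print(hmm_forward(init, trans, emit, [0, 3, 1, 4]))
```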

The bigger issue is just that we don't currently have an application that really justifies all the extra work. Transformers have become so good at supervised or fine-tuned or even zero-shot versions of these tasks that it's just not super clear what benefit you get from having a clever model. That being said, I do think there are a lot of other related areas that we maybe have less data for or less strong pre-trained models where having structured prediction, plus auto differentiation, plus GPUs, is a useful extension of the toolkit. If I had to make a huge bet, I wouldn't say they're going to change the mainstream NLP in the next couple of years.

Chris Potts:What about outside NLP? It looks like you found some applications for these ideas quite recently in, maybe, domains that are a little bit more structured and systematic than natural language, dare I say? Is that right?

Sasha Rush:Yeah. I shouldn't say that I've done this personally, but I have collaborators who work in areas like biology or chemistry who utilize these approaches. Yeah. I think it's extraordinarily interesting. I think there's some people in NLP like Jacob Andreas's group or people in programming languages like Kevin Ellis at Cornell, who are doing some really interesting things that intersect structured prediction with deep learning. Kevin Ellis' stuff is about inducing programming languages. Jacob Andreas is very interested in compositionality and structure within neural networks, doing some really interesting things with structured prediction.

Chris Potts:Cool. Let me ask one more question about programming, because you have made really foundational contributions to a bunch of libraries that all of us use, both in terms of the actual code, like OpenNMT, and also these vision statements with the named tensors and stuff. How do you, in your career and in your life, balance prototyping those things and getting the word out about them versus really the production-ready aspects of coding?

Sasha Rush:Yeah. There are some parts of my career that I work hard on and struggle with and push forward, like writing papers, working with students, teaching classes. This library and blogging stuff, it's like that part of your brain that just has to do it, and it almost always gets in the way of grant-writing or working on projects. Every year or something, I have some project I just get really excited about and I just take a whole chunk of time and do it. I wish I could control it, but it's really just a pathology. I spent all summer working on a graphics library just for fun. I don't know why I did it, but it was very interesting. I can't say there's any method to this madness.

In terms of history, I think that when I was a grad student, the way that libraries worked was that they were extraordinarily complex and they took dozens of people to write. They were international collaborations among many scientists. In about 2013, 2014, when deep learning first hit, there was a brief period of time, for about three or four years, where you could get a state-of-the-art library in 500 lines of well-written code. That was really striking to me, that you could take something that was maybe a couple hours more than research code, and it would become really good and really solid. That lasted, I would say, from about 2014 to about 2018, and so it was very fun during that period of time to be able to make production systems.

I would not recommend that a graduate student these days try to make production code. There is a whole career path now of deep learning engineer. These people are fantastic. They are really good software engineers and they really understand how to build systems, and that's great. It's wonderful that that exists and we get to use their work in their way. I don't really try to compete with them.

I would say, at the moment, you probably shouldn't try to write production-ready code, but who knows? It's a very cyclic deal. If a new idea comes out -- we're seeing things like diffusion come out now and people can build awesome things with very little code and that's really exciting. You should leap on it when you have a chance, I think, yeah.

Chris Potts:It's interesting that you mentioned those different career paths. That leads me into this next question which is, are we, as a community, doing enough to appropriately value contributions in terms of libraries and things? There could be nuance there because maybe there are other communities that are going to spring up that really do just, first-order, value the engineering aspect of this. What's your view about NLP in particular? Are we appropriately valuing code contributions?

Sasha Rush:Yeah. It's a good question. I think we could do more, but I do think there are some nice ways that the NLP community has made this a possible academic path. I've utilized the *CL demo track, which has become more prominent in recent years as a way to submit artifacts that are open-source code contributions. I think that's been nice and it's been helpful for me.

In terms of being early-career faculty, I found that when I was applying for jobs and asking about tenure-track things, people did discourage me from doing things that weren't directly academic.

One of the reasons I like Cornell Tech's process -- Cornell Tech is the university I work at -- is that they explicitly have, as part of your tenure process, an external engagement aspect, and they actually get letters from people who utilize, say, your open source tools or things, and that's made it a little bit more clear and part of the process itself. I've also used things like classes or other educational things as a way of justifying open source contributions -- using things like open source projects I developed for pedagogical reasons as part of my teaching statements or things.

Chris Potts:Oh, that's really cool.

Sasha Rush:Yeah, which is nice.

Chris Potts:Yeah. My own perspective is that, in the span of my career, we went from really under-appreciating contributions of data sets to kind of appropriately valuing them. I think that was a concerted effort, to have best paper awards specifically for data sets and things like that, and also just a shifting norm around really making sure to cite the origin of the data set and so forth.

Then, there was this whole new category of things -- pre-trained models -- which I guess really began with word2vec and GloVe, and people I think saw instantly that that was a way to have real impact and get a lot of attention, because you were empowering people directly with these parameters, kind of leveling up from data or complementing data.

I worry a little bit about these code contributions. I feel like people are better at citing the Hugging Face transformers library and citing scikit-learn than they were 10, 12 years ago. But I still worry that people slip into taking it for granted as part of the background noise. If I looked at my own papers, I feel like I've been good about citing transformers, but maybe not good enough about citing PyTorch, and since citations are the currency of our land, not doing that can be really meaningful in aggregate.

Sasha Rush:Yeah. I think that's a fair point. I would say that the bigger challenge beyond citations might be making these systems more sustainable community-wise. Hugging Face is a startup, but it's building a lot of these tools. It's wonderful while we have them and have them available. PyTorch is run within Facebook, which is great, but there's always this question of what's going to keep these going. There are actually, I think, probably not that many people working day to day on these projects, given their value to, I would say, the whole world.

Chris Potts:Speaking of Hugging Face, what's the story about your involvement there? You must be a pretty early employee, right, because you joined all the way back in 2019, which is ancient history by now.

Sasha Rush:Yeah. It is wild how quickly things move outside of academia. Three years is a really long time in startup world. Yeah, I joined Hugging Face in 2019. There were about maybe 10, 15 people when I started. There are now several hundred folks there and they've expanded from NLP to all sorts of ML applications. I think I went to Hugging Face thinking I could help them out and really I've just learned a ton from them. They're really an amazing team. They just have this mindset of making impactful tools and research for the ML community really broadly defined. It could be open source contributions, it could be research papers, it could be training models, and it normally is all of the above.

There are a couple interesting things about Hugging Face. One is their organizational model: it's almost completely distributed, and so it manifests itself as just a giant Slack channel with everyone and anything going on simultaneously. Then, I think one of their main contributions has been these open source libraries, transformers, but also datasets and the evaluation library and now the diffusion library. These are all tools that I've used and appreciate.

The other thing that's been extraordinarily interesting is that over the last year they ran a collaborative open-source project called Big Science, where they had about a thousand people working on this from all around the world. It was neat to be a participant of that. We worked a lot on training models for different NLP tasks and the culminating artifact of that was this model called BLOOM, which is a 176-billion-parameter language model.

Man, I just think that training a 176-billion-parameter language model might just be outside the scope of something academics can do, but it's really important that it not just be something large companies can do, and so having this intermediate organization that can do the organizing and pull together the people with expertise to do that was really fascinating, and it was beyond even my expertise on, say, engineering. There's a nice blog post that goes through all the challenges of actually training one of these things and coordinating all the libraries necessary to make it stable and make it run. That's an interesting project.

Chris Potts:Very interesting, yeah. What's your day-to-day role like there? Do you mentor teams or try to set a vision or something in between?

Sasha Rush:Yeah. That's a good question. Mostly, I would say, work with teams. I think "mentor" is maybe a little extreme. I tend to find that my biggest value-add is to help organize the presentation of work. I will sit in, consult on when we write a paper, what to write about, what to include, what experiments to run, how to organize that, and act as, I would say, an intermediary between the open source part of the organization and the research part of the organization.

Chris Potts:That might be mentoring, Sasha. If that's not mentoring, I'm not sure what I'm doing.

Sasha Rush:I see -- fair enough, yeah. That sounds like mentoring.

Chris Potts:Mentoring actually in the style that I most admire. It sounds like you're there to nudge and to maybe make people aware of things, and then of course crucially to figure out what exactly should be in the paper, which I feel is a huge part of my job as a mentor.

That's cool. I want to return also to the point that you made about training large models, really large models, in academia. What's the limiting factor there? I'm absolutely positive that we have the talent or could cultivate the talent. That's not the blocker. What's the blocker to us just creating these artifacts?

Sasha Rush:Yeah, yeah. No, absolutely. Yeah, I'm actually not totally sure. For this project, we wrote a proposal to the French government to get access to a large amount of compute, so I think that's probably one aspect. It was a lot of money, not an intractable amount of money, but we had full access to this large French supercomputer for about a year.

In terms of people, it did require a lot of folks to get together and make decisions and plot out what model to train, how conservative or how aggressive to be about certain aspects of the project itself. There was a huge group to collect data, go through the data and ensure various properties of it, and teams to work through some of the ethical questions and multilingual questions of the data itself.

Yeah. I think you're absolutely right -- there's no inherent reason why this couldn't be done in academia. I guess the blockers were more just getting these organizations together, collecting compute, that kind of thing. I do know that there are some efforts, including one at Stanford, to do things like this, but I don't think they've trained anything at the scale of BLOOM yet.

Chris Potts:No, they've focused more on other aspects of this, like how we would benchmark. That's a current focus of that CRFM group. They also did a lot of training of GPT-2 sized models to give people a really rich picture of how the learning dynamics evolve and how different training runs vary with regard to high-level questions like social bias and performance on benchmarks and things.

Early on in CRFM, there was a goal to basically just create a GPT-3, and I think that everyone realized that that wasn't the best use of their time, but it's a little puzzling about why that wasn't. I think part of it is incentives and maybe those will change as we get more people who are doing systems plus ML plus AI for their PhDs. Right now, no one is really in that position, and so they shouldn't be spending all of their PhD time tending the machine in the way that you need to to train one of those large models.

Sasha Rush:Yeah. I'm curious to hear though, do you think it is the right use of time? Given that we see a bunch of organizations doing this thing, do you think it would've been better if they had tried that?

Chris Potts:It's an interesting question. I think when only OpenAI had these really massive models, my answer was, someone has to counter this and it would be wonderful if it was an academic institution. But now, it seems like we've gotten over that hurdle, and we have a bunch of these artifacts and enough of them that it feels like a little bit of an ecosystem. Things that Eleuther has done and BLOOM. There are a bunch of these efforts now.

Sasha Rush:Yeah, OPT from Facebook.

Chris Potts:OPT, oh that's another big one. They were very open about that.

Sasha Rush:Every NLP talk now starts with the graph of the log scale of parameters moving. There is a question of whether this just will keep on moving to the right. I'm sure that there are more data points that are going to be added to that graph as we go in the next couple years, and it might be interesting to just think of how we sustain continuing to keep up with PaLM and PaLM 2 and everything that might come out.

Chris Potts:Right. PaLM is not open, so none of us, or very few of us, really know what that thing is like.

Sasha Rush:Yeah. I don't know what it's like, but I've seen enough examples now that there seemingly are behaviors that are noticeably more wild even than GPT-3, so the question is, every year does some organization need to keep on training the next bigger model to keep up with that, if that gets larger?

Chris Potts:There's a few axes to this though. One is the parameter count and the other is just the amount of qualitatively different data that they're trained on. I don't know much about what PaLM was trained on. Have those details been released? We do know that even models like GPT-3 are undertrained relative to what the parameters might have the capacity to deal with.

Sasha Rush:Yeah, absolutely, yeah. While that's true, I think that even adds a whole nother axis to the replication process of having to both produce large scale data sets of super high quality and also get compute to do them. These are all time-consuming and man-hour consuming parts of the process.

Chris Potts:And are going to raise a bunch of societal issues that I think we're just now starting to confront around who owns that data and who has the right to it, and that'll happen across lots of different domains, and it might be yet another limiting factor on what kinds of artifacts we can create.

Sasha Rush:Yeah, absolutely. I don't want to take any credit for this because I wasn't really involved, but one thing that was very interesting about the Big Science BLOOM project is that they did get together a lot of folks from different facets of AI to discuss these questions and make these decisions in an open and replicable manner, and so you can go online and read about how at least they did that process. It's fascinating to see that in practice.

Chris Potts:Sometimes when I can't fall asleep at night, I feel like I catastrophize. All the small things in my life that worry me turn into very big worries. I confess that, the other night, the one that was keeping me up was, what is GPT-4 going to be like? I was actually really feeling some kind of existential dread. What if this thing is qualitatively different? I haven't experienced that. I was already weirded out by GPT-3. What if this is, yet again, a kind of entity in my universe that really causes me to rethink things? Am I blowing this out of proportion or do you also worry about this? Or, if it's not 4, then what about GPT-10?

Sasha Rush:Yeah. Holding off for now the societal issues of these models, it's an interesting problem to have. We could be in a field where nothing changes. But it is startling to think just how much NLP and the qualitative behavior of these models has changed in such a short period of time. It really has challenged a lot of my priors in, I would say, interesting, if not sometimes disturbing, ways.

Chris Potts:Because just set aside all of the knotty philosophical questions around whether such a thing could be conscious or truly intelligent or understand. Just set all of those aside and just think about how people, non-specialists, react to these entities and the reactions that people have had to GPT-3 and then think, if GPT-4 is qualitatively different, what is that societal reaction going to be like? Because whatever you think about the understanding question, there is a continuum of behavior and it will look like GPT-4 is kind of like a three or four-year-old child learning language and grappling with the complexity of the world. Whether or not it's actually on that same continuum toward intelligence is irrelevant to how people are going to think about that entity, and things could get even weirder than they currently are.

Sasha Rush:Yeah. I think that's right. I do find that, as someone who thinks about language and studies language, to me that is the craziest thing and ties into the Turing Test and all these notions of how we think about things. Practically though, I've been actually surprised by how much stronger the reaction has been to some of the image generation and coming video generation work, where, for whatever reason, that viscerally hits people in a more striking way, at least in the short term. I just think in the next couple of years, video-from-text generation is going to be pretty upending. Particularly photo-realistic videos from arbitrary textual descriptions are going to, I think, really change things. I don't know. I still find myself having trouble telling myself this is not real when I see a video.

Chris Potts:I actually don't understand why we haven't already had a catastrophic series of world events that began with a hoax video or a set of images that led to a reaction that led to a reaction. Maybe it has already happened and we just don't know, but it's certainly in our future.

We know that text can do this as well because some of the early scary stories of AI run amok are around a news story getting posted without a date, a news aggregator picks it up as a new piece of information, and then high-frequency trading systems react to this and someone's stock plummets or the Dow plummets. That's a glimpse of how disruptive something false can be, and surely that's just going to grow and grow and grow.

Yeah, but I share your concern that image and video are where this is actually going to become really something we have to grapple with as a society.

It feels almost comical to ask this, but I'd be remiss not to ask, at this point, about your bet: is attention all you need? You must contain multitudes, because you took the "No" side of this bet. You think something else is necessary, but you work at Hugging Face, which is best known for promoting exactly this architecture. Of course, your own research, a lot of it is in structured prediction, which also feels somewhat opposed to or different from the Transformer architecture. But yeah, what's your view right now of this bet? It seems like you're going to win.

Sasha Rush:Well, let's see. Let's see. I had -- have -- a bet about whether Transformer architectures will be continually dominating the field on January 1, 2027. It's a bet with Jonathan Frankle, who's a professor at Harvard. I took the "No" side, betting that some other architecture would arise, and he took the positive side. If I go to the website now, I see I have 1500 days for this bet to come about. It's based on these long-term bets that people have made in the past in academia.

I made the bet because I was interested in this model called S4, which is a state space model that has a very different architecture than the Transformer. To me, it looks closer to a structured prediction model, but you can also make an argument that it's closer in spirit to an RNN. It was performing quite well on various tasks, and I thought it would be interesting to see if this type of model -- or not this type of model, but some other qualitatively different architecture -- would dominate.

The other inspiration for the bet was I was writing a talk for a student conference called "What has Changed about Transformers in 5 Years?" I went back to the Annotated Transformer and other things I had looked at in 2017 and asked, "What really did someone need to know in the last five years?" The conclusion of that talk was actually pretty pessimistic -- that really nothing important had changed in the last five years, that all the innovations had been in datasets or pre-training or various extensions, but nothing about the core structure had really changed in a way that you needed to know. You could roughly take code written in 2017 and it would be pretty similar to how it actually is now. The question was: would that be true in 5 years from now?

I think I'm going to lose this bet.

Chris Potts:Oh!

Sasha Rush:It's my guess. It does feel like, to train very, very large models, you have to be pretty conservative. You have to plan ahead multiple years in the future, you have to collect the compute, you have to build out the systems, and so therefore, my guess is that people will not risk too many major, major changes. Even if there does exist some architecture that could be there, I think there are a lot of forces that say that that won't push forward.

That being said, I still hold out hope. You're starting to see some other various architectures that are coming around. I don't know. We might see something different in the future. There's so many people working in this space. I was hoping that someone would come up with something very new.

Chris Potts:Sure. What about diffusion?

Sasha Rush:Well, it's a good question.

Chris Potts:That would count in terms of being different enough from the Transformer? So many aspects of that are different.

Sasha Rush:Yeah, good question.

Chris Potts:Okay.

Sasha Rush:Let's say we take a model like Imagen. Imagen is this incredible model for diffusion. It's still mostly T5 and then the diffusion model on top of it. I think there's a different architecture, but I don't know if that totally takes its place.

Chris Potts:I see. Because the bet is stated in terms of architectural choices and not the way the representations themselves are learned, and so you feel: if it's got that dot-product attention, then you lose the bet?

Sasha Rush:Well, we'd have to figure it out, but let's see. How do I think about diffusion? I'm not sure diffusion and Transformers are equivalent things. If you have a diffusion model that's mostly doing self-attention with Transformer blocks and the loss function looks like diffusion, I'm not sure if that counts or not. I'll have to talk to Jonathan about whether that's true, yeah.

Chris Potts:Well, this is one way you could lose the bet -- just by being overly generous!

Sasha Rush:That's true!

Chris Potts:There's another dimension though to what you said that I find really interesting, which is that you're implicitly or directly assuming that the scale thing is going to dominate and therefore, the price tag is going to remain enormous, and that's going to be a tax on exploration. And so, because of these incidental practical details, you'll lose the bet. But if the price tag goes down or the sizes come down, then there'll be much more room for exploring and then you could win, because someone could very quickly and cheaply train the next best thing.

Sasha Rush:Absolutely. No, no, totally, this is as much a bet about the progress of science and the sociology of science as it is about natural language processing, yeah. I think all these factors do come into play. They're all very interesting factors, even if they're slightly different than just the question of what is the best model with a hundred million parameters.

Chris Potts:Right. Do you worry though that there aren't enough incentives right now for people to just explore radically new things? Because that's another way you could lose -- just if no one really bothers to do the exploration. I guess I would've had faith that, people being people and scientists being scientists, they'll do random stuff.

Sasha Rush:Honestly, that's why I took the bet. It was a bet of me being an optimist about the exploration process of science. I want to believe that someone like Albert Gu, who worked on S4, just working in a lab with a relatively small budget, could come up with something totally radically new and interesting, and I want to be behind that.

I'm not opposed to the idea that we, as researchers, prize models and architecture too much. I've heard nice arguments that say questions about engineering and scaling and learning rates are also interesting questions and lead to interesting research on how scaling works or how distributed compute works, and that actually keeping the architecture fixed makes a lot of these questions actually more interesting as well. It's also quite possible that I have a bias towards models themselves.

Chris Potts:Oh, I know exactly what you mean. I try to value all of these things. I don't personally want to be doing all of these things, but I feel like I'd certainly value them. Like, I don't want to myself train one of these massive models, but I'm glad that people are doing it and I think it's incredible work.

But, just getting back to the things you mentioned at the start of our conversation, it would be remarkable if the Transformer was somehow the optimal choice. And that alone -- the skepticism about that -- should lead us to want to do a bunch of exploration. Certainly, there are many incentives, or should be, for trying to find architectures that would be more efficient to train and would require fewer parameters.

Sasha Rush:Oh, absolutely, yeah. I guess one argument for why the Transformer will remain dominant is that Transformer works now. In order to make these models scale better, it will be built into the hardware, then it will become more dominant, and you get this loop, where it becomes even more difficult to explore, which I think, in a long-term sense, is negative, but maybe in the short-term sense is the fastest way to get to larger models.

Chris Potts:Well, I tell my students, don't worry that we're so much poorer than Google. Sometimes being poor can enliven your imagination and lead you to all sorts of new things, so maybe that'll be a good constraint on us.

Sasha Rush:Yeah. I love that argument, yeah.

Chris Potts:One more dimension to all this that I want your perspective on concerns prompt engineering. I can see two perspectives. The Twitter one is to be very cynical about this, that we've now entered into the business of entering text and seeing what happens. We don't understand why we're doing it, we don't understand the system response, but sometimes good things happen, sometimes bad things -- very heuristic and unprincipled.

The other side of this would be that there is something real to discover there, and also that if we could design systems this way, using prompts, it would just open up access to system design to so many more people. I don't know. Is that the right continuum? Where do you sit on that?

Sasha Rush:Yeah, yeah. It's a really good continuum. If I'm being honest, I was at ACL, I went to a bunch of posters about prompts for different tasks, and it did make me relatively depressed. I think it's depressing in the sense that it feels like we're almost clients to the model. We're trying to decipher the way the model works and come up with a way to talk to it. It felt almost a little, I don't even know how to put it -- demoralizing. I thought, really? My life is trying to figure out how to communicate with this pile of parameters?

That being said, I generally take the view that NLP is a field about producing systems to help people work with natural language. And certainly, having a new UI for how ML models work is extremely interesting. I do find that, for many years, people would come to me and they would say something like, "I'm a journalist. I have a lot of Supreme Court records. I'd like to figure out all the times some event happened in these records." I would say something like, "Oh, you've got to annotate data, you've got to train a model, you've got to do this sort of setting." If the real promise of prompting works its way out, such that someone could really pay $10 to, say, some API and type in what they wanted the model to do and get an ad hoc model of that form -- it's hard not to be excited by that from a pure usability point of view.
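A toy sketch of the workflow Sasha is describing, where the "model" is just a prompt. The complete function below is a hypothetical stand-in for whatever hosted LLM API you use; nothing here refers to a specific product or its interface.

```python
def complete(prompt: str) -> str:
    # hypothetical stand-in for a hosted LLM API call; replace with a real client
    return "June 12"  # canned response so the sketch runs end to end

record = "On June 12, the Court heard oral argument in the case."
prompt = (
    "List every date on which an oral argument occurred in the text below, "
    "one date per line.\n\n" + record
)
dates = complete(prompt).splitlines()
print(dates)  # an ad hoc 'event extractor' -- no annotation or training needed
```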

Maybe I'll put it this way. I'm cynical about some of the research questions being asked, but I'm bullish about this as a kind of area or potential application of the technology.
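To make the "ad hoc model" idea concrete, here is a minimal sketch of what prompt-based extraction over those court records might look like. The `query_llm` helper, the prompt wording, and the task framing are all illustrative assumptions rather than any particular provider's API; swap in whichever hosted-model client you actually have.

```python
# Minimal sketch of an "ad hoc model" built by prompting, assuming access to
# some hosted large language model. `query_llm` is a hypothetical stand-in for
# a real provider client: it takes a prompt string and returns the model's
# text completion.

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's client library."""
    raise NotImplementedError("Wire this up to the API you have access to.")

PROMPT_TEMPLATE = """You are reading a court record.
List every date on which the event of interest occurred, one per line.
If there are none, write NONE.

Record:
{record}

Dates:"""

def extract_event_dates(record_text: str) -> list[str]:
    # Instead of annotating data and training a model, we describe the task in
    # plain language and let the prompted model do the extraction.
    completion = query_llm(PROMPT_TEMPLATE.format(record=record_text))
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    return [] if lines == ["NONE"] else lines
```

The point of the sketch is the workflow, not the code: the "training" step collapses into writing the instruction, which is exactly what makes prompting appealing from a usability standpoint and unsettling from a research one.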

Chris Potts:Yeah, and that's common, right? When something new happens, you get this smattering of experiments, some principled, some unprincipled, as people figure out how to home in on the actual questions that we need to ask, the ones that are important, and then, probably we're heading pretty rapidly toward understanding what kinds of things are useful to do with prompts. That seems like all part of the process.

I also love your description that you didn't imagine you'd spend your life figuring out how to communicate with a bunch of parameters. This could be a very strange and alien description of what it's like to get to know other people as well.

Sasha Rush:Fair enough. There are more benefits to getting to know other people, I would say!

Chris Potts:I agree, I agree. But wait, maybe GPT-4 or GPT-10 -- wonderful companions to us all.

Sasha Rush:Possible.

Chris Potts:Do you have time for a few just more personal questions to wrap up?

Sasha Rush:Sure, of course.

Chris Potts:First, since you lead this dual life between Hugging Face and Cornell Tech, are there aspects of the startup life that we should import into academia? Things around recruiting, evaluation, day-to-day work?

Sasha Rush:Yeah, absolutely. I've actually worked at a couple of different startups. I was a software engineer, actually, before I went to graduate school. I do try to import different parts from both worlds. The most basic one, I would say, is fewer meetings. I know it's cliche, but it's just amazing how places like Hugging Face really protect their employees' time in a very extreme way. Academia doesn't seem to have this as a thing at all. Shorter emails, fewer meetings -- just really being serious about the fact that everything around you is trying to prevent you from doing research, and you need to be aggressive about cutting that down.

The other thing I would say is, getting back to some of the earlier questions you asked about software engineering: I don't actually think that we need more software engineering in papers, but I do think we need it in advising and groups. I think the ritual of a code review in industry is just such a good thing, and it's the best way I know for people to improve their writing. Things like code review, or the equivalent of that for paper writing, are extremely valuable.

Then I guess the last thing I'll say is that I think recruiting in academia, particularly for admissions, is extremely out of date. Things like the single decision on students, or the delay between when people apply and when they start. All those things, I think, are really challenging both for professors and for students, and I don't think they need to be so hard.

In a field like NLP, there are so many great professors at so many universities, and you get this weird bottleneck where people, I don't know, apply to only a couple of places and they get rejected, and other people have to play this strategic game. I just wish that whole process was easier for students.

Chris Potts:It's a great point about the delay between acceptance and start. That does feel unnecessary. We dispense with that more or less as soon as you get to the postdoc level. I guess we return to it for faculty. But for grad students, we often make them wait eight months.

Sasha Rush:Yeah. From the start of writing your essay, it could be a whole year.

Chris Potts:Yeah.

What about the harder direction? Are there aspects of academia that should be imported into the startup world or into industry?

Sasha Rush:Oh! Yeah, let's see. I would say the biggest thing I find, when people in industry start doing research, is that they really struggle with figuring out what a research question is. It's not to say that the problems they're working on are not relevant or couldn't be turned into research problems, but research is not about quantity; it's about figuring out a specific thing that's unknown in the world and trying to provide some insight into solving it. I just find that we have a lot of conversations about what it is that we're studying in a project, and how we find a specific area where we can contribute in a particular way.

I think once people in the startup world grasp that, they can write wonderful papers, but it's just a different mindset in some ways. The other thing I'll say is that I think startups mostly shouldn't be doing research. Hugging Face is an interesting example, but startups are extremely hard, and you have to figure out who you are, make money, do your thing. The thoughtfulness and time required for some of these projects are very hard to find in companies a lot of the time. For students who really want to do deep research, I really do think getting a PhD can be extremely valuable.

Chris Potts:Oh, sure. Yeah. No, that's another dimension. I was thinking more about this puzzle: yes, it's hard to think of things that should be imported from academia into industry. On the other hand, universities last much, much longer than corporations do, so as organizations they are somehow much more hardy, and over the course of their lifetime they will produce much more innovation than any individual company. So there must be something we're doing right in this chaotic way we all work that could be brought into an industrial setting.

Sasha Rush:Yeah. Let's see. People often ask me why I'm in academia. I often don't even understand the question. Academia is really wonderful. Teaching is critical and necessary, and that aspect is important. But also, the stuff that's going on in industry right now is remarkable, but it's extremely narrow. Facebook is not Bell Labs. It's not a place where they're doing research across many, many aspects of human life and challenging problems. The fact that there are a couple of places right now doing really good research in deep learning doesn't imply that industry has somehow cracked how research in a broad sense should work, or that it's a replacement for academia.

Chris Potts:Right. That's more like a message that all of these different kinds of organization are playing their role in terms of achieving innovation. Academics do their chaotic thing. When it stabilizes a little bit, it becomes things like startups and the focus of large companies, and none of these entities should try to emulate any of the others. I think that's a reasonable perspective.

Relatedly -- this is more low-level but maybe on the same theme -- I think students are interested in your approach to advising. Do you have a grand plan for your group at Cornell Tech or is it more laissez-faire?

Sasha Rush:Yeah. I guess I'll start off by saying that there are some things I feel I know how to do as a professor, and there are some areas that I still find extremely challenging. I'll be honest and say that strategic planning and future vision is an area I think of as a weakness of mine. One way I compensate for that is that my students are relatively independent in their project selection and the areas they work in, but I'm very hands-on in the specifics of how they execute. I tend to have students who have some area they're passionate about that they push forward, but I require them to write proposals, give me work plans, that sort of thing.

I think one of my students referred to me as a research compiler. He comes to me with problems and I just say, "Error, line 12," and spit out all this stuff. My mentorship strategy is that I'm very involved with the writing and presentation of papers, but I don't try to keep students on my strategic path or push them towards a shared set of goals. Now, that's not to say that's the right way to do it. I'm often very admiring of professors who are able to hold together multi-year, 5-to-10-year projects that all culminate in some great vision. This is just the way I've found works for me.

Chris Potts:It takes all kinds in this regard, because in some measure the students also differ. It sounds like your students ought to be very entrepreneurial, but there are also students out there who would like to be told what the big questions are and then really do innovative things with those guardrails on, and everything in between, so I feel we should all be doing what feels natural to us, as opposed to trying to live up to some ideal. The future planning you described may be a better path, but it could be catastrophically bad: if you make the wrong bet, you've ossified, whereas your current approach is going to be very robust to changes in the intellectual scene, in your own taste, and in everything else.

Sasha Rush:Yeah.

Chris Potts:I'm like you, so I'm telling myself a story about why this is the right approach.

Sasha Rush:Yeah, no, it's also possible that I started as a professor in a very chaotic environment, and so maybe that's what I got pushed towards, yeah.

Chris Potts:Final question. I was an undergrad at NYU and I lived in New York for a while. I'm glad I did it, but I never really became a New Yorker, I feel. How about you? You're a relatively new arrival. Have you become a New Yorker?

Sasha Rush:Yeah. I've been in New York for about three years, but I also lived here maybe about 10 years ago. I was a grad student at Columbia. I'm definitely a New Yorker, and I think the pandemic solidified that. We got through it in about a 600-square-foot apartment, my wife and I, and that was a tough two years. But yeah, let's see. I'm trying to think of New York things that I end up doing. I play tennis, and getting a tennis court in New York is insane. I often wake up at 6:30 in the morning to try to reserve a tennis court. They're in use 24/7, so you have to get up and get in line to get one.

Then, I live in Brooklyn, but my university is on Roosevelt Island, which is in the middle of the East River, between Manhattan and Queens. I normally bike to work. We have relatively good bike lanes now to get there, but there's always a bus or a truck parked in the bike lane, so you have to knock on the wall or try to get them to move to actually make it work.

We have good coffee shops here and they all turn into bars at like 4:00PM, so I'm often trying to write my final line of the paper and get a coffee before they turn the lights really low and turn the music up and kick me out. I've gotten used to that as well.

Yeah, New York's great. I think it's a great place to be a graduate student, and it's become a center for NLP. We have great folks at NYU, at CUNY, at Columbia, a bunch of people from Google, Hugging Face is here, Amazon. Yeah, lots of folks around the city.

Chris Potts:I take it you've moved. In the podcast with Sean Welleck that I listened to, I think you lived in Queens at that time. You moved to Brooklyn?

Sasha Rush:Yeah, I moved about a year ago, yeah.

Chris Potts:Where in Brooklyn?

Sasha Rush:Actually, not that far. I'm in Williamsburg.

Chris Potts:I lived in Park Slope for the last year I was an undergrad. It really did me good to move out of Manhattan, because I had all the advantages of the city, but I was right near Prospect Park, and during the day it would calm down a little bit, whereas Manhattan never calms down. I think what I found hard about being a New Yorker is the constant sensory input, whereas people who are true New Yorkers just thrive on all that energy. Actually, you must be like that, because you described with real enthusiasm how difficult it was to get a tennis court, and that's just a byproduct of all of the energy and excitement and movement and everything else.

Sasha Rush:Yeah, it's true. I'm definitely one of those people who likes when there's people on the streets at all hours and just constant noise and things of that form. Obviously, there are downsides, but I don't know. I think I'd prefer this style.

Chris Potts:Wonderful. Well, thank you so much for doing this. This was a wonderful conversation. I feel like I learned a lot and have lots to think about. I really appreciate it.

Sasha Rush:Yeah, thanks so much for having me. This was great.