Podcast episode: Richard Socher

May 2, 2022

With Chris Potts and Dhara Yu

The early days of the rise of deep learning in NLP, conviction, the importance of applied work in the current moment, start-up risks, the state of Web search, paramotoring, and overlooked gems in the U.S. National Park system.

Show notes

Transcript

Chris Potts:All right. Welcome everyone. Thanks so much for joining us today. I hope most of you are outside enjoying the fine, non-smoky weather, or whatever else we get to do now that COVID is somewhat subsiding, at least for the moment.

I'm delighted to welcome Richard Socher. Just by way of quick introduction, let me just boldly say that I think the research he did as a PhD student at Stanford was formative in the revolution that brought deep learning into NLP. It's really a remarkable body of work. If I have the chronology right, after his PhD, he founded a company called MetaMind, which was pretty quickly bought by Salesforce. When it was bought, Richard became Chief Scientist there, and that coincided with what looks like an incredibly vibrant and innovative time for Salesforce research, and it looks like lots of young scholars and careers were nurtured while Richard was there, which is very exciting.

He has since left Salesforce to found a new startup you.com, which I want to learn a lot about today, and he's also now founded an investment fund, which we should also learn a little bit about. And last but not least, certainly, he is, I think, the only one of our distinguished speakers this quarter who is also an alum of this course, and that's very exciting. So, we can ask him about what he did for his course project and how he got it right and what that launched for him.

And I also want to hear about your various dangerous seeming hobbies, Richard.

So, welcome! And I wonder if we could just go back in time a little bit. I am curious to know, do you remember taking NLU in what must have been 2010?

Richard Socher:Of course, I do. Yeah. It was a great class. Very fun.

Chris Potts:It was the year before I taught it, but I kind of audited it, because I was hoping to inherit it from Dan, because at the time it was taught by Dan Jurafsky and Bill MacCartney. And so I watched very closely and I can actually remember your final presentation and I have your paper here. I was going to quiz you a little bit about your paper. Do you remember the paper?

Richard Socher:Yeah, I think so.

Chris Potts:"Learning events for narrative schemas with multi-label spectral clustering."

Richard Socher:That's right. Yeah, that was probably the second-to-last or the last project I did before I was fully in the neural networks mindset.

Chris Potts:That's right.

Richard Socher:But even the spectral clustering – it's a way to take vectors and map them into some lower-dimensional space that captures some properties of the manifold you assume is there in the higher dimension. Even there, I realized, man, the most important thing is: what are the vectors that we're then doing all the other stuff with afterwards? We need to find some really good vector embeddings for language. It was another motivation, together with some others that came from the feature engineering issues I saw, that made me want to work on deep learning.

Chris Potts:It's also striking to me because it's a very intricate and difficult empirical problem, this kind of narrative schema thing, which I think is also a hallmark of the work that you did – serious compositional semantics. Do your tastes also run to that kind of very intricate empirical problem?

Richard Socher:Yeah, for sure. One thing that's so fun with language is that it's just this ever-evolving system that humans can make up. You have some teenagers and they say "YOLO," and now you have a new word, and now you have to study it and understand where it came from and how it's used and everything.

At the same time, math is this thing that is still true like 10 light years in that direction, and you try to use that very powerful, general tool to deal with this very interesting, messy, human condition. So it's a very interesting problem, and I think narrative schemas are one example. To be honest, I kind of mostly took the narrative schemas from another PhD student. I want to say David Chambers.

Chris Potts:Nate Chambers. Yeah.

Richard Socher:Whew. All right, close but not quite. Nate was working on these problems, and he had defined this very intricate hierarchy manually, and I wanted to learn how to at least map into it and cluster it more automatically than with the various intricate features that he also used to get there.

Chris Potts:And so you mentioned that, at the time, you could see this deep learning thing start to happen. What were those moments like? I think we got to Stanford in the same year, 2009. Is that right?

Richard Socher:Yeah.

Chris Potts:But I wasn't really part of these discussions. So, I've heard tell. But what's your reconstruction of the rise of deep learning for AI, I guess, but in particular maybe for NLP, here at Stanford?

Richard Socher:Yeah. Here at Stanford. So, it basically all started in the computer vision world, and in particular in Andrew Ng's world. And I can still remember very vividly how Daphne Koller was like, "Ah, Andrew could work on so many amazing things. Why did he have to choose these neural network things? They're this weird sub-niche that is always promising stuff, but never quite delivers as well as SVMs or graphical models." But Andrew loved the idea of replacing feature engineering and actually having these features be learned, and so I saw that in computer vision.

I did my undergrad in Linguistic Computer Science. So: a lot of linguistics and also a lot of computer science and math – mostly, like 80%, 90%, CS and math, operating systems, hardware, all of it, and then 10% the minor in Linguistics. But I enjoyed the minor so much in my first year and a half that I did both the bachelor's and master's minor in Linguistics. That was back in Germany. So I did a lot of linguistics, but then I felt like the linguistics folks, and even the computational linguistics and NLP folks, in Germany were just doing too much manual definition of everything. I wanted more artificial intelligence – the AI figuring out more of the details, instead of going through many specific examples. And so, I switched in my master's to do more computer vision, like segmentation in medical images and things like that. Then in the PhD, I'd always wanted to go back to language, and I finally found Chris Manning, who's done a lot of AI.

But I also realized, when I worked with some of his PhD students, like Jenny Rose Finkel and others, that they wrote papers about really cool conditional random field models and undirected graphical models. And the paper was like 80%, 90% about the CRF – the Conditional Random Field – and the machine learning model. But really they spent almost 80% of their time designing the features that went into that model. That can't be right! I don't want to sit here and spend all my time engineering features and then talk mostly about the model, which didn't seem as crucial to making it a state-of-the-art model as the features were.

So I was kind of not happy about that, but I didn't know what to do about it. And then I saw the deep learning folks in computer vision say, "No more feature engineering, you just give it raw pixels! And it will figure out that the important features are edge detectors, and then corners, and then combinations of those in deeper layers that are eventually like fully-formed eyes composed of multiple edges and things like that." And I was like, "Wow, that is really cool!"

But at the time all of that research was on literally 32 x 32 pixel images – tiny little MNIST digits, or tiny little thumbnails of real images – and they were all fixed size. No one was working on variable-length anything. It was all fixed-size images. Basically I thought: all my linguistics and syntax and Noam Chomsky and X-bar schemas and all of that, and formal semantics and various parsers and HPSG grammars and all these things – they're all compositional in some way, because language is compositional. And I also knew that in order to convince the NLP folks, I had to take a new model formalism, but also combine it with things that they cared about.

This sort of crystallized itself over the next year or two after that. So, I basically, somewhat embarrassingly, reinvented from scratch Recursive Neural Networks, which basically take word vectors and then merge them up a syntax tree more and more. At first I just had this idea of, "Oh, I'm going to stick to roughly the syntax of these, and maybe it can be arbitrary. Maybe it's just always sort of the same symmetric tree structure, but they should kind of merge into larger phrases and so on." And then at some point I realized, "Oh wow, I'm actually using that same function over and over again. So, it's technically a recursive function, and I could call it a recursive neural network." Then I was very excited for two weeks or so that I had invented this completely new model.

Then I was like, oh wait. I googled it, and then I realized, "Oh man, this was actually invented back in the eighties, on tiny grammars of like 10 words or so with binary vectors." The models had been around, but no one had ever used them for anything really useful. I was still excited because it was, at this point, my baby, and we had these really powerful word vectors, and so we basically just kept at it.

We eventually combined Recursive Neural Networks not just for language, but also for images. While a lot of my papers got rejected in the NLP community for a long time, the work gained a lot of traction in the AI and machine learning community. One paper won a best paper award at ICML 2011, using the same network architecture to map sentences into vector space and images into vector space, and we did the first work to map them into the exact same vector space. So, I can find images based on sentences and vice versa, like sentence descriptions of images.

At some point I was like, "This is clearly the way to go," but all the way until 2014, I could convince almost no other PhD student to work with me on neural networks. They were all like, "Ah, Richard, that's your niche. I want to continue doing what the main community thinks is the right thing to do." But then, after 2014, a lot more people got into deep learning for NLP. We started teaching the first-ever class on Deep Learning for Natural Language Processing, and eventually, after we'd done that for two years, Chris Manning came and said, "Okay, well, every model that's state of the art is now a deep learning model. I don't really want to teach the NLP class anymore without deep learning. So, why don't we just teach it together?" We did that, and then he went on sabbatical for a year, and I just taught 224, and then, yeah, I was just too busy with other things.

Chris Potts:There's tons of stuff in there that I want to unpack. The first thing that you said that really resonates with me is this notion of writing feature functions. Because, to this day, since I'm a linguist, people often seem to assume that I long for the days when you got to write feature functions by hand. And my response is always like, "Well, I found that kind of dismal because I have a theoretical idea I approximate with this feature function, the approximation feels wrong, and I have to hope against hope that my model knows what I'm trying to do and has the kind of data coverage that would make this impactful." Fully half the time none of those things fall into place. And so you just feel that you're in this cycle of kind of like not quite doing the right thing in hopes that some aspect of what you did will have some empirical value. Whereas now, in the deep learning era, you get these holistic representations of everything, which clearly embed lots of interesting latent structure that matters to me as a linguist. That's significantly more exciting than that old hacky stuff we used to do.

Richard Socher:I'm glad you feel that way. Chris Manning also, to his credit – when I came to him and I had all these ideas from the computer vision world and neural networks, I still remember that meeting – he said, "Look, I don't know anything about neural networks right now either, but I'm willing to learn, and we'll get through it together." And obviously he knew everything there is to know about NLP in general. So he helped me bring those techniques in, and sell them in some ways – because science is also a sociological system and you need to convince people of your ideas – sell them to the rest of the NLP and linguistics community, and he was really, really helpful. And now obviously he's also a world expert. To me that's actually something that is not as common. A lot of people – even professors – will crystallize at some point and be like, "This is my thing." And then they kind of fight ideas that would make their thing not as important anymore. It's one of the reasons why Stanford is so amazing – that people like you realize, "Oh wow, there are new ideas. Let's just work with them and do new things." So it was really fun.

Chris Potts:So, I love all that. I suppose there's one aspect that I can kind of sympathize with, though. Just think back to 2008, 2009, 2010. A student comes to you and says, "I want to do tree-structured neural networks." And your only experience is the Smolensky paper that you're alluding to, and other stuff that seemed kind of visionary but impractical and had never been proven empirically. And you think, for the student, "Well, I'm going to put them on a very risky path. And if I just nudge them over to something that's a little safer – like, say, CRFs: well understood, successful empirically – both of us are going to be happier, and fundamentally I won't feel like I have tricked the student into doing something that was suboptimal." At the time, as an advisor, you might really have worried. And I suppose you got that kind of reaction and just pushed through it, or with Andrew pushed through it, or with Chris pushed through it, or something fortunate like that. Is that right?

Richard Socher:Yeah. And to be honest, there's a little bit of stubbornness and excitement, too, that you have to have – and conviction that there is a fundamental issue with the way things are currently done, and that your way of thinking about all these problems can eventually be better. When my first papers got rejected, I basically had to compete with like a decade of feature engineering, and I could train a model from scratch with zero feature engineering, and my model would get almost as good or just as good as the current state of the art. But it wasn't yet like 10% higher in F1 or accuracy or something. And so my papers all got rejected. And I'm like, "But why? It's already so exciting. I just skipped 10 years of feature engineering in a two-month project and the whole model learned everything from scratch by itself."

I thought it was already really exciting, but it took years and many rejections from conferences. I wish I'd saved these – they must be somewhere in my email archive, in different formats and downloads and stuff – but somewhere I have rejection emails where, even though the results were good, they said, "Oh, but neural networks are this thing from the '90s. Why are you submitting this to an NLP conference?" I was so crushed at the time. It's the one thing you've been obsessing over for four or five months, right? And then your paper gets rejected. You have to be a little stubborn and just keep at it, if you have high conviction. And in that sense, it's actually kind of similar to start-up life, where it's also a rollercoaster and you also get rejected in various ways for your ideas and you have to keep at it.

Chris Potts:So, one meeting we had that I remember, and I guess I kind of hope you've forgotten it (although I'm about to remind you), is when you did come to my office for whatever reason and describe what must have been the basic tree-structured neural network with the single composition function – that's just some weights, and then you just do the concatenation, and you can do that recursively. And I think you successfully showed me that this could be a recursive process. And then I was like, "Wait, but all you have is this one set of parameters W?", and you were like, "Yeah, that's fine!" And I was like, "Okay, that's fine," but you weren't my advisee. And I think I already knew that you kind of knew what you were doing, but I think I was like, "Well, that's probably not going to work – one set of parameters for the entire language. No way. Surely you need at least different parameters for noun phrases and verb phrases," and you were like, "No, that'll be fine. It'll be fine."

Richard Socher:That led to the paper on compositional grammar RNNs, which made them even better. The models are now even larger, but I think the biggest thing was that we also had all these parameters in the word vectors – that's really a vocabulary of 40,000 words times like a hundred or so numbers. So, that's where even more of the parameters were hidden, so to say, not in the composition function. But yeah, we had to eventually make the composition function a lot more powerful. And now of course we even have models with millions of parameters in those composition functions, with multiple layers and everything.
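For readers following along, here is a minimal sketch – not code from the interview, with all names and dimensions purely illustrative – of the single shared composition function and embedding layer being discussed:

```python
# Illustrative sketch of a recursive neural network composition function:
# two child vectors are concatenated and mapped back to the same size,
# so the SAME parameters W, b can be applied at every node of a parse tree.
import numpy as np

d = 100                                      # vector size (illustrative)
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((d, 2 * d))   # the single shared weight matrix
b = np.zeros(d)

def compose(left, right):
    """Merge two child vectors into one parent vector of the same size."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# The embedding layer is where most parameters live
# (e.g., a 40,000-word vocabulary times ~100 dimensions).
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
E = 0.01 * rng.standard_normal((len(vocab), d))

# ((the movie) (was great)) -- the same function is applied at every node.
the_movie = compose(E[vocab["the"]], E[vocab["movie"]])
was_great = compose(E[vocab["was"]], E[vocab["great"]])
sentence_vec = compose(the_movie, was_great)
```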

Chris Potts:Right. I wonder if I would've been reassured if you'd reminded me about the embedding layer. I might have been like, "Well, okay, still."

So, I think I'm getting a sense now for how you ended up doing tree-structured networks and being the first to put those into practice empirically, because it was a fusion of your proper training as a linguist in Germany with all this other stuff that you aspired to do. Is that the history essentially – that you were trying to find common ground, or did you believe in it, or some mixture of those things?

Richard Socher:Some mixture. I pitched other ideas, and there was really only one paper in deep learning for NLP before mine that inspired me and was interesting and actually had real results: Collobert and Weston 2008, at ICML. They had just worked with fixed windows around words, classifying the center word by using two or three words to the left and right of it. But they also had word vector embeddings and things like that – very first versions – and they trained things end-to-end with like a one-layer neural network. And so, I thought: we clearly have to deal in some principled way with variable length. I kind of threw out ideas of just, "We just go left to right," but the models that we had at the time were so small that you lost so much of that context that I felt like they weren't quite ready yet.

And then there's this little bit of the idea that you bring up, which is: I wanted to convince the linguistics and NLP crowd also, and it felt like the right thing to say. "Yes, everything's a vector, everything's a neural net, everything's end-to-end learned. There's no more feature engineering." That's a lot to digest for people who have spent their careers doing that. But at least you still have some things that you're familiar with, like syntax and grammar and things like that. So, it was a little bit of a Trojan horse, if you will, into that community, to convince them.

Now what's interesting is we see more and more folks doing research outside of academia, where they don't have to convince anyone. They just have to get really great results, and it doesn't really matter how, and that allows them sometimes to be even freer in their models and also their time horizons, right? Like models like GPT-3, which is really exciting. We can talk about it a little bit later, but in the years up to GPT-2, I worked with Stephen Merity and others on the most accurate language models. We had the lowest perplexity for several years. (Perplexity is sort of just how well you predict the next word: the lower it is, the more probability mass you put on the correct word that really does happen next.) And so we had the best language models.
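To make the perplexity definition above concrete, here is a minimal sketch – not from the interview, with made-up probabilities – of how the number is computed:

```python
# Perplexity: exponentiated average negative log-probability that the model
# assigns to the word that actually comes next. Lower is better.
import math

# Illustrative p(correct next word | context) for a toy four-word continuation.
next_word_probs = [0.20, 0.05, 0.50, 0.10]

avg_neg_log_prob = -sum(math.log(p) for p in next_word_probs) / len(next_word_probs)
perplexity = math.exp(avg_neg_log_prob)
print(round(perplexity, 2))  # ~6.69: roughly like guessing uniformly among ~7 words
```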

I tried to convince him: "Hey, let's just scale this up a lot and just see if we can generate something interesting." And he was like, "Ah, it's a lot of engineering, and I just want to work on the models – the AI and machine learning models – something interesting and novel mathematically and conceptually, rather than spend all my time on engineering and multiprocessing systems, GPUs or TPUs, or having to train on multiple machines. You have to do a lot of engineering to train these systems." And I didn't have enough conviction myself, and I didn't feel like I wanted to force people to do something they didn't want to do. And so we didn't, and then GPT-2 came out and it was way, way larger, but the model was the same and the idea was the same. It's just much, much larger.

So to have that conviction was really amazing – that OpenAI just said, "Look, we don't have to actually innovate on these models anymore. They're general function approximators, they're really smart. They're really good models that are big enough and have enough general function approximation capability that we don't have to innovate on that front anymore. We just have to make them larger and larger." To have that conviction, and then actually spend millions of dollars training a single model, and have the money and the funding and the salesmanship to convince people that that would be useful to do – all of that. It's an exciting time that we've been living in these last couple of years.

Chris Potts:So, but just a couple quick follow-ups on that. So, I recall that you have a paper that came out just before or around the time of the ELMo paper that had the same idea about contextual representations. Is that what you're referring to? Or is this later?

Richard Socher:This was even before that – sorry, this was actually around roughly the same time. Yeah. So, CoVe, Contextual Vectors.

Chris Potts:CoVe, yeah! But that's the same story, right? If you had scaled it up, it would've been ELMo I'm guessing, right?

Richard Socher:Yeah. So, the ELMo paper, in their defense, cited us in like 5, 10 different places, and they said, yeah, we just took that idea from CoVe, but they replaced translation with language modeling, and that was their innovation. And then BERT kept the language modeling but replaced the LSTM with Transformers. That's where, for a lot of people, the history starts – with BERT. But, yeah, there were a lot of ideas that were sort of in the community.

Chris Potts:Absolutely. You've cued this up nicely Dhara. We can't keep these students at bay any longer. What's this overarching student question for Richard?

Dhara Yu:Yeah. We have a couple of student questions that are all sort of circling around the same idea. They all want to know what you think is the next big thing in NLP – beyond neural networks or specifically the Transformer architecture or if there is anything that is next.

Chris Potts:That was one of my questions: Is attention all you need? But we can get to that later. Yeah.

Richard Socher:Okay. What is the next big thing? It's kind of interesting – you might not like this answer, but there are various ways to answer this. Tell me if I'm spending too much time on it. I think, actually, there are times where you can have a lot of impact in pure research, and then there are times where you actually can have more impact in the world by taking that research and applying it to a lot of problems. So, when electricity was invented, it made a lot of sense to figure out AC/DC and how to transport it and then how to generate it, and different kinds of electrical engines and gramophones – all these different things. But then, for a good couple of decades, it just made sense to take the idea of electricity and make it light up every city that you can find, and take anything that moved and find ways to make it even better, and just apply this idea and build applications.

I actually think right now we are a little bit – and it's one of the reasons I stopped focusing on pure research for now – we're in a time where we've done so much innovation and there's still so much more that can be done with the existing models in terms of all different applications that you can actually have more impact, if you will, in the applications and industry than in sort of doing pure research.

For NLP: you have law – automating various aspects of law. You have chatbots that you can have conversations with. Some people think that's sad, but Replika – a company I invested in – is really powerful. They have some really powerful stories. It's essentially a chatbot friend. It started from the chat logs of the founder and a friend of hers who had died, whom she still wanted to have conversations with and was really sad to have lost. It helped her to have these conversations with this chatbot trained on their old chat logs. And now they have incredibly powerful stories from people – they asked if they could share them, and those people said yes – who told them, "Look, I was going to commit suicide and your bot convinced me not to do it, and I thank you." I sometimes get random wacko emails like, "Is it a real intelligence? Is it conscious behind there?" And I try to tell them it is just a large model. And some people think that's sad, but I think it's actually very powerful, and it's essentially like a journal that is friendly and supportive and asks you questions back.

So this is another application. There are so many others: helping lawyers keep track of time or write simple legal documents that would otherwise cost a ton of money or take a long time to write. Obviously search is the biggest one for accessing information. Summarizing information is incredibly important. And it's unclear to me that we need a different general function approximator. It's also clear that the models we're currently exploring are only those that work well on the current hardware, and that is a major restriction. We always like to think, "Oh, it's just something beautiful and brain-inspired or theory-inspired or something." But really, the thing that hardware is really good at right now is large matrix multiplication. And so we're only thinking about models that have fast matrix multiplication.

To be honest, if it wasn't for that – if we had different hardware, like more CPU and multicore-CPU architectures rather than really massive GPU architectures – then maybe eventually recursive neural networks would make a comeback. They're also just general function approximators, but because they're so much slower on the current hardware, which is really good for just large matrix multiplies, they fell out of favor.

To be honest, in some ways, it doesn't even matter anymore. Just take any general function approximator that is highly parallelizable and easily trainable. It's efficient, you can optimize it, it has lots of knobs to tune to help you get out of local optima – and then just think about what the objective function is. So if there is research that I think we still need to do, it's actually less on the models and more on what objective functions these models are trained on.

If you want to go even further, I think one really interesting, almost philosophical problem – but also sort of a roadblock on the way towards AGI – is to think about letting an AI model actually choose its own objective function, or have a sequence of increasingly more complex objective functions. It's a little bit inspired by curriculum learning. And there are some papers on that in deep learning, and we actually also discovered anti-curriculum training – you actually start with the hardest task, and you can throw the simple tasks in later and it'll be easy and fine.

But the idea is that, eventually, to really have a conversation with an AI, it needs to have this sense of "I want to achieve something." Right now, it just predicts the next word, and you can have conversations, but the conversation will just meander wherever you push it, because the AI doesn't really have a goal like, "Oh, my goal is to convince this person that, I don't know, vaccines work or don't work." (I guess humans can choose what they want, unfortunately.) So to have a kind of higher-level objective function than just predicting the next word – that is what will drive, I think, the next wave of innovations. And then of course, there's the standard stuff of finding interesting data sets and just applying the models to lots of different problems. I could talk about it for much longer.

Chris Potts:Can I pull out a few themes from your answer? So, just going in chronological order: of the things that you pointed out as really exciting among the current developments, a lot are about creative expression. And in one of these previous discussions we talked about how one of the best long-form essays selected in 2022 was actually by a woman who had used GPT-3 to help her write about a loss in her family. And it had a similar tenor to the one you described about losing a friend and then continuing conversations with a chatbot. These are not the kind of hard-nosed applications that we tend to pose for ourselves, which would essentially be oriented toward market research or, as you pointed out, the legal domain. What do you make of that? Is that a continuing trend? People love the DALL-E 2 pictures. They have no real purpose beyond just creative expression, right? Should we embrace that as a field?

Richard Socher:Yeah, it's really fascinating. A lot of people thought that creativity was sort of the last bastion for humanity that AI will definitely not be able to mess with, and that was a very wrong prediction by a lot of people. I'm very careful about my predictions, so I did not make that one. I actually think it's incredibly inspiring to see that AI can do that, because what that means to me is that the bar gets lowered and it's much easier for people to be creative.

It's not a perfect analogy, but think about the printing press – Gutenberg, the original printing press. You could say, well, a couple of people lost their jobs who would transcribe books manually and had really beautiful calligraphy and everything. But really it meant that more people were reading books, and more books were published and read and produced and distributed and everything. I think, similarly, what we'll see with sequence models broadly construed – and that includes music and visual art and poetry and writing longer-form texts in the future – is that people will more easily let their creativity out, because the tools basically make it much, much easier for them to just have an idea and then have all the details almost be automated for them. And that will, I think – similar to the book and the printing press – unleash creativity.

So now, you can just say anything to DALL-E and it'll give you a visual representation of that. The idea that you had, what you asked it to do – that's what will matter more. If anything, the skills of actually drawing and being very good visually and drawing realistic things, those will be less important, but making DALL-E do something interesting – that initial spark of creativity – that's the kind of skill that will be even more important in the future. I'm excited about that application too. But, again:

Attention is not all you need for the long term. You need to think about other things like objective functions and long-term memory, which most of these attention models don't really incorporate, which is why they're pretty bad at writing really long texts. We actually worked with an award-winning writer in Germany with our version of roughly a GPT-2.5 equivalent that we had worked on at Salesforce. It was fun for a paragraph or two, but once he already had multiple characters – some got married, some got murdered, or whatever – and then he wanted to write another paragraph, it didn't know all the backstory of those characters at that point, and so it just started creating less and less useful paragraphs.

So long-form AI in NLP is an open area of research that I think we'll see more results in, but it's also getting harder and harder, because the longer the form, the fewer good training data points you have – there are not as many books as there are random paragraphs on the internet.

Chris Potts:Just one more thing. For the second part of your answer, I actually heard that as a call for more research, not applications, because it sounded to me like what you were saying is: our models just do matrix multiplications. In turn, we have a lot of hardware that is oriented toward making that fast, which is leading people on the modeling side to do even more matrix multiplications, which is leading to a narrow strand of innovation on the machine side. That sounds like the sort of situation that should be disrupted, but that will be disrupted by researchers, not people doing applications – or no – or what?

Richard Socher:Obviously I will never argue for stopping research or something! So yes, we should continue doing research! There are some very interesting companies around quantum computing – just getting much faster and rethinking the bit very fundamentally, thinking about it as qubits and everything – and optical computing, just trying to get the big matrices closer to the computation. So there's a lot of research in improving the hardware, and that's exciting too. But when you look at all of these companies and research labs in quantum computing, it's going to take a very long time before they have real impact on computation, whereas there's just so much low-hanging fruit right now in using the techniques that we have and applying them to real problems.

So I'm advocating for continuing to do research. I just think that, if you're thinking about your career currently and what you want to do, you can have a lot more easy impact – it's not going to be super easy, but easier impact – in the applications. It's a little bit like physics, right? Physics has been around for a long time and obviously we still need to do a lot more research in physics. We don't have levitation and crazy light-speed travel and all of those things. But, man, if you want to have impact in physics, because the field has been around for a long time, it takes so much. You've got to be so insanely smart, and you've got to have billions of dollars and build large hadron colliders with thousands of other people over like 10, 20 years, and then you maybe find some new matter particle.

Richard Socher:So yes, continue doing research, but there are so many impactful things you can do with the current technology as-is. Just find interesting data sets and apply the techniques to them, find workflows that people do manually, and then either make them 10x more efficient or just automate an entire part of them.

Dhara Yu:We have another question from a student here, which I think is interesting and kind of tracks with your observation that everything's converging on the Transformer architecture just because hardware is built around that. The student question, from Imran, is: technically, any set of data can be represented as text. So is it possible that many models, presumably on other modalities, will just become NLP models?

Richard Socher:That's a great question. Actually, it's one of the reasons I love NLP as the main AI application – it always felt like that's where most of the innovation happened. That's not always true; like I said, neural networks were first majorly and almost exclusively applied to computer vision, and then I was one of the first to bring some of these ideas over into NLP. Obviously, language does have a lot of interesting properties, and whatever you find there lends itself well to lots of other modalities. So, for instance, we were thinking about how we could have impact with these large language models, but not compete on training a single model that costs like $2 to $5 million, and still be very impactful.

So we actually thought about the language of proteins – just sequences of amino acids that govern everything in medicine, our human body, and life in general. So we trained the largest language model on protein sequences, with hundreds of millions of protein sequences. And then we asked the model to generate new proteins, and we did the extra work and worked with wet labs to synthesize those proteins. And we found that they are vastly different from anything in nature, but because they captured the language of proteins and the grammar that underlies life, they work better in many cases than the naturally occurring proteins. So you're 100% right. Even proteins are essentially sequences, and the language of proteins can be captured in these models.

A couple of years ago, Frances Arnold won the Nobel Prize. She took protein sequences, randomly permuted a couple of amino acids, and then tested experimentally whether they were good. If they were better – in terms of pH robustness or temperature robustness and things like that, or a certain function, a gain of function that you want to have – she just took those, randomly permuted a couple of amino acids again, and went through this process again. It's very slow. And then she got to proteins that are 2% or 3% different from nature, and won the Nobel Prize for that, because that was a pretty big breakthrough, and it produced some chemistry that no one has really fully understood yet – why it's so much better.
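As a greatly simplified sketch of the "mutate a little, keep what improves, repeat" loop just described – not from the interview; the sequence, fitness function, and loop length are all placeholders – the idea looks roughly like this:

```python
# Toy directed-evolution loop: randomly change a couple of positions,
# keep the candidate only if it scores better, and repeat.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq, n_changes=2):
    """Randomly swap a couple of positions to new amino acids."""
    seq = list(seq)
    for i in random.sample(range(len(seq)), n_changes):
        seq[i] = random.choice(AMINO_ACIDS)
    return "".join(seq)

def fitness(seq):
    # Placeholder: in reality this is a wet-lab measurement
    # (e.g., activity, or temperature/pH robustness).
    return sum(1 for a in seq if a in "ILVF") / len(seq)

best = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative starting protein
for _ in range(10):                          # a few rounds of the loop
    candidate = mutate(best)
    if fitness(candidate) > fitness(best):
        best = candidate
```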

Long story short, the ones that we generated from this model are 40% different and they're still functional. They still fold and they have, in the case of lysozymes, even more antibacterial properties than naturally occurring lysozymes. And so it's like a whole new world that is very exciting.

There's also some work on artificial and supported creativity – supported creativity in the music space, where you can kind of put together a base rhythm and a tune and then just say, generate me a new song. Lav Varshney is doing some work on that – we worked together at Salesforce a couple of years ago, when he was a visiting professor there. And they're thinking about doing a startup around this too, where eventually – what I want is that you can just whistle a melody, say "give me an orchestral version of that whistle," and then you get like 50 different lines and you can have a pretty epic orchestral song.

I think we'll see a lot more innovation. Some things make more sense as just a big picture. Ultimately, if you're in computer vision and you already have one large matrix of pixels, why make it into a sequence and then try to retroactively deal with that? Obviously the visual cortex has a lot of highly parallelized processing in it too.

It's also beautiful to think of all these different models as one very large model. I actually think multitask learning in general is super exciting and we'll see more work, hopefully, in multitask learning. Right now, if you really want the state of the art model that does one thing really well, most people train that model for that specific task and then they kind of forget most of the other tasks. Clearly humans can learn one task really well without forgetting a lot of their other tasks that they can do. And so I'm really excited about that, but it's hard to publish papers in it, because everyone wants to have that state of the art number, and showing that you have almost state of the art on like five different tasks, it's just harder to publish, because someone will be like, "Well, but I only care about this one task and on this one task you're not doing as well," as a reviewer. So there's some structural issues.

Chris Potts:But some space is opening up for that, I feel. I feel the field is warming up to the idea that performing well at a bunch of tasks, and certainly performing well few-shot or zero-shot, could be significant even if your numbers are lower or you have a smaller model. Distillation – people tolerate numbers that aren't state of the art if your model is much more compact. So maybe the number is better along a different dimension. I feel optimistic about that.

Richard Socher:I hope so. We tried this with decaNLP: you have one model that solves 10 different problems. We called it the Natural Language Processing Decathlon, or decaNLP. We published a paper and kind of defined a new number that summarizes how well you do on all 10 tasks, and no one has been able to beat our numbers across those 10 different tasks. It's incredibly hard to have one model that does well on English-to-German translation and then also on English sentiment and English summarization and things like that. But it's super exciting.

Chris Potts:Maybe the field will get there.

I want to ask a question about your predictions for the future, but not around models and tasks so much as around data. What do you think the future is of creating data that's going to enable the next generation of big things?

Richard Socher:Oh boy. Yeah. It's really hard. Data labeling is becoming sort of the new feature engineering almost. People don't talk about it as much, but it's so important and that's how you spend almost all your time now – get as much raw data as you can.

Chris Potts:That has to change.

Richard Socher:Yeah, and it's hard. It's hard to fully change it. I think, as we get better and better with large language models at capturing general linguistic and world knowledge in an unsupervised way, my hope is that eventually you can get away with less training data, and we already see that with few-shot learning models. So I think in NLP, at least, we will see more of that.

The computer vision community has now taken ideas that we had in NLP, like taking out one word and trying to predict which word it was based on its context. People have done that with pixels – you take out one chunk of the image and you try to paint it back in based on the context – and those models have gotten much better.
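As a minimal sketch of the self-supervised recipe being described – not from the interview, purely illustrative – the same "hide part of the input and predict it from context" idea applies whether the pieces are words or image patches:

```python
# Build a masked-prediction training example: hide one position and keep the
# answer as the target. A model is then trained so its prediction at that
# position matches the hidden token; for images, tokens would be pixel patches.
import random

def make_masked_example(tokens, mask_token="[MASK]"):
    """Pick one position, hide it, and return (corrupted input, (position, target))."""
    i = random.randrange(len(tokens))
    corrupted = tokens[:i] + [mask_token] + tokens[i + 1:]
    return corrupted, (i, tokens[i])

sentence = ["the", "movie", "was", "surprisingly", "good"]
corrupted, (pos, answer) = make_masked_example(sentence)
print(corrupted, pos, answer)
```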

So my hope is you can eventually use less and less training data. There's some interesting work that Chris Ré is doing too, on defining sort of high-level feature functions that then automatically label a ton of different specific samples.

There's also a lot of work on AI-assisted labeling. If you label a tumor in a breast cancer image, and there are usually lots of little tumors there, then you can do that 50 times, and the next time it'll already overlay, and you just kind of say yes, yes, yes, or no, and can more quickly go through more images. Then hopefully you eventually get fewer and fewer of the obvious cases and more and more of the corner cases that are really hard judgments to make, and you can spend your time on those. So I do think we'll see increased efficiencies there.

Chris Potts:Dhara, are there other student questions about Richard's prognostications for the future or should we switch gears?

Dhara Yu:No, I think that was it.

Chris Potts:Hey, Richard. I have a sense for your answer to this question based on our conversation, but let me ask it directly. So you founded MetaMind in 2014 or 2015?

Richard Socher:In 2014. Yeah.

Chris Potts:In 2014. So how would MetaMind have gone about things differently if it was founded in 2022?

Richard Socher:Very good question. Well, it was a different time, so it's hard to know if it was a mistake for the time or a mistake overall, but we basically allowed people to just drag and drop a bunch of Word documents or images into the browser, and we trained the entire model for you. We did some error analysis for you. We automated all the backend processing and distribution of the training. We gave you a fully deployed model with three lines of Python code that just automates everything for you, and then did continuous model evaluation. And now each of these things I just mentioned is anywhere from half a unicorn to an $8 billion company, but we tried to do all 10 of those things to get a model out into the world as one thing, and it turns out that that was not the right strategy.

A lot of companies have folks inside who say, I want to be the one who's in control of the AI model. I want to be the one AI expert who actually builds the model, but it turns out I don't really want to label the data. So maybe I'll buy from a data labeling company like Scale AI – that's an $8 billion valuation company now that has tens of thousands of people in the Philippines labeling data, a lot of it for self-driving cars, but they're also going into NLP now. And I don't really want to actually do the labeling, and I don't really want to scale up the system and deploy all these different models – I'm going to use some other open-source tool or a company. And I don't really want to analyze my experiments and build whole experimental validation frameworks – I'll use Weights & Biases to help me track all the experiments and visualize them in a beautiful way.

And I don't really want to implement the model, so Hugging Face is a very popular company now. I've been an investor since their seed round. They're going to make some big announcements very soon, and it's very, very impressive what they've done. They've basically become the GitHub of AI models, where all these different companies put their models onto Hugging Face – originally just for NLP, but now for lots of other things too.

And so then people don't really want to actually implement all the details of a paper so they just take the code from Hugging Face. And then in the end they basically take each of the hard parts of AI and they have sort of systems that do it for them. They save some money, they still feel like they're in control of the whole thing – that was clearly the better strategy. And so that's one of the things that would be different.

And then, just train larger language models.

Chris Potts:Oh, that's what I thought you might say – that it would be much more around a single massive artifact that would kind of be the model that served all customer needs.

Richard Socher:I mean, that was always my goal for MetaMind, but yeah, we spent way too much time actually working on specific problems for customers, like classifying head CT scans for intracranial hemorrhage – brain bleeds – and that was a very brutal one in the medical space, and it took a lot of resources and so on. And the medical space has all kinds of interesting and sad issues, which is: we saved about two dozen lives a week when that model was running, in the first FDA-approved trial of a deep learning classifier in the medical vision space, in late 2014, early 2015.

But it saved those lives in a way that no one wanted to pay for. And we were a startup and we had to make money, so we had to shut it down. And it's really sad. I can go into the details if you're interested, but yeah, it's interesting. Most of the time the successful AI companies are actually vertical, and they do one thing really, really well. Like, for instance, Athelas, started by Tanay Tandon, who was at Stanford too and dropped out. He was an intern of mine twice, in high school and in college. And he basically just counted red and white blood cells, but he built it fully integrated, from: here's the blood sample, there's the device. He ships the device to your home, so you can do it at home. And it sends the results to the cloud, and instead of having a person sit there and count very obviously countable red and white blood cells, he had a convolutional neural net do that automatically. And so it's a fully vertically integrated AI company.

Most of the successful companies are in that space. But now we're seeing a couple that are actually horizontal, where a lot of different AI and general companies can use them for different things. So Weights & Biases for experimental measurements and keeping track of your experiments and visualizing your experiments. Hugging Face for all kinds of modeling questions. There are lots of others to deploy these models. OpenAI, which I guess started as a nonprofit but now is a for-profit organization, doing just really large models that a lot of people can use and then fine-tune for their tasks. That is probably what I would do also these days.

Chris Potts:I don't know the details on that MetaMind case where you were saving lives, but let me ask about it, because it's a chance to talk about, maybe, your approach to investing. The tension I see there is that it sounds like you, as AI practitioners, had an AI-driven solution that did some good in the world, but what was revealed to you is that the fundamental innovation needed is not in AI but rather, for example – I'm guessing – in something like creating a market for value-based care, where people actually have incentives not to bill for care, but rather to bill for a lack of need for care, or something that's really fundamental about how healthcare works. If you made those changes, your AI solution might be at the margins of all the good that would happen systemically throughout the healthcare industry as it reoriented toward truly value-based care. Has that given you pause in terms of thinking about AI solutions as the primary thing, when you might think that what actually needs to be the focus is: what's the real-world problem? And then maybe AI is a component, but not the driving force in the company or the vision. What's your reaction to that description?

Richard Socher:Oh man. You're touching on so many interesting things. So one thing is, I guess the personal takeaway that's quick is, I love healthcare still, but it moves so slowly. I'd rather invest in it and do something else myself where I can make faster iteration progress and have other people spend two or three years just dealing with the FDA and regulatory issues.

Chris Potts:But you could say the same thing about education, right? AI tutors would be wonderful, but they're not the systemic problem with education in the U.S. Those problems lie elsewhere, and you probably won't have an impact even if you had a magnificently intelligent tutor. And transportation is probably the same way, and so forth and so on. The AI is never the actual crux of it all.

Richard Socher:That's very true. Very often the modeling is only 5% of the overall code base. And the AI model overall is just one part. To be a viable company, you have to think about the economics of the things you're automating.

Maybe I'll just quickly try to explain the thing we built. So we basically helped triage head CT scans in emergency room settings. Someone comes in and says, "Oh, I just have a headache." And they're like, "Oh, something might be wrong with the head, you're in the ER, we can bill a lot for it. We'll send you to a CT scan." You get the CT scan, but then it's maybe late at night, which is often the case, and you don't have enough radiologists on staff. You send it to a tele-radiology company, and that tele-radiology company basically gets an urgency rating for how quickly they need to return the analysis of that scan.

What we did is basically look at all the incoming CT scans. If you have an intracranial hemorrhage, then essentially, if you don't get a hole drilled into your skull to let the fluid out, it'll create so much pressure in your brain that you die. So it's a very urgent thing. If you have a brain bleed and it's growing or it's already relatively big, and you don't get surgery in the next 20 minutes, you're dead. And if you get your head CT scan analysis back half an hour later, the only thing that analysis would tell you is why that person is dead.

So we basically looked at the scan, and if we saw an intracranial hemorrhage, no matter what the service-level agreement was between the hospital and the tele-radiology company, we would put it at the very top of the queue so that person could be saved in time.

We could see how often people came in with a low urgency rating, but our AI said it was high urgency and, because of that brain bleed, needed to go to the top – and then how often the doctor actually confirmed that indeed that was true and that person had to go into emergency surgery. That happened about two dozen times a week. Those people, coming in with that low urgency rating, wouldn't have gotten their scan analysis in time.

The sad thing was that the tele-radiology company said, "We can't really ask the hospital to pay extra for this," because it's almost like saying: if you don't pay extra for the service, we're going to let people die. So we don't want to pay for that as a tele-radiology company. We're also not really saving any money, because a radiologist still has to look at the scan at the end of the day – there could be not just brain bleeds but all kinds of other things wrong with that person. So we still need to do all the same work, we don't save any costs, so we don't want to pay you for it, but maybe you can go out and sell it. It turned out we couldn't, because everyone said, "That's really helpful, but we need to know 50 other things," for which we didn't have enough data. To really fully replace the radiologist, there is not enough training data for anyone right now. So that was a lesson, and it's sad that we had to turn it off.

Chris Potts:Right. So it is the story that I was fearing, which is some systemic aspects meant that, for whatever reason, they couldn't or wouldn't pay you.

Richard Socher:That's right.

Chris Potts:And that stifled innovation, and that's a classic story in healthcare. And that's what made me think like, well you're a smart group. You should have been focused on changing the system, not developing models.

Richard Socher:That's right. Because it was a new space for us – it was our first exposure to the space – we weren't very efficient, and since then I've learned a lot and made some very good investments in the space: companies that classify retina scans, and blood measurements, and an AI hearing aid that filters out voice from a lot of background noise. With a lot of hearing aids, if you're in a restaurant, everything gets louder – people with hearing disabilities, of whom there are many millions in the U.S., often don't want to go to restaurants anymore because everything just gets amplified. This AI hearing aid actually reduces the background noise and just amplifies the main voice of the person they're having a conversation with. Those are all useful things that sometimes don't need full FDA approval, but are already very useful and can actually be sold too.

Chris Potts:Right. So just for your example there, they'll have an impact to the extent that it becomes affordable to buy without your insurance. And that will be the way in which they sidestep what would otherwise be a toxic market dynamic where there's a list of devices you can buy and you have to live with it, whether it has this AI-enabled technology as part of it or not. And there AI would be forcing a change through the back door, so to speak, in the system. That's a hopeful story in my little worldview here.

Richard Socher:That's right. And then I guess, man, you touched on a couple of other things, but I forgot what I wanted to say. But healthcare is still really interesting. I think it's good to work on things where you can iterate more quickly. I think investing is a lot of fun, but I don't encourage young people straight out of college to go into investing, because you only get true signals every two to five years. You make an investment decision, and then two to five years later you know if it's really successful or not. That's just such a slow learning cycle for getting the signal to get really good at something that it's not great to do at the beginning of your career. I think it's better to learn a lot of these lessons in other ways, like working in startups and iterating quickly on that, and then eventually use those lessons to do investing.

Chris Potts:Let me pick up on that, though, because it sounds like you are encouraging students to go right from school into the startup world. And you actually said before that that might be a way to have a large impact. And now you are in the position of an investor. So for the probably dozens of students who are going to listen to this recording and are thinking about a startup in the next few months, do you have some advice for them about what to do?

Richard Socher:Lots of advice. I'll try to restrict it to a few things. The biggest one is probably: think about your skillset and about workflows in the world right now that are sub-optimal. That workflow can be "I can't share my videos easily enough with friends," or it can be "I worked in a marketing company and I couldn't do sentiment analysis well." There are all kinds of different ways: you can learn about different industries, understand their workflows, and identify inefficiencies in those workflows that you could then make much more efficient with AI. That's one relatively simple way to identify issues and then build a successful company. And in some ways the best ones are the ones that have a narrow beachhead.

The most standard advice, in some ways, is to find a very, very narrow niche of users and a problem that you can dominate, so that everyone in that niche knows you're the right solution for it, and then you can expand from there.

I guess, unfortunately, if you want to be a general-purpose search engine company, I can't really follow my own advice on that because, even though we're focusing on developers – and maybe I can show you a couple of things on how we do that with summarization and such – it's very hard to focus on just one tiny thing if you're in search.

Chris Potts:So that was about the vision for the company and where to look for problems. What about the day-to-day? Should they hire a bunch of people quickly? Should they get an office? Should they buy servers or some dedicated hardware? What about that low-level stuff? I think that could make all the difference, right?

Richard Socher:That's true. But I feel like there the advice is so dependent on what you do and what you're working on that it's very hard to give abstract advice. We are fully remote and always were, and we have off-sites every couple of months where we all meet in person and bond and get to know each other as people. But for certain kinds of jobs, you just can't be remote. If you work in robotics or something, you've got to be close to your customers wherever they are, and you have to be able to travel to wherever the factories are that you're automating, and to actual physical warehouses, which is a space that's blowing up. I think there's finally enough capability there that that is actually a great place for automation.

Chris Potts:Yeah. I feel like, at least in my sphere, there's a bias against a startup that would need you to purchase a robot or a piece of manufacturing equipment. Everyone thinks of startups in terms of essentially websites or services. Is that an opportunity? Should people be thinking more about robots and devices for their startups? You have to take a risk up front that you don't take if it's just a website, right?

Richard Socher:It is. But, man, hardware is hard. Hardware is very hard, so you'd better have someone on the team with a lot of knowledge of and experience with hardware before you move into that space. The go-to-market is harder and iteration speeds are slower, because now you have to deal with physical devices all the time. So if you can, I still think pure software companies are easier to scale, but the bar for doing that is getting lower and lower. And so you are right: in some ways, you will have more competition.

Chris Potts:What if I need a huge team of people in the Philippines like Scale AI, to do labeling? Is that a similar risk, or a lower risk, or different risk from a robot?

Richard Socher:No, it's an execution risk because it takes a lot of people. You need really good HR processes to deal with such an army of people. But you are building a unique data asset and it is a so-called "competitive moat", because it's very hard for other people to just replicate that overnight.

Chris Potts:What if I am trying to do the thing you were doing at MetaMind, which is change the way hospitals work? Advice there?

Richard Socher:You definitely need to have a domain expert on your team, or you need to spend a lot of time in that domain yourself. In medicine, it's just very hard, because eventually you're going to have to sell to people who have a medical background, and just being able to come in with someone who's an MD-PhD, or who speaks their language, makes the sales process so much easier. It can be done. For instance, Tanay was a Stanford dropout. He doesn't have a medical degree, he's the CEO, and his company is worth $1.5 billion now. They were able to go through that, but even he eventually hired some very senior people who have a lot of experience dealing with the FDA, who worked at the FDA for several years and then went through that process.

So the goal is to get to the stage where you can hire these very senior people, because you've made enough progress yourself until then. A lot of the time with startups, as an investor, you look at risks, and I can tell you some of the major risks an investor looks at. One is founder risk: is the founder really smart, and are they not willing to give up easily? Are they thinking about things the right way? When you ask them about their ideas and criticize some of them, do they discuss them well and have good answers, rather than becoming defensive and whatnot? So that's founder risk.

Then there's co-founder risk. The vast majority of successful companies have multiple co-founders, usually two, not more than three, but also one of the main reasons companies break up early is that the co-founders just keep fighting and can't be productive together. So you have the co-founder risk.

Then you have the technical risk. Is it actually feasible to build what they say can be built?

And then you have product risk, which to some degree is actually market risk: is it something the market really wants? They can build it, they're smart, they're working together well, but is it actually something the market wants?

And then there's go-to-market risk: can they actually sell it in some reasonable, low-customer-acquisition-cost way? And then there are very high-level structural risks and legal risks: will this thing become outlawed, or will some law change such that what they're doing isn't as interesting or important anymore, because the whole space can now be disrupted in different ways? That's the high-level Zeitgeist, the large macroeconomic scale.

So, for instance, there's a company that refurbishes lumber. A lot of people want to work on carbon emissions, and it's a beautiful space, but the truth is there's a lot of wood that just gets thrown away. If you were actually able to refurbish a lot of existing lumber, that would be great for humanity. But the question is, how efficient can you make it, and how well can you sell it afterward to make it a viable business? That way you can have fewer trees cut down for the wood we all need to build houses and such, and deal with the housing crisis without having to cut down new trees and throw wood away. So, anyway, there are tons of interesting and positive things that you can do.

Personally, I give a bonus to things where I feel like, if that company succeeds, the world will be a better place. But that can also be tricky. For instance, one of my investments is in an underwater glue that was used, and can be used, to attach corals that are more robust to temperature changes onto bleached and dead coral. That was a really beautiful idea. I thought, "Oh, hopefully they'll sell some of it, but mostly the world would be better off." But now it turns out that if you have a really powerful underwater glue, based on ideas from actual mussels in the ocean, all kinds of organizations want to quickly glue stuff together, which includes submarines and other things that are much less nice to the world than bringing coral reefs back to life. So you also don't have full control of all the applications, from eyelashes to teeth. Anyway, it's an exciting space.

Dhara Yu:I want to jump in with another student question that I think is relevant, by way of our discussion about the economics of AI in some sense. The question is from Tim, and it says: computing is getting cheaper, but state-of-the-art language models are not, and we're getting to a point where no single person could really afford to train one. The question is, do you think this trend will ever reverse, or do you think one or the other will eventually peter out? Then there's a fun follow-up question, which is: when do you think we will break the Turing Test? So, a lot to cover there!

Richard Socher:All right. Turing Tests and prices.

I think, in terms of prices, they will 100% go down, in terms of inference costs. More and more organizations are working on large language models. In fact, a company called Anthropic just got another $600 million this week to train another large language model and look at the ethics of it. It was an effective-altruism-type investment by someone who made a lot of money in crypto: $600 million to train another large language model just to analyze the ethics and the biases and try to improve the state of the world that way. It's a question of how well that will work and how much it will improve the world, but it's one more example of another organization working a lot on this.

There's also Cohere, working on large language models. And it's almost guaranteed that other large companies are all working on their own very large language models after seeing the success OpenAI has had with GPT-3. Once there's more competition, it will drive the price down; it's just simple economics of supply and demand. Right now there's high demand but very low supply, and that will change over time. Very likely these will get cheaper.

Then, what we've seen with TPUs at Google is that people will build custom hardware. Originally it was just, "Well, we want to do lots of matrix multiplies, so let's try to put those onto GPUs for gaming" – literally gaming graphics cards. Now that that's worked so well and been analyzed so much, they're building chips that are even more focused: you can get rid of all the graphics stuff, the rendering and ray tracing and those parts of the GPU, and just have massive matrix multiplication and faster interconnects and so on.

So we see that with TPUs from Google. It's also how we trained CoVe and the decaNLP models back at Salesforce. Those custom hardwares will make it even cheaper and more efficient to run these large models. So my hunch is, it'll get cheaper and cheaper.
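To make "lots of matrix multiplies" concrete, here is a minimal, illustrative Python/NumPy sketch with made-up sizes (an assumption for illustration only, not anything from you.com or Google): the bulk of a transformer's work, such as computing attention scores, reduces to exactly this kind of batched matrix multiplication, which is the workload TPU-style chips are built for once the graphics hardware is stripped away.

import numpy as np

# Made-up sizes, purely for illustration.
batch, seq_len, d_model = 8, 512, 1024
x = np.random.randn(batch, seq_len, d_model).astype(np.float32)
w_q = np.random.randn(d_model, d_model).astype(np.float32)
w_k = np.random.randn(d_model, d_model).astype(np.float32)

q = x @ w_q                                           # (batch, seq_len, d_model)
k = x @ w_k
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_model)  # (batch, seq_len, seq_len)
# Nearly all of the floating-point operations above are matrix multiplies,
# which is what accelerators like TPUs are specialized to do.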

You can already see this on you.com: if you search for "how to write well," it'll actually give you free access to having a whole essay or blog post written about something. And if you want, I can share my screen and show you really quick.

Chris Potts:Well, wait, Richard, let's do some you.com stuff. I have a bunch of things I want to learn and you can show us, but what did Dhara mean by "break the Turing Test"? You mean show that it's unworkable as an assessment, or show that we can fool humans? Because the second happened a long time ago!

Richard Socher:Exactly. Basically, the Turing Test is a bad test and we should stop thinking about it. The Turing Test, in its definition of "can you distinguish whether this is a human or a robot," has been failed in both directions before. I think almost a decade ago, someone already beat the Turing Test by pretending to be a 10-year-old boy from Ukraine.

Chris Potts:Eugene Goostman.

Richard Socher:There you go. And you probably know this better than I do, but basically people thought: probably a human – it makes silly jokes, it's a kid, it doesn't really know that much, and its English isn't great, but it's definitely a human. And then, on the other hand, there are professors of Shakespeare and old English literature who have been deemed to be robots, because no human could possibly know that many arcane details about Shakespeare. So the Turing Test is, I think, deprecated. What will continue to be really hard for AI is to write long-form, interesting texts like novels, because there just isn't as much training data for novels, and you want a very long sequence that makes sense, that has a longer arc to the story, that has temporal consistency and all of that.

And AI, with all these models, is just not very good at keeping state, if you will, at having a very solid logical memory and making those connections. This is another one of the big barriers to AGI: a system that has both logic (set theory, higher-level things like first-order logic, and so on) and all the fuzzy, probabilistic, statistical reasoning that we're used to now from these models, and that is still trainable. Once we can combine both of those, we'll have something. It's kind of crazy: if you ask GPT what's 563 times 500,365, it will not know the answer, right? But your calculator can do that, and GPT, despite having billions of parameters and doing billions of computations and multiplications, will not give you that answer. It can do 15 times 35 – it can do the small numbers that it's seen in the training data – and it can generalize a little bit from there, but it hasn't really understood math, or even multiplication or addition of larger numbers, despite doing billions of computations and floating point operations. And so, once we combine these two, I think we can get to even more interesting Turing tests.
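As a minimal sketch of that hybrid idea (hypothetical and simplified; ask_language_model is a placeholder stand-in, not a real API), one could route exact arithmetic to ordinary symbolic computation and leave everything fuzzy to the model:

import re

def ask_language_model(question: str) -> str:
    # Placeholder for a real model call, so the sketch is self-contained.
    return "(model-generated answer)"

def answer(question: str) -> str:
    # If the question is a plain multiplication, compute it exactly.
    match = re.search(r"(\d[\d,]*)\s*(?:times|x|\*)\s*(\d[\d,]*)", question)
    if match:
        a, b = (int(g.replace(",", "")) for g in match.groups())
        return str(a * b)
    # Otherwise, fall back to the fuzzy, statistical language model.
    return ask_language_model(question)

print(answer("what's 563 times 500,365"))  # exact answer: 281705495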

You see the same thing with DALL-E and compositional things like "put a red ball onto a black square" or something. And it's just all over the place and it's not always the right composition. So, that is going to be a much simpler Turing test.

Chris Potts:One more practical question: for you.com, students right now are starting to do their lit reviews for the course, for their final projects. Can they go to you.com and just have it write their lit reviews for them?

Richard Socher:I hate to tell you that, but yes.

Chris Potts:No. I mean, I can imagine this leading to some gains, because now we can ask them to read and report on 30 papers as opposed to 7.

Richard Socher:Yeah!

Chris Potts:Do you want to talk about you.com a little bit? I'm curious to know what your thinking is. I wanted to know how you got the domain, you.com. A lot of people have asked me to ask you that question and my follow-up question was, "What, you couldn't get the Twitter handle, you know: @you?" But yeah, how'd you get the domain?

Richard Socher:So, this all goes online, so I'll do a relatively short version. Marc Benioff owned the domain. He bought a lot of really smart, cool domains, because he saw in the late nineties that the Internet was going to be really big and it made sense to get a lot of useful domains, like "desk" and other English nouns, including personal pronouns like "you." He actually gave it to me, but then MetaMind became a sort of enterprise SaaS, B2B company, and I felt like it was kind of a waste of a really good consumer URL on an enterprise company like that, which sells B2B AI-platform capabilities. So I gave it back to him and, fortunately, he didn't give it to anyone until we started. We actually started as SuSea, because we didn't yet have the domain: we hadn't yet raised a round, and the domain was kind of part of the round.

Chris Potts:Oh, that's cool. So you had it before, gave it up, and then got it again. But what about this: @you on Twitter is, well, the place you go for all your meme response needs. It's not providing the useful things you're describing, just memes, but you couldn't get that one?

Richard Socher:Yeah. We had to use @YouSearchEngine because unfortunately, owning the URL does not mean that you will own the Twitter handle of said URL. So yeah. Different systems.

Chris Potts:No, of course. I just thought this person might be able to give it up, because it looks like they're not using it. But @YouSearchEngine is the actual handle. We'll spread the word.

Richard Socher:Yeah. Thank you.

Chris Potts:And then obviously for this class, like, what's some cool NLU stuff that it can do or that's happening under the hood?

Richard Socher:I'm very happy to chat about that. So, let me show you what it is. I just threw in something that's maybe relevant. I was saying that if you search for "how to write well," we have this fun little "you write" thing, which is a massive large language model, and it'll write you a little paragraph about whatever it is that you want. You can all go and try it out yourself, and it's pretty neat. I actually used it myself: I have this little castle in the middle of nowhere near Death Valley National Park, and I'm not very good with marketing, blah blah. So I just asked it to write about a castle in the middle of nowhere with some fun off-roading and stargazing, and it wrote a nice three paragraphs, and I copied them almost as-is. I think I changed one adjective and that was it.

So, that's the very obvious one. Then we're also getting paraphrasing – that's coming out soon. So, if you have a paragraph and you want it rephrased, you can say, like, make it more professional or make sure the grammar is good, and it'll rephrase it in various different ways. So, that's one. One thing that we're very focused on is actually coding. So here, if you want to work with CSS flex or something, we have all these little code snippets and there's a little copy-and-paste button here, and then, boom, you just saved yourself a ton of time by finding different code snippets that will help you.

It includes things like Stack Overflow, like "how to sort a dictionary by value" or something like that. Then you see code snippets here, and you can go through them quickly and see different possible answers for your coding questions. We also have a large code-completion model here that will basically write the entire function for you. So, it's also a large language model, but it will write basically any function you want. I can say, like, give me a Fibonacci function, and then it'll just write three versions.

Chris Potts:It's not efficient though!

Richard Socher:Yeah, this one is maybe not the most efficient one.

Chris Potts:Yeah. Get it to do an iterative one!

Richard Socher:Yeah. Let's see. So, it's three different ones, and it uses different parameters – higher temperature and stuff like that – and some are larger models.

Chris Potts:Oh, I noticed it's putting the first line of code on that first line. Is that the same thing that vexes all these language models, about the newline being so significant in your prompt?

Richard Socher:Yeah. You're right.

Chris Potts:That's funny. Well, I can edit that myself, if someone assigns this to me on my homework.

Richard Socher:You can play around with it, and you can ask for, like, an iterative version or something.
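For reference, and only as an illustration of the kind of completions being discussed (not actual you.com output), here are the Stack Overflow-style dictionary snippet mentioned earlier plus the two Fibonacci variants, the naive recursive one on screen and the iterative one Chris asks for, sketched in Python:

# Sorting a dictionary by value:
scores = {"alice": 3, "bob": 1, "carol": 2}
sorted_by_value = dict(sorted(scores.items(), key=lambda item: item[1]))
print(sorted_by_value)  # {'bob': 1, 'carol': 2, 'alice': 3}

# Naive recursive Fibonacci: simple but exponential time.
def fib_recursive(n: int) -> int:
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

# Iterative Fibonacci: linear time, constant space.
def fib_iterative(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib_recursive(10), fib_iterative(10))  # 55 55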

Then we actually have a bunch of summarization too. If you look for, like, "best headphones" – that's kind of a fun, hard, interesting NLP project – we basically pull out the main things, like the specs or a summary of the different, in this case, headphones. It works for lots of other product categories too, and from all these different sources, and it just helps you quickly get to the facts and then make that decision. So, that's kind of fun.

Chris Potts:Is there a big language model or set of them behind these separate capabilities or what?

Richard Socher:The problem with language models is that they are not the best when it comes to veracity and just factual correctness, right? They will make stuff up sometimes. If you say, "Oh, write me an article about how Hillary Clinton won the election," it'll write you that article, and it doesn't matter whether it happened or not. So, unfortunately, since our values are trust, facts, and kindness, it's kind of hard to use language models as-is and just let them loose on the main search results. They're also usually too slow, so you have to precompute some things, and so on. So, it's more complex than just a large language model: you have to generate these sorts of structured outputs where you have just the bottom line and the pros and cons in this kind of format.

We're a bunch of nerdy people and love coding and AI, of course. So we also have an arXiv app: if you look for "attention is all you need," we have the papers, and for every paper on arXiv that has open-source implementations, we have links to all the different code packages that implement these various papers. We have a Quora app. Of course Wikipedia – lots of cool stuff to learn there. A lot of people love Reddit.

I think, in general, a lot of times AI automates something, and people first enjoyed that automation, but now more and more people want to have control over the AI. So one thing that I think is actually kind of important philosophically – and we're working on making it better in the UI – is that you can actually say which kinds of things you want to see more of. If I want to see Britannica or other encyclopedias more, or less, I can tell the model that. I want to see Instagram more than TikTok, or whatever. You can give that to the AI, and then the ranker will actually be modified. That is actually highly nontrivial to make work, with all the other NLP that we have to do.
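A simplified sketch of what such source preferences could look like under the hood (an assumption for illustration, not you.com's actual ranker): multiply each result's base relevance score by a per-source weight the user has set, then re-sort.

from urllib.parse import urlparse

# Hypothetical user-chosen weights: >1 boosts a source, <1 demotes it.
preferences = {"britannica.com": 1.5, "tiktok.com": 0.5}

def rerank(results, preferences):
    """results: list of (url, base_score) pairs from an underlying ranker."""
    def adjusted(result):
        url, base_score = result
        domain = urlparse(url).netloc.removeprefix("www.")
        return base_score * preferences.get(domain, 1.0)
    return sorted(results, key=adjusted, reverse=True)

results = [("https://www.tiktok.com/@some-clip", 0.9),
           ("https://www.britannica.com/topic/search-engine", 0.7)]
print(rerank(results, preferences))  # Britannica now ranks first (1.05 vs 0.45)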

So, if you do simple things like, I don't know, "directions from LA to San Francisco," we have to do all the slot filling to extract which of these is the from-location and which is the to-location, and then put that directly into an app. So there's a ton of NLP under the hood in a lot of these.
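As a toy illustration of that slot filling (a real system would use a trained NLU model, not a regex), the from/to slots of a directions query could be pulled out like this:

import re

def extract_directions_slots(query: str):
    # Match "directions from <X> to <Y>" and capture the two locations.
    match = re.search(r"directions\s+from\s+(.+?)\s+to\s+(.+)$", query, re.IGNORECASE)
    if not match:
        return None
    return {"from": match.group(1), "to": match.group(2)}

print(extract_directions_slots("directions from LA to San Francisco"))
# {'from': 'LA', 'to': 'San Francisco'}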

There's a massive GitHub app where you can see issues and things too. It's very helpful for developers, and there's a lot of cool stuff to learn – good for high-level exploration. We also have a ton of really cool features like PubMed and bioRxiv papers.

Chris Potts:So, non-cynically, Richard, if you have to write a lit review for a paper, would you use you.com in a particular way to help you get that thing done?

Richard Socher:Yeah, I think the way I would use this as a student is probably like: you want to obviously learn a little bit, but if you kind of don't know how to start, and you want to get something on the page, just to get you going, then maybe I would see what it says here and if it's good, then you can use it, and if not, then maybe not. Like: "review for the famous attention is all you need paper".

Chris Potts:Yeah. Cool.

Richard Socher:I don't know. Then you can say, like, "Well, see here – hopefully you look at this and know that that is not the author."

Chris Potts:But is there a more structured thing you could do? You were up- and down-voting different sources, which is kind of like our guidance about using the ACL Anthology as a starting point.

Richard Socher:Sorry, this actually is very good. It just had a problem with that paper title – it doesn't really know the paper title – but this part here, about it being like a fully connected neural network, is a good place to start. Other than that, in terms of the papers, of course this is very helpful, because you can find the papers, you can find the code implementations of those, and you can play around with them. If you want to write your own code, that's also very helpful, and you can see what people are asking about it on Quora and how people talk about these papers on Reddit. So it is a very helpful way to really get a sense of a new topic, and then even potentially have a first version to get past this kind of blank-page problem of getting started.

Chris Potts:Cool. So, I would encourage students to give it a try and see if it can lead them to parts of the literature that they wouldn't discover if they were just doing something traditional.

We have another student question, then I've got to ask you about your hobbies. Do you have time for that, Richard?

Richard Socher:Let's do it.

Chris Potts:Dhara, what's the student question?

Dhara Yu:Yeah. So, we have a question from Raul. Have you thought about pivoting a part of you.com to help consumers make more informed decisions about consumer goods purchases?

Richard Socher:Good question. Yeah, we have a lot of features around that – the best-headphones one was a good example, where we have reviews and we help you summarize those reviews and just tell you the main pros and cons of different products. And we have a lot of other ideas for it. The shopping vertical is kind of tricky. We actually implemented some of it, and then the main problem was that people said, "Oh, you just have all these ads and sponsored results," and they skipped everything, and we were like, "Wait, we don't have any ads on the site right now, and there's no sponsored anything!" We just try to do the right thing, but Google kind of poisoned the well. Google has so many ads, so much sponsored content, and so many SEO microsites that really just want to get you to go to Amazon with their referral link for a 24-hour cookie, that people have just stopped trusting their search engine for a lot of searches. They just assumed that our summaries were ads for those things. And we were like, man, we got the worst of both worlds: neither are we making money with ads, nor do people get that it's just useful content. So we kind of stopped and are rethinking shopping a little bit now. It's certainly clear that the way Google does it does not resonate with people, but we do think it's a very useful thing, and summarization can be incredibly powerful.

It's still kind of crazy: I was buying a walking treadmill for my standing desk, so I can be in walking meetings and still be home with solid WiFi, because Menlo Park has no good internet for some reason, in between Palo Alto and Menlo Park. But it's so hard to find good things, and to filter out the fake reviews while going through it. And then there are some things where you're just like, "Clearly this is a fake product": the picture of the person walking on the treadmill looks like she's floating over it, there's clearly bad Photoshop, and it has 300 super-positive reviews. Your Spidey senses go off, but it's hard to find the training data for all of that. There are AI companies for that, and it's certainly very interesting training data to help filter out fake reviews, and more around that will be coming as well.

Chris Potts:Would that suffice to win back trust from consumers who are jaded because of Google?

Richard Socher:To at least do more and filter out more fake reviews. Because first they're jaded and they went to Amazon. Now they're getting jaded, because Amazon is making tons of money with advertising too, and they have tons of fake reviews, and they're clearly not interested enough in solving that problem. So, I do think there's consumer demand and interest in getting something more trustworthy, which is why Wirecutter is so popular, but Wirecutter is moving behind a paywall. So, there's no other good alternative. So, I think it's an interesting space, but you have to have some clever new approaches to it, that are actually heavily based on NLP and summarization.

Chris Potts:Well, are they? Because the Wirecutter is just all human curation, right?

Richard Socher:They currently are, but that doesn't scale very well. Yeah.

Chris Potts:Doesn't scale, yeah.

I have to ask you about your hobbies. I'll kick myself if I don't. So, I've seen lots of videos on Twitter and on the internet of you flying around with a ceiling fan on your back. It looks fantastically dangerous. What is this all about?

Richard Socher:So, if anyone is interested, I'm happy to introduce you to some good instructors, but it turns out you can fly. Do people see my screen?

Chris Potts:Yeah.

Richard Socher:All right. So, this is me flying along the Colorado River. You can just fly in the most beautiful places, and you have this propeller backpack, which looks a little bit like this. Here's what it looks like: I'm going to land on this tiny little island in a lake in central California, just for a second, and then fly off again. And you can basically fly almost anywhere; the U.S. is quite unrestricted. Here I'm in Slovenia – a beautiful old castle – and you basically have a propeller backpack and a paraglider over your head and can go almost anywhere. Here we're flying close to Las Vegas with a friend – it's a little bit of an optical illusion, actually.

Chris Potts:I was going to ask, can you commute from your house in the Santa Cruz Hills to Menlo Park with this?

Richard Socher:It is a weather dependent sport. So, I wouldn't like rely on it for commute, because if the weather is not right – there's too much wind or it's raining or there's a temporary flight restriction because the president's coming in or something like that – then you can't commute.

Chris Potts:Is that the only restriction though? So, in principle weather good, no special restrictions, you could commute?

Richard Socher:You also can't fly over congested city areas.

Chris Potts:Oh I see. Okay.

Richard Socher:So, you would have to land always on the outskirts of it.

I actually got fully tandem-instructor certified, so I can bring friends with me in a tandem setup. That's really fun, and if you ever want to fly, let me know. I enjoy it so, so much, and I'm happy to bring people into the sport. This was just one of the many incredible moments: you can put your camera into the harness so it kind of points down at your seat, and then you just see these incredibly beautiful moments. That is paramotoring. I have a bunch of other hobbies, but this has kind of taken over most of what I'm doing. This is in central California.

Chris Potts:Is it dangerous?

Richard Socher:So, it's kind of like riding a motorcycle, like you can try to make it quite safe, but you can also make it very dangerous if you want.

Chris Potts:I'm sure!

Richard Socher:So, here, for instance, right now is not too dangerous. I'm very high up. Actually, the higher you are, the safer it is, because if anything goes wrong, like your motor dies, then you can glide. If my motor died right now, I could just land right here on this little island, and I'd have to get helicoptered out, which is doable if you have friends with you, a rescue beacon on you, and things like that.

Maybe I'll show you – this is probably one of my least safe moments. Here I'm flying over this incredibly beautiful canyon, but I'm so low that if my motor died right here, I would just fall into the water and go down this waterfall. So those were a few seconds where an engine failure, or some other kind of failure, like wind I didn't feel out and correct for, would have resulted in potential death.

So, generally, actually, half of paramotoring deaths are drownings. People get too close to the water, fall in, and don't have so-called power floats on them – little CO2 cartridges with water sensors that inflate. And so they drown with hundreds of lines around them and, like, a 40-pound anchor that pulls them down through the water. Very preventable, but people often don't do it. And the closer you are to the ground, the more dangerous it is, but it also gets a little bit more fun. So, it's kind of like motorcycles, but there are no road signs: you have to read the signs of the wind and the forecast and everything yourself, in order to really know what you're doing. And don't fly like this when you start – you just go high, you learn everything, you make sure you trust your gear and all of that, and then maybe you can do these kinds of things later.

Chris Potts:Well, let's end in a less scary place. I was also looking at your website, and I saw that you've been visiting lots of U.S. National Parks, which I really think are treasures. What's the number one park on your list of parks still to visit?

Richard Socher:Still to visit?

Chris Potts:Yeah.

Richard Socher:I would maybe throw out American Samoa National Park, because it's so hard to get to. It's closer to Fiji than it is to any other U.S. territory, but it seems like a really beautiful park with lots of natural beauty. Other than that, I'd probably go up to Alaska and start exploring some of those very large National Parks out there.

Chris Potts:What about an underrated one that you have visited?

Richard Socher:Underrated? I think the Great Sand Dunes National Park was really epic. It has the largest sand dune in the United States, over 600 feet high. It's very windy and the clouds are constantly rolling in and creating very beautiful patterns, and I had not known about it before at all.

There's also White Sands National Park in New Mexico, which I found very beautiful. It just became a National Park a couple years ago. Just massive, endless white sand dunes of gypsum sand.

Chris Potts:Wonderful. Thank you so much Richard!

Richard Socher:...And then of course the best in general: you got to go to Zion National Park and in general the Moab, Utah area, those two.

Chris Potts:Oh yeah. I want to go to Zion. If that's underrated, then I really need to go to Zion!

Richard Socher:It's actually very much not underrated. It's a very popular National Park, one of the most popular ones, but for a reason, and it's still worth going.

Chris Potts:Wonderful. Well thank you so much, Richard. Thank you, Dhara. Thank you, all students. Yeah, this was great fun. I really appreciate it.