Podcast episode: Omar Khattab

April 25, 2022

With Chris Potts and Aasavari Kakne

Pronouncing "ColBERT", the origins of ColBERT, doing NLP from an IR perspective, how getting "scooped" can be productive, OpenQA and related tasks, PhD journeys, why even retrieval plus attention is not all you need, multilingual knowledge-intensive NLP, aiming high in research projects

Show notes

Transcript

Chris Potts:All right. Welcome everyone. It's great to see people streaming in. I'm delighted to welcome Omar Khattab. Omar is of course a member of our teaching team and a member of our community in Stanford NLP in the DAWN lab. So it feels especially special to have him as our "guest" today.

Omar is doing incredibly innovative research at the intersection of information retrieval and natural language processing. And in fact, I think of him as part of this recent resurgence of efforts that are truly reuniting the fields of NLP and information retrieval, especially around knowledge intensive tasks that we might want to do over, for example, the entire web. It's a thriving area and we're going to hear about all the exciting research that Omar has been doing along those lines. And in addition, I hope to learn a little bit about Omar's past. Omar is my student with Matei Zaharia, but I feel like I don't know very much about him, so this is my chance to learn a little bit more. So, welcome Omar!

Omar Khattab:Thanks, Chris and I'm really delighted, by the way, to do this with you and with Aasavari.

Chris Potts:I think this is great.

I want to start, when we interviewed Douwe Kiela, we started with this question of pronunciation of his name, and I would like to start in a similar place with you. So give me the definitive answer. Is it col-BAIR or Col-BERT?

Omar Khattab:That was an unexpected question. I thought it was my name.

Chris Potts:You can comment if you want to on how I pronounced your last name. I was thinking about that.

Omar Khattab:I think most people say Col-BERT. I definitely say Col-BAIR and you and Matei, I think we all say Col-BAIR. But I've never met anyone who hasn't interacted with us frequently who says Col-BAIR. So everyone says Col-BERT. And so I feel like it's wrong to call either of them incorrect.

Chris Potts:Okay. I've always said Col-BAIR. I thought that was part of the joke. And in fact, I think it's still the case that if you Google "colbert paper", one of the first things to pop up is a video of Stephen Colbert throwing balled up paper into a waste paper basket. He's better than Obama at this, even though Obama is a good basketball player. So everyone should Google "colbert paper" and then watch or read whatever comes up.

Omar Khattab:Well, everyone should Google "colbert paper" and click on the paper because Google will learn from your clicks. That's it.

Chris Potts:You want to take over the space from Stephen. I grant you though that ColBERT makes more sense.

Omar Khattab:More compositional, yeah.

Chris Potts:It's certainly the way it's written.

My serious question about this though that I've never really gotten a chance to ask you is: where's the spark behind the ColBERT model? So maybe you could do two things – just tell us a little bit about it, but also tell us where the ideas came from. Because for me, you and Matei approached me with this idea basically fully formed, and I think maybe you already accepted at SIGIR, and then we started talking about NLP, but I didn't get to participate in the exciting moment where you had this intellectual breakthrough.

Omar Khattab:Definitely, but I don't think you're giving yourself enough credit for this particular project, Chris, because we approached you when it was still like, "We have great results, but it's still very early stage". And I think your encouragement there and your reaction – it made a difference in how we felt about it, because we hadn't communicated to anyone at that time. So getting someone who's core in the NLP community – and obviously Matei and I, at that time we weren't really doing NLP research – so that was definitely important. SIGIR came several months later.

Chris Potts:Okay, well, I'm happy to take all the credit.

Omar Khattab:[Laughs]. I guess most of our students probably have watched the lectures on this by now. But, in a nutshell, ColBERT is a scalable retrieval model that uses encoders that are Transformers, usually BERT, and it achieves really high precision, similar to what we achieve if we were to call BERT on every single document with our query, but it also achieves scalability and very high recall because it decomposes interactions between queries and documents into very quick and scalable comparisons between matrices, as opposed to multi-layer interactions with Transformers.

So it's orders of magnitude faster, but the striking thing to us was that it really does preserve quality and that really opens up a lot of possibilities. So I worked on ColBERT with Matei in what was really the perfect time. It was a very lucky time. This was fall 2019. So that was my first quarter here. I keep my slides, like most PhD students do, so I still have the slides, and I'm very nostalgic about that period because it was my first few months here and it was also pre-pandemic Stanford.

Chris Potts:I remember, we met in person! It seemed like such a normal thing!

Omar Khattab:Exactly. We only met twice in person until last year or something. So I might ramble too much here because these were very fun days.

A few months before we started working on this, there was a very quick transformation in IR, where people realized that, hey, this stuff is happening in NLP with BERT and Transformers, and it could be applied to our stuff. And there was this new MS MARCO benchmark that had just then started doing IR. It was mainly question answering before that. And all of a sudden, you have very large datasets and very large pre-trained models.

And going from BM25 as the lead of everything – BM25 being a 1995 model – to, wait, BERT is twice as good. And twice as good is just a really big thing to achieve in a few months. It was very striking to everyone.

So striking, in fact, that as someone who looked at IR efficiency during undergrad, all of a sudden, we went from really, as a community, caring that our models run in nine milliseconds per query to, well, it's nine seconds, but the quality is really good!

So that's something that motivated Matei and myself to think hard about this. Can we get that quality at a much, much lower cost? And, in particular, can we do that through some offline indexing? Because you're given a corpus, the documents are there. They're probably not going to change very often. So could we just pre-compute some of the magic (as it seemed at the time) behind BERT and preserve as much as possible of that quality?

Chris Potts:Well, help me out with the chronology here. Because I remember that Google made an announcement that BERT was supercharging Google search now. And maybe I was just inattentive, but I was confused because it sounded like something that could not possibly happen at the speeds that Google needs things to happen. And I think what I was assuming is that they were doing a full cross-encoding of documents and queries, and I thought maybe they're using a distilled model or something, but even then I couldn't quite figure out what they meant, but maybe you knew more. Was that announcement an impetus or did that come after?

Omar Khattab:So things were moving so fast. I think Bing announced they're using their stuff in late October. So that was 2019.

Chris Potts:2019?

Omar Khattab:2019. And Google was in November. So we were already underway. And I think we already had ColBERT by the time that Google made their announcement. But what exactly these folks are doing was not clear at the time. I think, since then, there's more information that's been released publicly, and obviously it's been evolving very fast.

I think one thing to say about IR is that because you're just ranking, there's always this idea of telescoping, where, if you have a very expensive model, well, maybe you can re-rank the top five results instead of re-ranking more and use that to control your costs. But you also trade away a lot of the quality that way. So it's not super clear what Google was doing at the time, but it's definitely clear that many of the large players in this space now have much more innovative things than the standard "let's just cross-encode things" by now, by 2022.

Chris Potts:But in the ColBERT paper – and I remember, I think I saw the diagrams in the fall of 2019 – you have that nice panel that's full cross-encoding of documents and queries, which looks completely intractable in every dimension. You have a middle ground, which is separate encoding of documents and queries into vectors. And I could see how that's going to support full offline indexing, and then you just do a single vector comparison. And then there's ColBERT – ColBERT doing its token level comparisons with MaxSim. Where did that framework come from? Was the middle one, the DPR-style model, already in the literature?

Omar Khattab:No. So DPR and ColBERT both arrived on arXiv around mid-April. ColBERT was there as an accepted SIGIR paper. DPR was there as a preprint before it went to EMNLP. So DPR, that was a striking thing, that DPR-style models based on BERT and Transformers didn't really hit IR until around or after ColBERT. And so what we were really engaging with at that time in terms of efficient models was different.

There's a really nice paper from Microsoft. I think Bhaskar Mitra is the first author. It looked at this relatively, I would say, unrealistic assumption, but it was still very, very exciting: if you could take BERT with every document and give it each word in your vocabulary – the vocabulary could be huge – and, for every word, predict a score against every document, then you could get really, really high quality that's maybe three, four points worse than the original BERT, but hey, that's a lot better than we could do with traditional models. And that wasn't very realistic, but it was a very clear signal that there's something in offline indexing, and I think Matei and I quickly realized that. So that was what we were engaging with at that time. It wasn't really single vector models.

Chris Potts:So was that the spark behind ColBERT, the idea that you'd do token-level things? It sounds like you'd reached that insight.

Omar Khattab:Yeah. So token-level things have been really popular in IR, but that paper was really a motivating factor for a lot of work, not just ours – that with Transformers, you really can get a lot of the quality with what they call the term-independence assumption in the paper. It was a very cool paper.

Chris Potts:That's already fascinating. Let's linger over that. What I hear you saying is that, from an NLP perspective, I think DPR looks really natural. Let's encode everything into a single summary vector and do a comparison. And that's what we're always shooting for, a representation of a full sentence or document or query. And it sounds like what you're saying is that the IR perspective defaulted toward token-level comparisons because we knew those were so successful even back to BM25. Is that right?

Omar Khattab:Yeah. That tension is very strong in IR and remains very strong. People always thought of those as representation-based models and interaction-based models, and there was always this tension. And I think people really would prefer the interaction-based ones. But when they said interactions, it meant two small convolutional layers over a small matrix that runs in a half a millisecond on a CPU! It wasn't BERT! And even then that was traditionally considered pretty expensive in the pre-BERT days. Few people bothered to even engage with that.

I don't want to make sweeping claims, but not everyone looked at that as the standard way of doing IR, and those models were also not so massively strong that everyone had to just drop what they were doing and move on. That's what BERT did to people.

So there was this brewing feeling, I think, exemplified in the independence paper, that maybe we can get BERT to pre-compute scores. Obviously it's not really feasible to go for every one of your documents and call BERT 100,000 times, once for every word in your vocabulary. But maybe, we thought, if we restrict this to only words that appear in the document, then we could leverage the Transformer architecture and get a score for every token that's in there at once, amortizing that call to BERT. And we did that, and it was pretty good. It wasn't as good as doing it with every term in the vocabulary – that's the gap that we were trying to bridge. And I spent a couple weeks just trying everything I could think of, including a hybrid of this approach with the single vector representation, because that could be more holistic and contextual.

I never really made a lot of progress over the initial solution of, let's just produce a score for every token. So to my dismay, slight dismay, at that time, CMU folks released a paper that does exactly that – it produces a score for every word in the document. It's called DeepCT. Their scores were a few points weaker than what we had at that time. But the key idea of, let's just take all the words and produce scores for all of them, was presented in a very natural and nice way by that team.
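To make the general idea concrete, here is a minimal sketch of precomputing one importance score per document token – an illustration of the approach being described, not DeepCT's or ColBERT's actual implementation; the model name and the linear scoring head are assumptions made for the example:

```python
import torch
from transformers import AutoModel, AutoTokenizer

class PerTokenScorer(torch.nn.Module):
    """Illustrative sketch: assign one scalar importance score to every document token,
    so the scores can be precomputed offline, once per document."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # A single linear head maps each contextualized token embedding to a score.
        self.score_head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, **doc_inputs):
        hidden = self.encoder(**doc_inputs).last_hidden_state  # (batch, doc_len, dim)
        return self.score_head(hidden).squeeze(-1)             # (batch, doc_len)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
doc = tokenizer("ColBERT is a late-interaction retrieval model.", return_tensors="pt")
token_scores = PerTokenScorer()(**doc)  # one BERT call, amortized over all document tokens
```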

Matei had traveled for a couple of weeks at that time. And I didn't want to share the bad news of, "Hey, we're scooped," without at least some way out of this other than, hey, our thing is slightly better, a few points better.

That was ultimately why I really like – well, I don't mind, maybe – getting slightly scooped in the early stages of a project, because it gives you that drive to think slightly above and beyond. The motivating factor there was these folks were still 9 or 10 points worse than what you'd get with standard BERT. So there was a lot left on the table, but it was clear that people are engaging with this problem and it's really gaining traction and people care. That's an encouraging signal at that stage.

So being stuck on single vector models – I tried them, they weren't very great. Multi-token scores aren't really getting all the way there. And it's like, well, the scores don't really contextualize. They don't really allow BERT to do what it wants to do in terms of capturing semantics and really breaking from term identity – a car is only a car, it is nothing else and has no relation to any other words. And the single vector models just lose all of the nice assumptions we like to make in IR, where all the structure and composition that we like about relevance can be decomposed one way or another into comparisons between terms.

So it was very natural at that stage to say, well, "what if we produce a vector for every word and maybe find an efficient way to reduce those scores so that you get a score for every document, given your query". And it was very natural with that to say, "well, it's probably not going to work, but if you were to do it in the simplest way possible, you would just say, well, for every word in your query, find the closest alignment in the document and, I don't know, just add up all of these scores and let's just see what happens." It's a good baseline at the very least.

What really struck me was that this worked from the first try really, really, really well. I think it was basically as good as the BERT precision, and I was very sure that it was just a bug in my evaluation. I ultimately probably spent months deeply feeling like this was probably some really elaborate bug that I hadn't figured out. But it wasn't a bug.

Matei was back, and I was meeting him quite soon after that, so I had to give it a cool name. And so one night I was watching Stephen Colbert and I was like-

Chris Potts:Oh, see, okay. The connection was there, Colbert is the origin.

Omar Khattab:I mean, BERT is in his name. I mean, come on, someone should do something with that!

Chris Potts:This is fascinating, Omar, because what I hear you saying is that first of all, it was really in the fusion of these two fields that you found this point of innovation – a mode of thinking from IR which was subtly different from a mode in NLP. And then bringing those two things together, that was the initial step, which it sounds like the other group had as well. But what about the late interaction part? I mean, that's right there in the name and the MaxSim thing that's associated with it. Do you remember the origins of those pieces?

Omar Khattab:Yeah. Yeah. So if you were to have interaction between two matrices, the simplest thing you could do, I claim, is just to align every word in the query. The simplest thing that you would do given some intuition for IR – and I think that intuition for IR is common, it's not really very advanced – is to say, "well, all I'm trying to say here is I want a score that corresponds to every word in my query." And the way I could do that is maybe cosine similarity with the closest vector in the document. So that's just the max. And if I have a bunch of maximum scores, well, I don't know, maybe just average or add them up, and that's MaxSim.

But it's not that thoughtless, I think. One motivating factor to not jump into convolutions or large transformations on top of BERT was – my work in undergrad taught me that maximums and summations are really easy to optimize for when you're trying to do top-K search. Because you have summation, you can decompose your search across the individual components. They're basically independent until they sum. And maximums lend themselves very nicely to ideas around sorting and clustering. And so if your entire interaction mechanism is only summations and maximums, I think you can go pretty far in terms of scaling. Not just efficiency. But when it comes to scaling to large collections, you don't even need to consider all the options because you're only looking at the maximums. So why would you bother with small things? So that's where maximum similarity – MaxSim – comes from.
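To spell out the MaxSim rule being described, here is a minimal sketch of late-interaction scoring; it is an illustration, not the ColBERT codebase, and the tensor names and shapes are assumptions for the example:

```python
import torch

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one document.

    query_embs: (num_query_tokens, dim), L2-normalized token embeddings
    doc_embs:   (num_doc_tokens, dim),   L2-normalized token embeddings
    """
    # Cosine similarity between every query token and every document token.
    sim = query_embs @ doc_embs.T                 # (num_query_tokens, num_doc_tokens)
    # For each query token, keep only its best-matching document token (the max) ...
    per_query_token_max = sim.max(dim=1).values   # (num_query_tokens,)
    # ... and add those maxima up to get the document's relevance score.
    return per_query_token_max.sum()

# Toy usage with random vectors standing in for BERT token embeddings.
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(60, 128), dim=-1)
print(maxsim_score(q, d))
```

Because the interaction is only a max followed by a sum, the document-side embeddings can be computed and indexed offline, which is the scaling point made above.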

Chris Potts:So that's really fascinating too. Simply the focus that you had on scaling is a bit unusual in the context of NLP, where that is often very low on the list of things that people are considering. And we sometimes even brag about how long things take for us – months to train and so forth. Inference takes forever. In the IR world though, it's heavily baked-in that you have to pay attention to latency and re-indexing time and maybe even some space constraints. And I feel like those pressures also were pushing you toward the ColBERT model. Is that correct?

Omar Khattab:That's true. And I think Matei and I shared that with folks in the IR community, that we really care about things that are practical. A really cool thing about IR that fascinates me about the field is that, well, a lot of systems folks out there – and there's a tremendous amount of really amazing systems research – but a lot of systems-only folks might fall into the trap of optimizing things because we want them to be fast. I come from a systems background, and a lot of machine learning folks, as you were saying, Chris, just care about accuracy on a particular set of benchmarks and that's all that matters. And maybe the larger the better. Why? Because why not, or because we can afford to do it.

And I think IR had this tension baked in, as you were saying, where it's really a trade-off between efficiency and effectiveness. And there is a user at the center of this who's waiting for a result with an information need. Because otherwise, maybe you would just off-load this to a group of librarians who would take a week and then get you the really authoritative answer. But that's not really what you want to do. You want the search system. So that's something that's really fascinating about IR. But NLP was key to obviously getting this stuff to really, really work.

Chris Potts:I remember our first in-person meeting with Matei, and you wanted to talk a lot about how linguists might help you make better use of document structure. And I think I was just obsessively asking you about this scalability issue because I think I saw in what you had presented that this was the path to actually doing question answering over more than just a sample of passages from Wikipedia. And I was starting to think, "Wow, we can probably actually achieve the vision that we've implicitly had of asking questions or doing commonsense reasoning over the Web, but not if we continue on acting like scalability never mattered." And so I was probably pushing you toward tasks – and I think that's what we did, it's like, "We got to do this OpenQA thing!"

Omar Khattab:Definitely.

Chris Potts:And we've never returned to your ideas about document structure. Maybe in the future.

Omar Khattab:We could in the future. I think it was the right call. OpenQA was definitely eye-opening, because it quickly sprawled into so many other tasks where I think bringing NLP and retrieval together goes a long way, and it seems indispensable at this stage.

Chris Potts:Just wonderful. And that was a new task. I want to actually talk about that next, but I have to rescue two other comments. So first, Siyan says in the chat, "I think being scooped to some extent does validate your research." I think that's a very mature perspective. And you said something similar Omar, which is not only does it validate, but it actually helped your research, which is some new level of enlightenment. Because I get a lot of anxiety about being scooped, as you know. Can you say a bit more about that? What's your thinking there? It sounds like it should be more widespread – getting scooped as a good thing!

Omar Khattab:When working on a project, we're always – especially if it's a hot area – we're always worried about getting scooped.

I'm not saying this is necessarily the most time-tested way of thinking – it definitely isn't – and I could change my mind – but it feels like, in every project, there is this spark of hypothesis that you have that, in hot areas, is actually a very common idea that a lot of people are thinking at the same time. Maybe up to a year prior, everyone is just thinking, when is someone going to just do that? And there is this tension early in projects where it's like, "should I just be the paper that just scratches that off the list?"

Should we be the term-independence paper – which was, by the way, really great, and I wouldn't have thought of ColBERT without it. So definitely it's something that has its place. Or should we build a more complete system that goes beyond your first intuition? And I think of scooping as breaking a tie. Obviously if you can do something great and release it before someone else you shouldn't be waiting. But scooping is like, well, "oh well they took the simpler one, I now have to go and do the above and beyond, if I can." Maybe that's one way to think about it. And it's really something to think of in retrospect. One shouldn't generally be waiting to get scooped. It's a bad idea. But it's a way to, I guess, look at the silver lining.

Chris Potts:Yeah. I really like that, which is: you're working and if at some point something appears that's in the area that you thought you were going to contribute to, that's actually just a chance to build on what they've done or go past it. It's like how knowing what the top score is on the leaderboard makes it easier to do a little bit more hill climbing in a way that can be productive.

Omar Khattab:And they've shown you their cards. So what's nicer than that? You get to beat them!

Chris Potts:Well, in the spirit of open science, you get to build on their contributions. Yeah. And I really like that. It's just like a deliberative approach to research that I think we should all adopt. Because what you said is really focused on the long term and not just what happened this month, which seems very healthy.

Let's return to this OpenQA thing though. I would like to reconstruct this a little bit too. Because I do know that was the thing that I became obsessed with after our conversation. And I think I had pushed: this should be at least our first task and we can worry about document parsing later. Because some papers had appeared, certainly one was by Danqi Chen. But in any case, it was an up-and-coming topic, OpenQA. Are you able to reconstruct this? What did we do first?

Omar Khattab:I do remember you sent us a few pointers to OpenQA literature. I think it was Danqi's paper. Maybe also the ORQA paper by folks at Google. ORQA is a paper that has connections to IR that maybe weren't really fleshed out in the paper itself. DPR made that stronger connection to IR that stuck in the collective minds of NLP people. ORQA was a great paper.

So, it was this really new space to me, and I think it was just very new in the literature overall. And we had this hypothesis that, well, for a start, you obviously want the best retriever you can get in these tasks, and thinking about that decomposition could go a long way.

One of the first experiments we tried was: if you just take the ColBERT model, trained on IR datasets, and you just throw it at the problem, I believe we were seeing noticeably better results on some benchmarks than fine-tuned retrievers in the OpenQA literature. And that started the ColBERT-QA project of, well, how can we best supervise these models, but also what do we get from just integrating stronger retrievers in OpenQA first, and how does that interact with the reader?

Chris Potts:It's really exciting because not only is it serving the scalability goal of actually plausibly having a system that could work on the Web, but it's also reducing the demands on us as people who need to create benchmarks, because you go from the SQuAD formulation, where you need a gold passage that contains the answer as a sub-string, and you need an associated question, to now in principle just needing question-answer pairs, because you're going to retrieve your passage and then hope that it contains your answer. But that means that data collection here could be much more lightweight. And conversely, you could do it over as large a collection as your retriever could support, essentially. I think it's the future of question answering tasks or at least the start of a future that is really open domain.
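For illustration, the difference in supervision being described looks roughly like this; the field names are made up for the example:

```python
# SQuAD-style reading comprehension: the gold passage is part of the annotation,
# and the answer must appear as a substring of that passage.
squad_style_example = {
    "question": "Who introduced ColBERT?",
    "context": "ColBERT was introduced by Khattab and Zaharia at SIGIR 2020 ...",
    "answer_text": "Khattab and Zaharia",  # substring of "context"
}

# OpenQA-style: only question-answer pairs; the system must retrieve its own
# evidence from a large collection, so data collection is much more lightweight.
openqa_style_example = {
    "question": "Who introduced ColBERT?",
    "answer": "Khattab and Zaharia",
}
```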

Omar Khattab:Yeah. And I think, I mean by way of joking, I think we should collectively demand that "QA" refers to OpenQA because that's what someone who hasn't really been in the NLP community for a few months or years would expect it to be – you're given a question and you should answer it. But, as you all know, currently "QA" tends to refer to single passage QA or context given QA or machine reading comprehension, basically. That was great at its time, which, because of how fast things are moving, was just five, six years ago, maybe less. But I think we're engaging with much harder problems in NLP space increasingly.

Chris Potts:Or maybe it's come full circle because the full history of this is the oldest uses of QA are like, "Here's some questions and here are some answers, what are you going to do about it?" And then the diversity of approaches is incredible, hand-built databases and other things like that, grammar engineering. So that was QA. And then it's funny that when SQuAD appeared, that was "OpenQA", because the whole point there was that you have this huge collection of Wikipedia documents. You don't have any guarantee that you're going to be querying into a structured database. And so therefore it's "open". We had to reclaim the term. And so maybe it will go full circle once again, because we're now approaching the original formulation: questions and answers and you do what it takes. It's just that now the systems work to some degree.

Omar Khattab:That's true. And I think it is actually probably close to time to going even wider. So most of the OpenQA benchmarks are focused on factoid questions or short answers that are names or dates or entities. And I think it's time to start maybe scratching the surface of how far can we go with maybe more open domain summarization where you engage with a lot of documents, you synthesize, you summarize, but you cite your sources.

And obviously, evaluation is hard. There's obviously lots of work on long form generation, but that could take on a more central role where maybe the queries are themselves a lot richer and conversational, like we see in some of the data sets out there these days like QReCC and Wizard of Wikipedia. Definitely QA can expand to include a much larger set of challenges, hopefully combined together. And one of those is multi-hop.

Chris Potts:Well, I was just going to say, where's the position of multi-hop in your thinking? Multi-hop is the case where by design, or maybe just because this can happen in the world, a question is answered only by multiple documents. So you can't just extract it as a sub-string. What is your thinking about that task?

Omar Khattab:The conventional thinking is that multi-hop is this special case where solving a task requires more than one source. I like this task, but this needs more support from the benchmarking side. And maybe we could do this, maybe someone else could do this, where multi-hop is really the default case, where a single document is not enough to answer your hard question, because most things haven't already been discussed in a single well-contained sentence somewhere. So most difficult, really engaging questions are probably challenging enough that you can't fully answer them by extracting three words from one sentence.

It is hard to come across things that are explicitly like that. So the data sets are usually slightly more artificial or slightly more human constructed. But I think the challenges that that poses are really important for us as we think of how far can we go with retrieval-based NLP systems? Because one thing that language models are just really great at – and I really like the way you've expressed this, Chris, over the months and years – is they're just really good at synthesis, that's all they do. If the retrieval-based approaches can't synthesize information across sources, then we're really stuck. We'd still have a lot of impactful applications, especially with translation and with the presence of such a large Web in specific languages like English and maybe Chinese, where people have discussed so many of the topics that are on people's minds, so you can find one source that addresses a lot of the popular questions. In other words, you can go pretty far with that. But if we really would like systems that can help you with tasks that are fairly complex beyond that, I think multi-hop capacities are necessary. And we've looked at that in Baleen.

Chris Potts:Yeah, I totally agree. I think it would be wonderful to move to a place where the benchmarks were just a bunch of hard questions with answers and you might not know which of them are multi-hop, and that might not even be a defined concept for some of these models, but in general systems would win to the extent that they could synthesize across documents in even very complicated ways that you might not describe as hops so much as just pooling information. Yeah. And I feel like that's going to happen. That's the trajectory of QA at this point.

Aasavari, are there any questions on this theme of QA from you or from the students before we switch gears?

Aasavari Kakne:I don't see any questions from the students. Although, I had one question about the interesting idea Omar mentioned, that the most groundbreaking ideas in research are generally common knowledge. If a student comes into a new field, how do they go about developing this intuition and getting familiar with all these central ideas in the field?

Omar Khattab:I think I can't remember which one of our guests before said this, but I really liked it. Maybe it was Douwe, or I could be mixing information. I need better provenance!

Chris Potts:It could be multi-hop on our guests. Maybe you're pooling all their insights.

Omar Khattab:In any case, a popular thing to say is you should pick a paper that you really like on a specific topic and look at what they've left that you'd like to contribute to – ideally something that's self-contained. I think going from there is very productive. One could think of something like ColBERT. The ColBERT paper is framed as a response to Nogueira and Cho, who built the first BERT re-ranker with full attention over the query and document together. We engage with their ideas that these are powerful encoders, alongside everyone else in the literature, but just say, "Well, hey, instead of encoding things jointly this way, if you encode them into matrices and do an interaction mechanism, it helps." So really it's a very natural thing to do in our research.

Chris Potts:Another response that I'd like to give, Aasavari, is: well, be bold. We could hear in Omar's story that he wasn't yet an NLPer. He came from an IR and systems background, and we just saw that those different perspectives on NLP problems were incredibly fruitful. And so I guess you just have to be audacious enough to say, "I'm coming at this in a new way. I might be an outsider, but these ideas seem to have value," and just move forward. And frankly, not just with Omar – I can think of lots of instances where that was true. Because I think it's a common observation in the history of science that the big ideas are about making surprising connections across fields, because often a field is in a rut and it needs someone to shake things up, but people need to be bold enough to do it.

Aasavari Kakne:That's awesome. And I think as a Masters student, I'm able to connect with it more, because we do many courses that give us a background in, say, different things, like GNNs and NLU. And sometimes it feels like, "Hey, I can make significant progress in any one area," but it's a great motivation to keep building on these things and maybe we can connect them in future.

Chris Potts:Beautiful. Yeah, I often tell students, "Look, as you move through your career, the whole world is going to push you to specialize relentlessly until by the end of your career, you do one thing very well. Fight it off as much as you can, try to read widely, experience widely, because you just never know what's going to come out of all those interactions." And Omar is, again, an example of that because he's in two labs that are oriented toward a different set of questions, but with a lot of shared goals as well. And I think that continues to produce interesting things. Right, Omar?

Omar Khattab:Absolutely. And I think your flexibility and Matei's flexibility and open mindedness about this has been acknowledged by a lot of people who have spoken to me. I think it's people's first question. Because we don't have an IR lab at Stanford. I guess the three of us are the IR lab at Stanford and also Keshav Santhanam, who is a PhD student in Matei's group. We work together. We are basically this collective IR lab and it's a really exciting place to be, I think.

I think it was Adina who said that point about picking a paper that you really like and extending that. So that's great.

Chris Potts:There's a great discussion in there toward the end, of advising styles and searching for questions. That makes sense to me, yeah.

Hey Omar, we actually have a question that's about Multi-hop QA. This is from Qi and I like this, because it's again, a new perspective. So what about the strategy of taking a question and decomposing it into some sub questions that you would query in hopes that would end up retrieving different documents and then you would pool the information. That's a different approach to the one that we've taken so far, right?

Omar Khattab:Yeah. And that approach is out there. There's a few papers that explore this. It's a very good idea. Very natural idea. I think this speaks to the point that a lot of the ideas are very natural and it's just how you get there.

One of the difficult elements of this is supervision. How do you train a model to know what questions it should generate? Usually we don't have data that would align with that. That doesn't mean it's not doable, but that is one of the reasons some researchers tend to prefer things that are more representational and vector based, because you can do more end-to-end stuff.

Certainly, we don't take an extreme view of this. Retrieval is a way where you actually move text around as opposed to representations only. And so that might be the future. It could be question decomposition and you could imagine a very rich world where things that are autoregressive, like language models or whatever they are, help you by creating candidate decomposition questions. Basically, I think the future is towards using as many of the tools at our disposal as we can. Retrieval is not all you need, to be honest, but also language models are not all you need. So you should use all your tools.

Aasavari Kakne:I think one of the very nice things we can do with this is we decompose a query into different parts and each part can retrieve its own top-K documents. And then we pool from them, which can tie into the previous point you mentioned, where we do multi-hop pooling to generate a final answer.

Omar Khattab:Yeah. That could be a start. I think one of the challenges in multi-hop that we sometimes see that makes it a little bit harder is that sometimes you have dependencies. You can't even create query number two or three or four until you've looked at the results of query one. So that's one of the ways in which it ends up being richer. But I think that would be the start. It's like, can you just use two queries and how far could you go with that? And with existing datasets having a lot of artifacts, that does take you far, I think.
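A minimal sketch of the dependency being described – each hop's query is formed only after seeing the results of the previous one; `retriever` and `reformulate` are hypothetical callables standing in for whatever retrieval and query-generation components a system might use, not a specific published method:

```python
from typing import Callable, List

def multihop_retrieve(
    question: str,
    retriever: Callable[[str, int], List[str]],    # (query, k) -> top-k passages
    reformulate: Callable[[str, List[str]], str],  # (question, evidence) -> next query
    hops: int = 2,
    k: int = 10,
) -> List[str]:
    """Iterative retrieval: later queries depend on evidence gathered in earlier hops."""
    evidence: List[str] = []
    query = question
    for _ in range(hops):
        evidence.extend(retriever(query, k))      # retrieve with the current query
        query = reformulate(question, evidence)   # condition the next hop on what we found
    return evidence
```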

Chris Potts:Another question. This one is from Imran. This is a question that has come up in discussions we've had. What's the state of thinking around using a neural IR model like ColBERT to query into a structured database where there might be a lot of metadata and other things? Do we need to do something different in those cases?

Omar Khattab:I think the emerging state of the art in the literature – we haven't looked at this in our lab – but I think there's a bunch of work from Meta AI and other folks. Maybe UniK-QA is a keyword, if you want to look up a paper. People linearize texts, they linearize tables. They convert these structures, and knowledge graphs, and tables, into texts in very, very, very simple ways. In the most straightforward way you can imagine. And they simply use that to encode things with the standard Transformers that we have. With a little bit of training, the model recognizes the structure just fine. It's back to the tension of how much structure do you need versus maybe just hoping that the Transformer generalizes. But it's always good to know that, with Transformers, we always have at least a starting baseline that you can just throw at the problem and get some traction. And then you can be more thoughtful if you need to be about a particular modeling problem. And that's better than not having a baseline.
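As a rough illustration of that kind of linearization – the exact format varies across papers, and this sketch is not the UniK-QA recipe, just the most straightforward version of the idea:

```python
def linearize_table(title: str, rows: list) -> str:
    """Flatten a table into plain text so a standard text encoder/retriever can index it.

    Each row becomes a short "column is value" sentence appended after the title.
    """
    pieces = [title + "."]
    for row in rows:  # each row: dict mapping column name -> cell value
        pieces.append(", ".join(f"{col} is {val}" for col, val in row.items()) + ".")
    return " ".join(pieces)

# Hypothetical example.
passage = linearize_table(
    "Nobel Prize in Physics",
    [{"year": "1921", "laureate": "Albert Einstein"},
     {"year": "1965", "laureate": "Richard Feynman"}],
)
# -> "Nobel Prize in Physics. year is 1921, laureate is Albert Einstein. year is 1965, ..."
```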

Chris Potts:I'm only laughing because it's such a weird world. Until so recently we used to work so hard to take our natural language text and turn it into a structured thing. And now we take our structured things and try to treat them like natural language texts, because that's somehow gotten to be better.

Omar Khattab:A lot of this is really about modularity. People have built – obviously it's a true observation – but a lot of this I think is not necessarily fundamental science as much as it is, hey, there's all this huge investment into these pre-trained models for text, not for tables. So hey, let's be opportunistic a little bit and just use what exists.

Chris Potts:That's a good point. Yeah. It is very opportunistic, but it's working.

Omar Khattab:It's working. Maybe. A little bit. Sometimes.

Chris Potts:That's right. You have to add that as a qualifier for everything that happens, that's a separate matter as well.

Aasavari, did you get any questions from students about QA and stuff or retrieval? No.

Good, so let's switch gears a little bit Omar. So let's think back. I think I have the dates right. So this would be spring, summer 2019, which is between 3 and 10 years ago. I've completely lost track. You're about to start at Stanford and you're thinking ahead: are you doing now what you expected to be doing then?

Omar Khattab:I can tell you, it's a complex yes and no. I'm definitely not doing exactly what I thought I would be doing at application time. Because we apply in the fall. So, fall 2018.

The yes comes from what I wanted to do at that time when I pitched. I had a completely systems-focused – systems for AI-focused – application. My goal there was to serve programmers. It was to build frameworks that make building AI applications easier, that are scalable, that are efficient. That's what I did during my undergrad to some degree.

In particular, I wanted to help people build search-related and NLP-related tasks. And in a way that's what we're doing, when you think of our open source code and systems. I think they're used fairly widely. And they are components that are modular, that you can plug into your system and use them. So in a way, that key element is still there. But by the time I accepted the Stanford offer, I really wanted to do something related to IR. And that was not clear, that that was going to be possible, because that's not what I applied to do. And there are no fully IR-focused faculty at Stanford. And really not at most places I had applied to – it's not as common in the US, maybe, as it is in Europe, in terms of large IR groups.

So it definitely was a great surprise to me that Matei and you, Chris, were so open to that. Taking on those challenges together was extremely rewarding.

What I wanted to do at that time, when I was accepting the offer, was build what Jimmy Lin refers to as "self-driving search engines". I think we have a slightly different understanding of what that means. I think Jimmy was thinking – this is in one of the papers, I can't remember which one – but it was more about automatic parameter tuning for what were, at that time, classical models. This was still before BERT and such things. But, in my mind, it was, well, search is everywhere. You need to search your desktop, your email, your eCommerce site, the web, fractions of the web that you cared about, Wikipedia, it's just everywhere. And folks are usually using fairly off-the-shelf algorithms that are fundamentally based on research from the '90s – early '90s. And so that's older than maybe a lot of folks here. And that just didn't sound right. So it seemed like: could we use all of the weak supervision signals, all of the advances in NLP, all of the clicks and interactions, to gradually and automatically supervise systems that work for search in any setting and keep getting better just by virtue of getting all of these user interactions? So that would be a nice self-driving search engine.

So are we working on that? We're not working on that exactly, but I think the fascinating thing is that something like ColBERT, or ColBERT V2, out of the box, without any of that self-driving and tuning, is much better than I would've hoped by the end of the PhD. Obviously a huge part of that is just huge advancements in NLP. We can borrow them. And I think it's a nod to this idea of foundation models that give you a lot of tools that you wouldn't have prior, but you still have to adapt them to your tasks and think about what exactly you'd like them to do and customize it to your goals.

Chris Potts:Right. When you applied, did you have in mind to work with Matei?

Omar Khattab:I did have in mind to work with Matei! I was really under-informed in undergrad, though. I was at CMU Qatar, and we did really good research there, I think, and it's a very active, tiny community. In terms of the number of students and the number of faculty, it's a very small program. But I think it's very strong. But being there, I obviously couldn't get the same centrality and visibility that we have here at Stanford. You get talks from all of the top folks in top schools here all the time. It's just in the middle of the action. I didn't know a lot of faculty, including at Stanford and elsewhere. I really knew my small number of faculty, particularly people whose work I engaged with in my research.

And obviously one of those folks, having worked on frameworks for efficient systems, was certainly Matei, and I knew you, Chris, from 224U, the first time you released it on YouTube, which I think was in summer 2019. And that was my introduction to BERT. I hadn't engaged with BERT before then. So it was just the perfect match for working on IR without IR faculty. We have the NLP side of things and the ML-and-scaling side of things. And I think that really enriches the ideas that we bring, as opposed to just being within the ecosystem of IR. That's not a sustainable thing because, as you work on it and maybe gain influence and influence the field, you become more of an insider, but it can be an advantage.

Chris Potts:The IR dynamic that you mentioned is interesting, because I do feel like it's true that IR left academia for a while. I think the widespread perception about that and about machine translation and to some extent speech technologies was that the systems were just too large and therefore they were only done in industry where you could have long term infrastructure and stuff. For all three of those fields, they've returned to academia if they had ever really left. And I think that just shows you that you get these cycles of different kinds of innovation happening in different environments. And so I think IR is going to see a resurgence in academia as well. Meanwhile, I think it's going to become more and more central to industry applications as well. Especially since now they can be question answering systems, which was always the dream and that dream had been forgotten for a while, it seems.

Yeah. Cool. And so say a little bit more, what was it like to apply? You were in Doha at the time as a student? What was it like to apply to US PhD programs? Did you know you wanted to come to the US or did you apply globally?

Omar Khattab:I only applied in the US. I started an application to Waterloo, in Canada, but I didn't finish it just because I felt like I had applied to enough places. And I think Waterloo is maybe slightly later or something. It's definitely a great school. I just applied to enough before that. So I only applied in the US.

Chris Potts:Did you have a plan B, because it always has a lottery aspect, these PhD applications, because the top programs get so many people wanting to apply. Did you know what you would do if you didn't get into the program you wanted to get into?

Omar Khattab:I don't know what exactly was on my mind in those days, but I hadn't really thought of a plan B.

Chris Potts:Oh, that's good.

Omar Khattab:I was just, hey, let's apply. And then my faculty were very, very reassuring. I still remember, my undergrad advisor – I worked primarily with Mohammad Hammoud for the entirety of my undergrad, for three and a half years, doing research. And I remember he told me at some point that one day, in a few years, we're going to be sitting in these same two chairs, and I'm going to tell you, would you like to go to Stanford or to Berkeley or to CMU or to MIT? And I reminded him of that. Of course, he had forgotten three years later. I reminded him of that and said, well, you lied, because I didn't get into MIT.

Chris Potts:Well, good! I'm glad!

Aasavari do you have questions about applications to programs and stuff or being in programs?

Aasavari Kakne:I don't have any questions about that, but I do have a question from students.

Chris Potts:Great.

Aasavari Kakne:They're asking more about your journey as a PhD student after you got in here and what are a few things that helped you grow most in your journey at Stanford? ... Omar?

Omar Khattab:Yeah, I'm there.

Chris Potts:He's just deep in thought.

Omar Khattab:It's a difficult question!

So before I get to that, let me say that admissions seems to have gotten so much harder since 2019. And that's something that I've noticed with folks I've helped who have really stellar applications – much better than mine was. And I honestly don't know how to engage with that. Especially in areas like NLP and other areas of AI. But I just thought that was definitely relevant.

But in terms of growth here, I think one of the hardest elements is keeping the right level of long-term versus short-term focus. You want to publish, you want to give talks, you want to maintain your code so that folks find it easy to use. You want to add documentation and a lot of these shorter term things. And I actually enjoy them, but I do feel guilty focusing too much on them compared with maybe longer-term things like what problems are really going to be defining, and do we really want to squeeze two or three extra points on this benchmark? Or should we be building the next thing instead?

Do you want to think of investment in cleaning your code, setting up your systems and infrastructure, and scheduling things the right way as a short-term thing, because it has to be done now, or as a long-term investment? Where do you fit that? And then there's dealing with working from home – when you had to work from home versus when you don't really have to. These are just questions – I didn't really offer any answers, but I think it would take me a lot more time to offer a full answer.

Chris Potts:Well, one thing there, Omar, that we've talked about and that I feel quite hopeful about for our field is that we are more and more valuing production-level code as a contribution. And I think people are getting better about citing the relevant libraries, and that is creating lots of incentives for researchers to really make their code usable, because you can see that that's a path to impact. Places like Hugging Face are, I think, facilitating this because people are uploading not only code, but also models and things like that. And as long as we keep on making sure people know they need to cite that stuff, it's going to be in everyone's interest to really work hard on the code. And I think that will push all of us forward.

When I started doing NLP, you would spend most of your time on a project implementing baselines, because there was hardly any code released, and you'd have to figure out what the original people did in order to try to reproduce their number for whatever you were trying to do. And now we all just start from a position that, of course, you download the code and it works.

So I feel excited about that, and I feel excited for example, about how much work you all have done on the ColBERT repository, because I think that's on a path to having lots of users.

Omar Khattab:I fully agree, and I think what we try to prioritize is working on projects where the code release will be relevant for a lot more than reproducing the tables in the paper.

Chris Potts:Sure.

Omar Khattab:Certainly you want people to be able to reproduce the tables in the paper, but it would be very cool to at least have a strong demo-like capacity in the model so that you can actually do something with it. And I think this applies to many tasks. It's not a small set of tasks. But when selecting the project, that's something we prioritize: is this really a realistic task that maybe someone could engage with the model on in some capacity? That motivates a lot of work on the code beyond just "run this black-box shell script, and you're going to get a table of results that should probably look like the one in the paper, and that's as much support as I'm going to give you." We'd like to go beyond that.

Chris Potts:And I also think that extends to model parameters. I mean, we should talk a little bit later about the homework, our brand new homework, but along the way, students download a ColBERT model and they get an index – they can index data with it – and it's incredibly good. It's just as empowering as when you've got a Hugging Face model. It just happens to be for retrieval. And I feel like latent right there is already another stepping stone to achieving more than you could before because now you don't have to develop your own retrieval model. It's just right there for you.

Omar Khattab:Definitely do develop better retrieval models because we care about that too!

Chris Potts:Sure. Just like we want better Transformer models and stuff, but yeah, it's lifting all boats.

Omar Khattab:Exactly.

Chris Potts:Aasavari, you did have a question that's about post-graduation, and I think it's perfect for Omar because Omar has had lots of conversations with people doing startups, people working in industry, about how to incorporate neural IR. And so maybe this is a good moment to ask that your question about post-graduation stuff, if you want.

Aasavari Kakne:Okay. As Masters students, we do get to learn a lot about new things and get inspired. What are some after-graduation avenues in industry, perhaps, that we should look to, take inspiration from, and particularly apply to? Because many times, at Stanford, most of us go ahead and apply to FAANG, and that's a great avenue, definitely. But as a Masters student, getting into the research teams is a bit of a jumping-through-the-hoops kind of thing. So if there are some startups and some research labs that already do excellent work and do accept Masters students, that should be something maybe we can consider. What are your thoughts on that?

Omar Khattab:Yeah, I think this is the right time to do all sorts of things like that because these startups are popping up all the time. We know of several. Offline, I'm happy to connect folks with them. And one of the challenges out there is, well, it's a very new field, so it's very hard to hire experts because it's a field that's emerging as we speak. And so taking the material that we have in maybe 224 seriously, I think you know more about neural IR than many folks who aren't really engaging with this. That already puts someone in a very attractive position for a lot of those startups, because they tell you, we can't actually find people. Maybe you can look at the authors of the 10 or 20 or 30 groups that are working in the area, but that's not a big enough pool for everyone. So that's one way.

Another way is to think of maybe pre-doctoral programs and internship programs at usually academic or at least otherwise research – maybe industrial research – groups. I imagine many of them are taking students, like AI2. Meta AI might be taking folks, and other places in general. But that's more of an academic route. Yeah.

Aasavari Kakne:Thank you. And sorry to go back, but we have a question about previous things we talked about: that when you are a PhD student, with submitting papers and deadlines and qualification exams, there might be a lot of mental health things that could come up there. So how did you deal with juggling multiple things, especially, I think, during the first couple of years of your PhD?

Chris Potts:Great question. Yeah.

Omar Khattab:I'm horrible at multitasking. I don't have a good answer. Still figuring this out. I think having and seeking – and I was lucky in this regard – understanding advisors is good, so that at any point you could have a conversation and scale things down or push things back a little bit when you need more time. I think that's important.

Chris Potts:Is anyone good at multitasking, Omar, or do people just – are they better at sequencing a bunch of tasks?

Omar Khattab:I think some people are worse at multitasking.

Chris Potts:I just feel like, to achieve something deeply, you have to be completely obsessed with it. Even if someone can switch between multiple things, surely they're paying a cost that they wouldn't pay if they just focused. And so I assume that people who successfully "multitask" are actually just good at sequencing things and hiding from you the fact that, for a lot of the time, they're paying attention to one particular thing, right?

Omar Khattab:Right. And maybe that's what I do sometimes, but I guess what makes it clearly not multitasking is – so, Aasavari, don't say this to Jay, our PhD program coordinator – I haven't done my quals and I need to get a bunch of waivers for coursework that I did during undergrad. So, like many other PhD students, I'm technically behind on a few of those less research-oriented milestones. But the department is flexible. So that's one way to multitask: forget about it until someone harasses you about it, and then you have to do it. [laughing]

Chris Potts:And what about, so Omar, do you even know how weird this has been, that you were here, so to speak, for the fall of 2019 and then in quarantine for some number of years, I don't know. Let's say two to five. Again, I have totally lost track of recent times. And now we're emerging, but has that been strange? Or I guess it's the only normal that you know?

Omar Khattab:Yeah, it has definitely been very, very strange. I feel very, very comfortable in the area here, but it can be hard to connect this to the Stanford that we all know. So it's interesting. I think working from home was quite difficult for me midway through the pandemic, but by now I've gotten so used to it that I'm at home right now, and I imagine many of us are. But yeah, this has been difficult, I think, for everyone – myself included.

Chris Potts:Do you have a trick then? How do you stay motivated in that context?

Omar Khattab:I think by aiming high and thinking of, well, wouldn't it be great if we can achieve this really large thing and if we can't, well, at least we shot for it.

Chris Potts:I like that! Yeah! Yeah!

Omar Khattab:I think it can be hard for me to have the motivation for the smaller things when the social element is missing. As an undergrad or as a Masters student – maybe to a lesser degree – there's this set of high expectations, and everyone is working on the same homework at roughly the same time. Everyone is submitting the same things at roughly the same time. So it's a very clear paradigm, even for the small things you should attend to. I think as a PhD student, it can be difficult to get yourself to do all the small things that you have to do, because it's hard to motivate something so small, and it's boring.

Chris Potts:I worry about myself, now that you say that, that I'm the opposite, which is: if I'm feeling a little low or isolated or whatever, I can get an immediate bit of feedback by answering a bunch of email or doing some task that I know I will complete. And then I'll feel a little bit better about my tasks, versus the scary, daunting thing of open-endedly trying to solve something where I might, after a few hours, be behind where I thought I was when I started. And so I worry about myself that I'm like, "Oh, let's just get that short-term fix of feeling like I'm making progress", and then I end up doing email all day. So that's good.

Omar Khattab:Yeah. Two different perspectives.

Chris Potts:Keep thinking big, keep thinking big.

Omar Khattab:Until the CS department is like, you really got to take your quals, man.

By the way, it's until the end of the third year. So I'm technically not late.

Chris Potts:I'm not worried Omar. We'll figure it out.

Aasavari Kakne:One last thing I'd love to add about this is that sometimes we have to do, say, three courses plus job interviews, and there is a lot of context switching. So one thing that helped me here was dividing the days of the week between tasks. So if I want to do my day work, I give it one or two days off a week, generally my weekend. And then two days for one course, two days for another course. And then one day is left for, say, miscellaneous tasks or job interviews or something like that. So that helps minimize the context switching and helps me stay sane during all that goes on. I don't know if that answers the question the student had, Siyan.

Chris Potts:Aasavari, that's how I can multitask, which is siloing everything. I can't literally do two things at once, ever. But I also want to say that you did the classic academic thing of just passing lightly over the fact that your weekends were two of the days where you were going to get a bunch of work done. But I guess that's a life we've chosen for ourselves.

Aasavari Kakne:I suppose so!

Chris Potts:Did you have another question? Go for it.

Aasavari Kakne:No. This particular question we discussed was from a student, Siyan, and I hope they got their answer.

Chris Potts:Great. Hey Omar, let me ask you first my stock question that I'm trying to ask everyone, which is: is attention all you need? You know the bet, from Sasha Rush.

Omar Khattab:Yeah.

Chris Potts:What would your bet be? Before you give me your nuanced answer, just how would you bet?

Omar Khattab:No, no, attention is not all you need.

Chris Potts:It's not. But then my second question, which I'll interleave with the first, is: is a big foundation model all you need for all your information-seeking tasks, like question answering, fact verification and stuff – just one big language model, knowledge store plus language capacity?

Omar Khattab:Very technically speaking, yes, because the term "foundation model" here is a black box. You could fit in the right things. So I'm just saying – I take that question as, "Is a big thing all you need?", and, well, if you have the right things, maybe. But I think that's not what you're going for.

So, to answer the questions with more nuance: I think "attention" here could be interpreted as the actual attention mechanism that we are familiar with. And I think it would be very strange to assume that this is going to stick around for too long, because, well, just engaging with more realistic tasks where you have to work with books or very long texts – it's so unscalable to long text that it's not even going to work. So it's a zero on a lot of benchmarks – maybe not a zero, but it's a very low baseline on a lot of tasks without creative ways of handling long-term dependencies. And once you have those, it's not attention anymore, because you have those extra creative components that you're adding.

So maybe this is a bet on the field of NLP not becoming creative enough with benchmarking, but I think we will. So attention is not here to stay as the sole mechanism – that, I think, is pretty clear. But I think you could also bring it to what you were saying, Chris, which is: is blind self-supervised training of very large models, at very large scale, with lots of data – instead of careful inductive biases, or bringing in modular components, maybe for knowledge and other capacities – just building one homogeneous model that also plays the role of the database, with layers upon layers of the same interaction mechanisms – is this what the future is, even if it's not attention?

I think that is definitely not the future. I don't even know if it's the present, to be honest. This sounds striking, but this question has been presented a few times before, so I've thought about it for a bit. And I feel like, if you care about tasks that are user-facing or are realistic in other ways – self-contained realistic; obviously a lot of tasks are realistic, but they focus on one element of the problem – if you have a holistic task that is standalone and user-facing, that requires reliable knowledge, which I think user-facing tasks generally do, and that requires a level of transparency that allows you to maybe hope to deploy it someday, or apply it in middle- or low-stakes scenarios, like typical standard search (not necessarily the high-stakes problems), then it's pretty clear to me that the state of the art is not Gopher or GPT-3 or PaLM. These are nice, really great demos that show off capacities that emerge at scale, but if you were to solve problems like this, I bet you would go to the retrieval-based NLP literature, or other literatures adjacent to that, and say, "Well, I have this task that requires engaging with knowledge in a particular way." And you could say, "Well, use RAG or ColBERT-QA or Baleen or Hindsight or whatever, or some other composition of these systems." You aren't going to use GPT-3, in 2021 or 2022.

So maybe it's an element of messaging, or just the focus of benchmarks. But I think it's important. And I think of it as a goal of our work and my PhD: it would be great to enlarge the set of scenarios where these more modular, more knowledge-focused models that glue different components together with language models – as opposed to models that simply are language models – handle more and more of the tasks we care about, and then to focus on the benchmarking side of things so that we emphasize tasks where we are solving something that has downstream value.

I don't think it should be defined by what has monetary value, but something that has downstream stand-aloneness, I think, is one way to think about it. So, for instance, sentiment is very, very, very important in the context of many downstream tasks. And I think we were, in NLP, in a world where (and you guys worked on this) it was too hard to do anything bigger than maybe just classifying the sentiment of short texts. But now we can start thinking, "Well, how can that building block help us solve, I don't know, larger tasks – maybe conversational tasks, or whatever it is where sentiment appears, or maybe analytics or literature-analysis tasks, or review-analysis tasks, where sentiment can play a central role – but you're not just feeding the model one sentence, getting back one word, and calling it a day."

Chris Potts:That's a really nice transition into this new homework that we've got, which I feel is a glimpse of the future. This is the homework that you and I put together, Omar: few-shot OpenQA. Do you want to give an overview of the task, or shall I? I'm primarily interested in your conception of this problem – how you would tackle it. But we should give an overview.

Omar Khattab:So I think OpenQA is one of those tasks that you can think of as a very, very standalone, realistic thing. You are given an open-ended question. The answer is short and relatively well defined – that much we can tell you. But otherwise, the domain is wide open. You get to search for stuff, but there are no particular topical constraints in general. And, to make it more spicy – to make it harder – because we often care about problems that arise in practical domains where, all of a sudden, you don't have huge annotated datasets with all the specific details of your use case, to simulate that, you are only allowed to use maybe a handful of training examples to supervise or to define your model.

We're saying, well, concretely, we are not going to train anything. You're going to use a large language model as this machine that can use its context to make predictions about the next bunch of tokens. And you're going to use that to learn the structure of the task and answer questions. But the twist is: we hope you will realize from this that, when you bring in search based on your queries over a large corpus of knowledge-intensive text – in this case, I think, mostly or entirely text from Wikipedia – the language model's capacity for predictive generation has a much easier and much more well-defined problem to deal with. It's no longer just "What is the most likely answer to this based on Reddit, or based on the Web?", which is very likely not the right one – even if you had the perfect oracle language model over the web, not everyone knows the answer to everything. Instead it's, "Hey, if I give you a passage and it probably has the answer, can you just extract the answer from there?" That's a much more form-oriented task, which is exactly what these models are really powerful at. And so we can do this thing that, as Chris likes to say, just seemed so impossible a few years ago, and we start to gain some traction on such a hard task with a few training examples.
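
A minimal sketch, in Python, of the retrieve-then-prompt pattern Omar is describing. The search_passages and complete arguments are hypothetical stand-ins for a ColBERT-style retriever and a language-model completion call; they are not part of any specific library.

def build_prompt(question, passages, demonstrations):
    """Assemble an in-context-learning prompt: an instruction, a few worked
    examples (the "shots"), then the retrieved evidence, then the question."""
    lines = ["Answer each question with a short answer.", ""]
    for demo_q, demo_a in demonstrations:
        lines += [f"Question: {demo_q}", f"Answer: {demo_a}", ""]
    for i, passage in enumerate(passages, 1):
        lines.append(f"Context [{i}]: {passage}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)

def answer(question, demonstrations, search_passages, complete, k=3):
    """search_passages(question, k) -> list of passage strings (frozen retriever);
    complete(prompt) -> generated text (frozen language model)."""
    passages = search_passages(question, k)
    prompt = build_prompt(question, passages, demonstrations)
    completion = complete(prompt)
    # Treat the first generated line as the short answer.
    return completion.strip().splitlines()[0]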

Chris Potts:So where the rubber hits the road for this is: you've got a large language model like GPT-3, you can prompt it, and then the first line of text it produces is going to be treated as the answer to a question that you presumably raised in the prompt. So it's really a lot about prompt construction, and the role of the retriever, in our context, is to help you find relevant stuff to put into those prompts. But it gets centered around prompt construction, and other things that you do with model scores. Do you have practical advice for students who now have to formulate prompts that do this in-context learning?

Omar Khattab:I would say experiment with various things. If you can get the language model and the retriever to interact, that would be good too. The assignment discusses things like RAG: can you marginalize across multiple passages, and leverage that as a way of ensembling, or pooling multiple copies of your model, essentially, across different pieces of knowledge? I think that could help you a decent bit.
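
A minimal sketch of the RAG-style marginalization Omar mentions: prompt the frozen language model once per retrieved passage, then weight each candidate answer by its passage's retrieval weight and sum. The score_answer argument is a hypothetical helper (returning a candidate answer and its log-probability), not a real library call.

import math
from collections import defaultdict

def marginalized_answer(question, passages, retrieval_scores, score_answer):
    """passages and retrieval_scores come from the retriever;
    score_answer(passage, question) -> (answer_string, log_probability)."""
    # Softmax the retrieval scores into weights, roughly p(passage | question).
    m = max(retrieval_scores)
    exps = [math.exp(s - m) for s in retrieval_scores]
    weights = [e / sum(exps) for e in exps]

    # p(answer) ~ sum over passages of p(passage) * p(answer | passage, question).
    votes = defaultdict(float)
    for passage, weight in zip(passages, weights):
        ans, logprob = score_answer(passage, question)
        votes[ans] += weight * math.exp(logprob)

    return max(votes, key=votes.get)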

Chris Potts:Well, there's one homework problem we almost did – we brainstormed about this, but it's almost too funny to have as a homework problem – didn't you suggest just adding something like, "Please, model, answer my question!" as the first line of the prompt?

Omar Khattab:Yeah. So instruction is another way. You could tell the model what kinds of answers you're looking for – maybe that the answers are short, say – and the larger the model, the more attention it will pay to these instructions, hopefully. Certainly an oracle language model would, but a small one, or an existing one, may not. So you've got to experiment, because they're very brittle when it comes to these specific formulations.

Chris Potts:And then clever use of the scores – that's something we do nudge students to think about. Both the scores from ColBERT and the scores from the language model, as ways to, after you've done your prompts and gotten your answers, sort through what might be the best answer. And it's fascinating that that works because it's like, "Well, why is this information helpful?" But it seems to be.
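
A minimal sketch of the score-based sorting Chris describes, combining the retriever's passage score with the language model's log-probability for each candidate answer. The dictionary fields and the alpha weight are illustrative assumptions, and in practice the two scores may need normalizing onto comparable scales first.

def rerank(candidates, alpha=0.5):
    """candidates: list of dicts like
    {"answer": str, "retrieval_score": float, "lm_logprob": float}.
    Returns the candidates sorted best-first by a weighted combination."""
    def combined(c):
        return alpha * c["retrieval_score"] + (1 - alpha) * c["lm_logprob"]
    return sorted(candidates, key=combined, reverse=True)

# For example:
# best = rerank([
#     {"answer": "Paris", "retrieval_score": 21.3, "lm_logprob": -0.4},
#     {"answer": "Lyon",  "retrieval_score": 17.9, "lm_logprob": -2.1},
# ])[0]["answer"]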

Omar Khattab:And another thing that adds to this – it does bend one meaning of "few-shot" here, but I think that's fair within our assumptions – is: can you select the examples that you're prompting the model with, select the demonstrations, based on the specific question you're trying to answer? If you can find similar ones, or ones that seem to be particularly instructive for the specific question that you have, that will help. It would also help to realize that not all prompts are created equal. You guys are given a significant number of questions. And so, even though you're forced to engage with a language model that you can't fine-tune, not all demonstrations and prompts – and not all their numbers and lengths – are created equal.

Maybe that could be for insightful reasons that would work the same way with a human that you're describing the task to. For example, if you only show numbers as outputs, it would be pretty weird if all the questions end up being about people's names. So that's something you could pay attention to. And I think it is fair game within the context of the assignment. But there's also an element where, well, the models are brittle, and they might respond to small changes in ways that you don't expect, and observing that and telling us about it in your system descriptions would also be very fascinating.
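
A minimal sketch of the question-specific demonstration selection Omar describes, using plain token overlap as a crude similarity measure; in practice the retriever itself, or any embedding model, could score the candidate demonstrations instead.

def select_demonstrations(question, pool, k=4):
    """pool: list of (demo_question, demo_answer) pairs from the small training set.
    Returns the k pairs whose questions share the most tokens with the test question."""
    q_tokens = set(question.lower().split())
    def overlap(pair):
        demo_q, _ = pair
        return len(q_tokens & set(demo_q.lower().split()))
    return sorted(pool, key=overlap, reverse=True)[:k]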

Chris Potts:Yeah. I mean, I always say this, but it seems like students don't take me seriously, but here I really truly mean it that if students do something original for this problem, it should be their final project. Because, first of all, it's just a totally open area, and whatever they did in this short time period could be expanded and deepened into something that might lead to even more insights and better results. I just feel like we could learn so much if everyone did this for their final project. And I expect to be really surprised by what happens on the leaderboard.

For the other ones, I've done them a few times. I know what's going to happen. There are always some surprises. For this one though, I just can't predict what the best systems will be like. I mean, we have our own systems, and yours are very good. (Mine are not.) Yours are very good, but I wonder if the students will even surpass those very high numbers. I can't wait to see.

Omar Khattab:I think one thing – to connect this question to the previous ones – is that it looks like you do need a language model for a lot of things. It's just, you want to have a language model rather than be a language model. And one nice thing about having this large, frozen form-manipulator that learns patterns quickly is, well, you guys are running this on your laptops, presumably; you don't really need a GPU for many of the things that you're doing – maybe not for all of them, but I believe you can certainly do most of the assignment, if not the entire thing, locally. One big thing that makes that possible is, well, the retriever is efficient enough to run on the CPU only. (We do have more efficient versions if you dig deep enough on GitHub, if you want to seek them out.) And the language model, because it's this simple, uniform thing, you get to batch all of those calls on, in this case, OpenAI's servers, and you just get to do inference for everyone at the same time. So you're only paying that small cost, as opposed to getting your own huge GPU for your own specific model and paying all the costs for that. So let's use all the tools that we have at our disposal that are productive and useful, and let's just be thoughtful about how we combine them.

Chris Potts:And it's really close to a vision we've discussed a lot, which is: I have a retriever, like Google, in my life. I'm going to find knowledge-intensive pieces of information using my retriever. And then the role of the language model is to figure out what is needed in the interaction: what is the prompt telling me about what kind of answer you seek? And it's more syntactical. And it feels just perfect to me as a division of labor across these two systems. And then the fact that they can both be frozen components, and you can nonetheless get traction – it's mind-blowing to me.

That's a nice transition also into a final topic for us. I think a good way to end, because, Omar, this is the second time you've been on our teaching team, which is really wonderful for us. And so at this point you've seen lots of teams and lots of student projects. Do you have any tips for the students on how to choose a topic and pursue it successfully in the context of this course? And then maybe beyond the course?

Omar Khattab:So Rishi and Douwe and Adina have shared topics. I know that Douwe has this long list that he shared. I think these are all very exciting. And obviously your homework projects are a good source of building blocks if you want to get a bit of a head start and hopefully build something bigger as a result. I would say, talk to us as a teaching team, and talk to us as soon as you have initial ideas. All the folks on the teaching team are going to be very helpful. And if it's not my area, I'll direct you to another person on the teaching team who's more familiar, and we can always direct you to the right person. So that's one thing: talking to us and starting early, which is now, if you can – at least in terms of brainstorming and such.

This is a delicate balance to strike, but my perspective is that there's nothing wrong with aiming high – maybe not absurdly high, but aim high – as long as you always build in milestones with your mentor, so that at any milestone you've reached, you have a paper you can submit, and we can always extend your work later accordingly. So it's fine to shoot for, I don't know, a fully fledged EMNLP paper, but definitely make sure that, at a weekly level, or at a regular cadence, there is something you can show.

Maybe this is, I don't know, breadth first or something. There's always an idea, or there's always a bunch of hypotheses. There's always a bunch of data sets. There's always a bunch of results and things that you're showing. And then you can iterate, add one more dataset, or add one more hypothesis, or add one more set of results, as opposed to building up this huge set of really thoughtful hypotheses, with literature review, and that's the paper.

Chris Potts:Totally agree. I love that process.

Hey, we got a concrete question from Raul that I think is really nice, because it draws on themes of pushing beyond English. I think it would be great for the field if we were all less focused on standard English. So what would you advise for a student who would like to use ColBERT or a similar retriever model, but over some non-English data? Is there a clear path for them toward a project for the course?

Omar Khattab:There are a few – and I can point you to them if you reach out – there are a few systems, ColBERT-X and others whose names I forget, but there's a bunch of multilingual ColBERT-inspired models out there that are already trained. And I think some of them are publicly released, including the weights and checkpoints. I think the overall paradigm for these models – some of them are very rich, but I think the basic idea is to use automatic translation. ColBERT is trained on MS MARCO in its original version, at least the original ColBERT. And there are translations of MS MARCO into many languages that are already out there as a dataset – it's called mMARCO. And so, if the language you want to work with is one of those, that's an easy way to go about this. But yeah, that sounds like something we should discuss. If you're interested, I have office hours on Wednesday, and I'd be happy to mentor a project like that if you want.

Chris Potts:Yeah. This connects with another vision I have for all of this, which is that, possibly, we could say to a student: use that multilingual MS MARCO and create, or just download, the parameters for a multilingual ColBERT. And then, if there isn't a QA dataset for your domain, just write down a bunch of questions, or something like that, that you think would be interesting about the domain that you have indexed. And I feel like that could work, as long as you don't spend too much time writing the questions that you need for development. Really, all you need is assessment data, and you might have a meaningful QA system, and you could push in whatever direction you wanted. It could be factoid. It could be something that involves very advanced, domain-specific knowledge. This could work, finally!

Omar Khattab:I think so. And I think depending on which language you'd like to engage with, there's a few datasets out there like XOR-TyDi.

Chris Potts:There's XQuAD. That's seven languages – gold data, translations of SQuAD's dev set.

Omar Khattab:Which could be interesting, because you'd have to change it from standard MRC to OpenQA, if that's what you'd like to do with a retriever. XOR-TyDi is cross-lingual, not multilingual, OpenQA, which means that you're searching across languages. So you might get a query in Japanese, but your documents are in English, or something like that. And that's also an interesting thing to consider. So there's a bunch of benchmarks. Make sure not to miss any major literature, and talk to us – I personally would be happy to help.

Chris Potts:Very cool!

Well, thank you so much, Omar. This was really rewarding. It was great to hear about your journey and about all your research. And I expect students to now flock to you for advice on doing projects with ColBERT in lots of interesting domains. I certainly hope that happens, because I think it'll push all of us forward. It's very exciting stuff. Thanks again.

Omar Khattab:Thank you.

Chris Potts:And thank you everyone. Thank you, Aasavari. This was wonderful.

Omar Khattab:Thank you, Aasavari. And thanks everyone. I'll be looking forward to any conversations we might have in office hours or otherwise.

Aasavari Kakne:Thank you, Chris. Thank you, Omar. It was a fantastic lecture.