Podcast episode: Marie-Catherine de Marneffe

November 7, 2022

With Chris Potts

Leaving Ohio, being back in Belgium, organizing NAACL 2022, reviewing at NLP scale, Universal Dependencies, and doing NLU before it was cool.

Transcript

Chris Potts:Welcome, everyone, to the CS224U podcast. Our guest today is Marie de Marneffe, currently an associate professor in linguistics at the Ohio State University, and also a professor at the Catholic University of Louvain in Belgium, and an associate researcher at the FNRS there.

Marie started doing natural language understanding during her PhD research back in 2006, long before NLU swept the field of NLP and indeed all of AI. And in this work, she has always combined deep linguistic analysis with state-of-the-art machine learning techniques. She was also part of the original Stanford Dependencies team and her research is strongly informed by cognitive science and psycholinguistics. And all these strengths come together in research that's compelling, both from a linguistic standpoint and from the point of view of developing language technologies.

So, Marie, welcome to the podcast. It's great to see you again! It's been a while. My first question: how does it feel to be back in Belgium? Are there things that now appear strange to you after spending so much time in the U.S.?

Marie de Marneffe:Hi, Chris. Thanks for the invitation. I'm very happy to be here and to see you again.

Yes, when I came back to Belgium, I felt a bit like an alien even though it was my country. And what was really weird was that I look like a Belgian, I talk like a Belgian, but I was not really a Belgian anymore. In the U.S., I'm always looked at as a foreigner. I supposedly have this charming French accent, so people know I'm not from there, and it's okay if I act weirdly. But in Belgium, people really looked at me like, "How come she cannot use the bread machine at the grocery store?", and stuff like that. So, the beginning was slightly weird but, as humans, we adapt quickly, and so now it feels like home again.

The one thing that's slightly weird still is the personal space. I kind of acquired the U.S. one, where you stand a bit farther apart from people than you do in Belgium, and maybe it's part of COVID also, I don't know. But I sometimes feel a bit oppressed because people come very close to me.

Chris Potts:Well, give me a new life skill here, what is the deal with bread machines? I'm not even sure of the context there.

Marie de Marneffe:To cut your bread, to slice your bread at the grocery store.

Chris Potts:Oh. Oh.

Marie de Marneffe:The machines look different than the ones in the U.S., and I was standing there at the grocery store having no idea where I had to open it, put the bread. And it's a slightly older lady who then says, "Let me help you, madam. I will show you how to use this."

Chris Potts:I think the only time I've encountered a slicing machine in the U.S. in a store was in some country store in Maine, or something. It seems very quaint to me in general. Of course, we have machines do this for us out of sight. So I would be flummoxed too by a bread machine.

Marie de Marneffe:Yeah, the fresher bread, I guess, you could slice it in a machine. They had those in Ohio, but somehow they look different than the Belgian ones.

Chris Potts:So I guess, you had a kind of warm-up, because you were there on a sabbatical or something for a while, and then back, and then there again.

Marie de Marneffe:Yeah, exactly.

Chris Potts:What about your kids, how are they adjusting?

Marie de Marneffe:They are adjusting well now, but at the beginning it was slightly strange to have this more coercive education system, I would say, where I suppose in the U.S. it's more positive, and even if you get things wrong, you are encouraged a bit more. In Belgium — I suppose it's better than when I was a kid — but still they're going to use red to mark stuff that's wrong.

To give you a concrete example, in Belgium, they have this concept of a calendar. It's like two pages for a week with Monday, Tuesday, Wednesday, Thursday, Friday, the weekdays, where you put what you've done during the day and what you are supposed to do for the next day. And Aliénor, our daughter, who is now 10, but she was eight when we came back, she's a bit out there, and she wrote on Friday what she was supposed to write on Thursday. And then suddenly she realized that she was missing a day. And so then she put what was supposed to be on Friday in the Thursday cell. And she put arrows to say, "Okay, this one is here. The other one is there." But the teacher really struck everything off in red saying, "What's on Thursday should be on Thursday, what's on Friday should be on Friday." Whereas I feel that in the U.S., the teacher might have thought, "Okay, she got it, she will be correct next time." And maybe a little smiley and say, "Get it right next time. Thursday goes on Thursday, Friday on Friday."

Chris Potts:Oh, this is also causing me anxiety — the idea that I would be judged by my scheduling, which I'm sure for both of us can be a little bit chaotic. I can really relate to your daughter.

Marie de Marneffe:Exactly, yeah.

Then, French conjugation is not easy, so at first they had some issues with that. And one of the things that actually was a bit of a shock when I came to the U.S. long ago was that students would give you back homework in pencil, which is something you can really not do in Belgium. You have to write with ink — an ink pen all the time. And you have to be aligned. Everything has to be written like one inch from the side of the page. And so Timothé now, still, sometimes he gets remarks like, "Handwriting is too big. Don't forget to align everything." So, somewhat different.

Chris Potts:It's hard being an elementary school kid as far as I can tell. All these things would be things I would be struggling with. Do you have a spirited defense of needing to use pen on the homework?

Marie de Marneffe:I don't know. I think it comes from way back in the days where, with a pencil, you can correct, you can fix it afterwards.

Chris Potts:Oh.

Marie de Marneffe:And I think this is something that teachers didn't want. What happens if then you fix it and then you say, "Actually, it was correct. I don't understand why you have it wrong"? I don't know, that's my idea of why you have to use ink.

Chris Potts:I love it. This is leading into my next question for you, which is, are there things that you miss about Columbus, Ohio? And I should say for the listeners that I consider myself to be a kind of adopted son of Columbus because my in-laws are from there. We used to visit Columbus all the time. And indeed, I visited Marie a few times at her charming house off in the woods somewhere. So I have a fondness for Columbus. So, surely, there must be many things you now miss about the Midwest.

Marie de Marneffe:Yeah, we do. We actually do, right? The first thing was the house that you mentioned — we really liked the house, and all that came with it, all the neighbors. And that's something I think we've never experienced anywhere. Even though we love California too, we didn't have this kind of neighborhood life that we really enjoyed for eight years in Columbus, where all the neighbors became friends and you could do all sorts of stuff. Just: you are missing a lemon for a recipe, you can go knock on the door and say like, "Can I borrow one?" "I really need a drink today and you could..." Or, "There is an emergency, can I drop my kids?" Things like that, which is something that we really miss. And then of course, work-wise, the department was really great. Stanford too, of course. But as a professor, I really had great colleagues. And I miss, a bit, the atmosphere, the efficiency of meetings.

Chris Potts:Oh, interesting.

Marie de Marneffe:I think Americans are really great at making things work in a definite amount of time that's a priori defined. And I've been struggling now in Belgium with those endless meetings.

Chris Potts:Oh, so they can just go on in a kind of open ended way? That's the issue? They don't finish on the hour sharp, necessarily?

Marie de Marneffe:First, they don't start on time. And so then, of course, they go longer. They have way too many meetings, I think. I try not to be too vocal about that, but I'm starting to be a bit vocal about that.

Chris Potts:So if you arrive at one of these meetings and you're like, "Okay, everyone, let's get down to business. It's one minute past the hour", do they all whisper that you're now like an American?

Marie de Marneffe:A bit, yeah. Like, a recent meeting, they started moving chairs, so that the room would be in a better setting, which I understand. But then I said out loud, "But the meeting was supposed to start two minutes ago and now we are moving chairs?"

Chris Potts:I love it.

Before we talk about research stuff, there was one other high level thing I wanted to get your feedback on, which was your experience of being a program chair for NAACL 2022. This sounds unbelievably stressful to me. I was wondering if it was for you. And in general, what this was like.

Marie de Marneffe:Yeah, it was slightly stressful, but, I mean, in everything there are pros and cons. But that's when I really thought often of Dan Jurafsky, who had told me, when I left Stanford, to learn to say no. And I kind of regret it, but I didn't say no. But on the other hand, it was also very rewarding. I think the most rewarding thing was the interaction with the other program chairs, who are really now friends. And then all the people involved in making NAACL a success. (I think it was a success.) So many people, all very responsive.

I knew, of course, from the name and a little bit, Dan Roth, our General Chair, but he's really a wonderful guy, always so optimistic and very calm — generating a peaceful feeling, which was really great. And so having close interactions with people like that was really, really nice. That I enjoyed a lot.

We'll not talk about ARR, right? That was slightly stressful. But overall, I think it went well. And then during the conference, it went well also. And I also think that when you take a job like that, you maybe realize, and regret, not thanking other program chairs more.

Chris Potts:I know what you mean.

Marie de Marneffe:So, I really appreciated at the end of the conference when a few people came and said, "Thank you, that must have been such a job." It felt really, first, that we were understood in a way. And I wish I had done that myself for other people in the past.

Chris Potts:Maybe past program chairs are especially likely to thank you profusely.

Marie de Marneffe:Probably, because then you realize what it means.

Chris Potts:So, as a program chair though, what's the primary responsibility? Are you more on call during the reviewing and program design phase, and then the actual event, you hand it off? Or did you also need to be very attentive onsite for NAACL?

Marie de Marneffe:I think the job probably shifted with the new modality of being hybrid because then we still had to do a lot onsite to collaborate with Underline &mdash make sure that the different parts of the onsite program and online program were okay.

So we were actually never really off the hook there. I thought it would be slightly more calm during the conference and we could enjoy the conference a bit more, but that wasn't really the case, not because of the fault of anyone. I think it's just the nature of the job now, to make sure that everything is fine. There is a lot to do. And also, we probably had a harder year with Priscilla retiring. And so we helped her replacement, who is really, really great, but of course needed a little bit more help at the beginning.

Chris Potts:And of course, Priscilla, she seemed like she was always around. I don't know how long her tenure was at the ACL, but she was sort of synonymous with the organization. She must have known everything.

Marie de Marneffe:Exactly. Exactly, right. And so replacing Priscilla is not an easy job, so kudos to Jenn for doing it.

Chris Potts:For the program selection phase, I think people would be really interested to just have that demystified a little bit. Do you mind walking us through how it works from, I guess, selecting Area Chairs and having them select reviewers? What's the whole process like up to actually selecting the program?

Marie de Marneffe:So again, things changed slightly this year because we opted to go with ARR. And so the reviewers were not chosen by us, which in the past would've been the case &mdash that the Area Chairs would choose their reviewers. And so that led to some complications. So what we opted to do, since we had opted for ARR, we chose the Senior Area Chairs, but then the rest of the hierarchy was ARR.

Chris Potts:Oh, yeah, of course. Even the Area Chairs, right? Yeah.

Marie de Marneffe:But they didn't really have areas, at the time, so that was another difficult thing. But then we were really part of ARR — in their weekly meeting, finding more reviewers, sending reminders, keeping the timeline, all of that. So we actually did a lot of work. Maybe the frustration was that we were working a lot, but we didn't have full control, whereas in the past, the Program Chairs would've had full control.

Chris Potts:Sure.

Marie de Marneffe:And so we decided, for the special theme, to have something apart, separated from ARR, because there we wanted to be able to choose our reviewers knowing that they were reviewing special theme papers. But then the interaction, putting back everything together, also led to issues, as you can imagine.

Chris Potts:Oh, because the volume of work is just tremendous at this point.

Marie de Marneffe:Yeah, it's a lot of papers, and a lot of work. But I think it was really good because we managed to get all the reviews and meta reviews on time. You accepted to do some reviews at the last minute, so thank you very much for that. And then the Senior Area Chairs give their recommendations for the papers. And then we made another pass to decide on the definite accept or reject.

Chris Potts:Right. So you come in at the final stage, synthesizing all these recommendations from the Senior Area Chairs, who've tried to process all of this information coming from the reviewing systems.

Marie de Marneffe:Exactly.

Chris Potts:I see.

Marie de Marneffe:And that's where also I think one of the plusses that I see of ARR is trying to alleviate some of the work, because often there are some papers that get rejected somewhere that are resubmitted and you actually review, multiple times, a paper that didn't change that much. And so the goal here is to try to make this more efficient. But, on the other hand, since there are a lot of reviewers, Senior Area Chairs had to deal with many more different people reviewing than what we had in the past. And so then putting everything in a balance, when you have to rank the papers that you accept in your area, is less easy than before. Does that make sense?

Chris Potts:Sure, sure. Did this whole experience give you a new perspective on reviewing? Do you feel like we need to fundamentally change the system for scientific reviewing, or just tweak it?

Marie de Marneffe:Yeah, I think it's hard. I think changes are always hard, because people are used to one thing. But I think, on paper, the idea of ARR was extremely good. I think there were some things that didn't run very smoothly in the implementation. But if the community's accepting to shift the mindset, it might work really well.

Chris Potts:One thing that's interesting to me is, I've been involved with TACL for a long time, Transactions of the ACL. I think that system works really well. It's basically just the standard reviewing system for a journal, with the twist that if your paper is accepted, you get to choose one of the ACL conferences to present at. And if left to my own devices, and I wanted to overhaul reviewing in the ACL, I would've just scaled that up. And I think the main drawback there would've been that people like you, Program Chairs, would have very little say in what the program actually looked like for the conference, because all these papers would've been accepted by TACL editors. And I assume that's why ARR didn't go that route. But the result was this kind of limbo you're in, where ARR never quite accepts your paper. You just get more scores and feedback, in the hope that someone, somewhere, will look kindly upon the work. And that's never felt optimal to me.

Marie de Marneffe:No. Exactly. I was going to mention TACL. I think TACL is working really well. And, actually, right now, recently, that's only where I've been submitting my papers, because I really like the quality of the reviews. But they don't get that many papers per cycle where... I mean, the problem... I don't want to use the word problem, but since ARR suddenly was viewed as this conference reviewing, people still submit at the deadline for the conference. And the system was hit by more than 3,000 papers, which is really a lot.

Chris Potts:Yeah.

Marie de Marneffe:That's where things are breaking.

Chris Potts:Sure. But so TACL would have a scaling problem if they just said, "Hey, this is the new reviewing system for conferences." But ARR had exactly the same problem, so no one had a solution there. I guess, mine would've been to really embrace the TACL thing. And then I think actually people would submit less on the cycle. Because when I submit to TACL, we just do it on the monthly thing without really thinking about what conference it'll end up at. And I think that does distribute the TACL load. I don't know whether they experience huge spikes timed to conferences or not, but I suspect not anything like what ARR was experiencing with those spikes.

Marie de Marneffe:Yeah, I don't know the numbers exactly. As an action editor, you don't see those numbers, for TACL, I mean. But I don't know... I think that's what I meant with: the community needs to change mindset. Then you don't submit for a conference, but you will see what gets accepted. But then, as you were mentioning earlier, the drawback is then, what's the possibility for the Program Chair to shape the program? How do you choose a special theme, for instance? Then you still need to have a track for the special theme. That could be the case actually. We could have something like that. So I think what's clear is that we need to think cleverly and outside the box about how to make the system much better. But I think ARR is trying to fix things, and we'll see how it goes for EMNLP, I suppose, and then ACL.

Chris Potts:But your NAACL might be the last conference that does only ARR. It seems like the new norm is to do both.

Marie de Marneffe:Both.

Chris Potts:And my prediction would be actually that what they evolve toward is just dropping the ARR thing entirely and doing their own reviewing.

Marie de Marneffe:Probably.

Chris Potts:Because even though it's a ton of work, it's at least work you control, and it's time delimited, and so forth. It has some nice properties. I was thinking that if they did the TACL thing, the conference organizers could still do their special theme tracks and they could do it in SoftConf in the traditional way. It would just be a much smaller job, because you would be getting only track papers, not 8,000 of them, or whatever the number is these days. It sounds workable to me.

Wait, did you mention: are you also an action editor for TACL now?

Marie de Marneffe:Yes, yes.

Chris Potts:Oh, yeah. I did that for a long, long time, but I recently stepped down because I feel like I can have a larger impact as a reviewer. And I also was finding it increasingly... well... not my favorite work, to be responding directly to my colleagues as the editor where I have to be very critical and reject a lot of work as myself. I know it has to be done, but I was finding it kind of exhausting emotionally.

Marie de Marneffe:Yeah, that's true. Yeah.

Chris Potts:Because in my time as a TACL editor, I only accepted like five papers, not because I'm harsh, but just because that seems to be the way the thing works, where most of the time you're saying work needs to be at least revised, but often rejected.

Marie de Marneffe:Yeah.

Chris Potts:How long have you been doing it, action editor?

Marie de Marneffe:I think they asked me sometime, maybe, in January. Now I don't remember exactly. Oh, no, maybe a year ago. But then I took some time off because of NAACL. I said like, "I need a break here, I cannot do this anymore." So, I think I took a few months' break, from April to September.

Chris Potts:For a long time I thought it was just the best reviewing work I could be doing. And the same thing happened when I was Area Chair for the ACL conferences. For the first three or four times I did that, it seems so great and so meaningful because it was a new kind of role and there was a lot of buy-in, like reviewers would discuss with you and you could really have an impact. You could get them to change their reviews and talk with each other, and that kind of dwindled as well. And so I actually stopped doing it. In a moment of weakness, I'm doing it for EACL, but I had kind of sworn it off because it got to the point where mostly it was just me sending messages into the void. I felt like I wasn't getting the reviewers to talk with me very much.

Marie de Marneffe:To engage, yeah. I had the same experience when I was an Area Chair for ACL, before we had the Senior Area Chairs and Area Chairs, where somehow I managed to get reviewers to be really engaged. I just wonder if there are not too many demands, and so people are tired. I think there is also a COVID effect, that people kind of got burned out. And that's what we also saw in ARR — the Action Editors, they're trying to get reviewers to engage. Not all the time — this is an overgeneralization — but not a lot of responses. And that's sad because that's also the part I really enjoyed in reviewing, is when there is an exchange.

Chris Potts:And I'm glad you mentioned that. I completely sympathize because I've also been that reviewer who suddenly is flooded with author responses, all of them complex. And you're like, "I have a very short amount of time to try to do this." And it's socially complicated because you have your colleagues who've also reviewed, and if you're trying to nudge them... It really is just completely exhausting work. But I love the TACL rhythm because it's that version of that, but much more relaxed, you feel. And you only get typically one at a time — but then the problem is scaling up, how do you do it?

Marie de Marneffe:Exactly. And I think we don't have the right answer yet to that question — how do we scale up, which we need to do for conferences because we get so many submissions.

Chris Potts:Yeah, that's just a difficult question. Everything you think of has clear problematic consequences. Alas.

Well, here, let's talk about some research. This is more fun. I do want to return to that theme I mentioned, which is that you did NLU long before it was cool. And in fact, you did it in the period where it had fallen out of fashion. It used to be the only thing anyone cared about in NLP. Then the machine learning revolution thing happened in the '90s, and everyone did syntax because they understood how to apply those tools to those problems. And you were working on things like contradiction, and indirect questions, and other things, which now of course everyone is obsessed with. Everyone is doing NLU as far as I can tell. The whole field has been taken over. But what was it like to be doing this in 2006, and how did you end up doing it in the first place?

Marie de Marneffe:Yes. Well, I guess, that, in a sense, it's thanks to Chris Manning, who was my advisor. I started working on what was called at the time, RTE, Recognizing Textual Entailment, which all the NLP group was taking part in. I don't remember exactly how that happened, but looking at the data... At that time, it was only a binary annotation: entailment, non-entailment. And then looking at the data, I was like, "We actually should also annotate contradiction." And so somehow I was wondering, are there contradictions in those datasets? They were very small, compared to today: 800 pairs in RTE1. Probably also 800 in RTE2. We were already at iteration three, RTE3. And so I just sat down: I'm a linguist and I don't mind sitting down and doing manual annotation, and I annotated the contradictions. And I sent that to Chris, who happened to be at an RTE meeting, where suddenly the idea of contradiction had been discussed. And he got my email and he could say, "There are actually that many contradictions in those data sets."

From there, Chris agreed it was really a good idea to look for automatically finding those contradictions. And that's when I actually did my first ACL paper with Anna Rafferty, where we made this typology of contradiction — what kinds of contradiction do we have in those newswire datasets? Genre is also very important. And then have the system — back in the days it was not neural net based, so traditional features — have a machine learning algorithm that's going to give you some weights, and then predict. That was really fun. And then actually RTE3, I guess maybe... Yeah. Then the actual RTE3 was with contradiction also. And that's how I got in the field of NLU.

Chris Potts:I like that that story includes you making Chris look very wise at a meeting, just unbelievably prescient. Exactly the question they were asking, he had the numbers for. That's really nice.

I'll never forget also that I moved to Stanford in 2009. Before, when I came from my job visit, you and I talked about indirect answers to questions. You, me, and Scott Grimm. And even before I arrived, you were like, "Let's do a paper." And we did actually do a paper. So before I even arrived at Stanford, I had a Stanford paper. It just felt unbelievably welcoming. I'll never forget that. And then we did more stuff on indirect questions. Again, kind of a little bit before its time, I feel, but it was really satisfying work.

Marie de Marneffe:Yeah, it was. And actually I think Yejin Choi, in her ACL keynote, talked about our paper, our joint CL paper with Chris Manning. And she said that it was a bit of a precursor, in a sense &mdash that we had this idea of having a distribution, and not having only one gold standard answer, but maybe taking into account how annotators are annotating the data. I wasn't at ACL, and I confess I haven't had the time to look at Yejin's talk, but I really do want to look at it. She mentioned that to me when we were at NAACL, and so I would like to maybe collaborate with her and see what she has to say about that topic.

Chris Potts:Oh, yeah. Maybe we can find a way to link to it in the show notes so that everyone can check it out. I'd like to see it now too. That's another area where I feel you are really ahead of your time. This notion that we shouldn't be picking a single label, for example, as an NLU, but rather thinking about response distributions as reflecting uncertainty in the world or in language. Do you want to say more about how you came to that conviction?

Marie de Marneffe:I think, again, it's because I analyzed the data closely, looking at the data and those annotations, where we gathered at least 10 annotations per item. When you just plot what people are doing, then you realize there are some items where everybody agrees and then some where no one agrees. And so what's there? And of course, at first the idea was like, "Okay, then this is noise in the data collection. Maybe the guidelines are not really clear." But we quickly discarded that as an option because this is something we do every day. You listen to the radio, you read a newspaper, and then you decide, "This other stuff that I know about the world, is it true or false? Or can I not say anything about it?"

And also, I think we had some good ways of doing data collection, where we had some items with clear answers. And we did discard annotators that did not reply correctly to those items. We were spending quite a lot of time curating the dataset, the annotations that we were gathering.

So, I don't know if that answers your question, but I think it's really just looking at the data, and then also thinking about, "Okay, how do we understand language in everyday life?" There are sometimes quiproquos, right? We are not always crystal clear in our communication, and there are sometimes misunderstandings. A bit further in the conversation, you realize that maybe we didn't align — to use, maybe, a pragmatic concept — on the same question under discussion at that point of the conversation.

There is this really good paper where they say one annotation, one label per item is really a myth. Barbara Plank has really great work about how even for things that maybe are more clear-cut, like part-of-speech tagging, you also see that for some items, like particles of verbs, right? "Put up", what's the right tag for "up"? There is endless linguistic debate about that. Can we use this imprecision in the machine learning system to get better answers? And so why not transpose that to NLU?
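The soft-label idea Marie describes can be sketched in a few lines: instead of collapsing annotator votes into one gold label, train against the empirical distribution of the votes. The labels, vote counts, and model outputs below are invented for illustration, not taken from any paper or dataset discussed here.

```python
import math
from collections import Counter

LABELS = ["entailment", "neutral", "contradiction"]

def soft_label(votes):
    """Turn raw annotator votes into an empirical label distribution."""
    counts = Counter(votes)
    total = len(votes)
    return [counts[label] / total for label in LABELS]

def cross_entropy(target_dist, predicted_dist, eps=1e-12):
    """Cross-entropy of a model's prediction against the annotator
    distribution; reduces to the usual one-hot loss when all agree."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_dist, predicted_dist))

# An item where annotators split 7/3 between entailment and neutral:
votes = ["entailment"] * 7 + ["neutral"] * 3
target = soft_label(votes)          # [0.7, 0.3, 0.0]
confident = [0.97, 0.02, 0.01]      # a model that ignores the disagreement
hedged = [0.7, 0.29, 0.01]          # a model that mirrors the human split

# The hedged model incurs a lower loss for reproducing the split:
assert cross_entropy(target, hedged) < cross_entropy(target, confident)
```

The design point is only that the loss target is a distribution rather than a single label; any probabilistic classifier could be plugged in where the two hand-written prediction vectors stand.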

Chris Potts:Sure. If you could wave a magic wand and change one thing about the field that would connect with this idea, what would you do? Would it be something around how we assess systems? Would that be the main thing? Or how we create data sets?

Marie de Marneffe:Maybe make sure that everybody gives all their data. Make available all the annotations. I think we are getting there with this idea of reproducibility. At NAACL — actually to come back to NAACL — Jesse Dodge and colleagues had this idea of having reproducibility badges, where a lot of people actually participated. But I think making sure that we can see exactly what annotators did. When you think about it... I don't want to criticize too much the Stanford Sentiment Treebank, but having three annotations per item feels a bit too few, right?

Chris Potts:Oh, sure. Yeah. No, I think if you want to understand the response distributions, you need much more. And that would be an easy answer like, "Okay, you just need to get more annotations per example." I feel like the field could absorb that pretty well. What would be harder would be the questions of evaluation. Do you want people to be predicting a full response distribution and be evaluated in those terms? Or do you want to do something that's much more like, "Okay, this was a split distribution. But with context, everyone goes in one direction. And so therefore, your system should be able to do that as well." I'm not really sure. I suppose it depends on the problem.

Marie de Marneffe:Yeah, I think it might depend on the problem. But I think there is actually — if we come back to NLU — I think there are really two kinds of data. One where somehow the majority of people converge on one interpretation. And then others where people maybe imagine other contexts. Or, even with a definite context, they can get two interpretations or different interpretations. And we do want systems that are able to tease those two kinds of data apart, I think.
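One simple way to operationalize the two kinds of items Marie distinguishes is the entropy of each item's annotation distribution: items where annotators converge have low entropy, genuinely divided items have high entropy. The items, label names, and threshold below are hypothetical, purely to show the mechanics.

```python
import math
from collections import Counter

def label_entropy(votes):
    """Shannon entropy (in bits) of an item's annotation distribution:
    0 when everyone agrees, higher as annotators split across labels."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split_items(items, threshold=0.5):
    """Partition items into high-agreement vs. genuinely divided ones,
    using an (arbitrary) entropy threshold."""
    agree, divided = [], []
    for item_id, votes in items.items():
        (agree if label_entropy(votes) <= threshold else divided).append(item_id)
    return agree, divided

items = {
    "a": ["true"] * 10,                 # unanimous: entropy 0.0
    "b": ["true"] * 5 + ["false"] * 5,  # even split: entropy 1.0
    "c": ["true"] * 9 + ["unknown"],    # near agreement: ~0.47
}
agree, divided = split_items(items)
assert agree == ["a", "c"] and divided == ["b"]
```

With enough annotations per item (the point made above about three being too few), this kind of statistic lets a dataset, or an evaluation, treat converging and divided items differently.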

Chris Potts:For sure, yeah.

Marie de Marneffe:But what exactly differentiates the one kind of data from the other, that I'm not sure. But I think that's a very nice next step of research.

Chris Potts:Well, relatedly, I see an interesting strand in your research. So, you, me, and Chris, we did this paper on uncertainty for veridicality for embedded complements to verbs, so like "say", like, "IBM said its chairman resigned", and things like that. "Chris said that he was happy." Trying to understand how often that embedded claim became a commitment of the whole discourse or the speaker. And that really did reveal a lot of uncertainty in people's construal of those sentences. And now you're doing lots of work with Judith Tonhauser on presupposition projection in these contexts, and so forth. Do you see the through-line there? Is this all mentally one project for you?

Marie de Marneffe:Yes, yes. Or maybe like a Russian doll project.

Chris Potts:Sure.

Marie de Marneffe:We had a larger thing and now we've looked at a more fine-grained version, defined with those entailment-canceling operators. But we've also seen a lot of disagreement in the annotations, in work with my student, Nan-Jiang Jiang. And so now we are actually looking at this in more detail.

Chris Potts:Sure, because we were complaining about the NLPers inferring a single label. Linguists were doing this for a long time as well, saying, "This verb is factive. This verb is non-factive." And that also turns out to be too much of a simplification for verbs like "know", and "regret", and "remember".

Marie de Marneffe:Yeah. Though, I guess, Lauri Karttunen had the semi-factive category also, for "discover", right? And then I think the very intriguing question is, how come the context makes you think "know" can be non-factive, anti-factive even? We had some great examples at the end of our paper with Chris Manning of anti-factive usage of "know". From an NLU perspective and a more formal linguistic perspective, do we have a clear answer about that yet? That's good news in a way — we still have a job.

Chris Potts:Oh, well, that leads me to my next question: what's the role of large language models for you in the kind of research that you want to do? They do bring a certain fluidity to these questions. What do you think of them? Are you embracing them as interesting artifacts for doing NLU?

Marie de Marneffe:Yes and no, I guess. I've been really impressed by their ability to find patterns that maybe we haven't really seen as linguists in the data. Maybe this is a bit too much to say, but I think that's the only thing they do, is finding some patterns in the data. As soon as it's complicated or not following what's there, we cannot say they understand language. You have the news saying like, "Those neural nets understand language." And there was this really great paper by Alexander Koller and Emily Bender about how we have to be careful of the terms we use to describe those neural nets. And that's really true.

That's also part of the work I've been doing with one of my students, is trying to see what are the limits of those systems? If we give them a dataset, like this CommitmentBank dataset that we built, with those entailment-canceling operators and embedded complements, they understand 90%, but what about the remaining 10%? I think that's what we really want to understand — and then get the system to get those items right. Because, coming back to disagreement in annotation, those are not items on which people disagree. Most annotators do agree on the annotation, but still those neural nets cannot do it.

Microsoft had this research blog where, a year ago or I guess 18 months ago, they were very happy because their system was beating everyone else on the SuperGLUE benchmark. But at the end of the blog, they were saying that, "Yeah, maybe now it is time to have a combination between neural nets and symbolic reasoning." And I think that's where we are now in the field. How do we recombine both ways of thinking about the data?

Chris Potts:That's really interesting. Maybe that's the crux of it. Let me follow up on that because the understanding question seems very fraught. And I don't know, I go back and forth, but I certainly don't think that introducing symbols into what is otherwise just a very large and complicated numerical processing system is somehow going to lead to understanding. It's either possible for the version without symbols, or not. But the symbols don't really change this. But they might change something about the behaviors that we can see. And so my question for you would be, for the language models that we deal with now... which, if they have symbols, it's entirely induced from data, by and large... are you saying that they're intrinsically limited, that they're always going to fall short, and that therefore something else needs to be introduced into the picture?

Marie de Marneffe:Yeah, I think so.

Chris Potts:Just to achieve the behavioral targets that we're after?

Marie de Marneffe:Yeah. And I think all this work with those adversarial datasets, they've shown that too. I guess, it started with vision data, where neural nets really helped get a boost in performance. But then we saw that, if you overlay an elephant on the picture of a living room, a three year old would say — what's this elephant doing floating in the room? But the system would be totally thrown off.

Probably in life there really are some statistics and regularities in the data, and that's what we know as linguists, right? And when we started building grammars with rules, those rules were working 97% of the time, or so. And then maybe those 3%, we didn't really care about. But maybe we became also just a bit more perfectionist in a sense. Now that we have systems that can be so good, they ought to get that 3% right also.

Chris Potts:There's certainly been a lot of moving of goal posts. Even for me, my expectations go up, and up, and up. But I guess, where I come down for this is, right now I believe there's nothing in principle that I can detect that's wrong about language models or the way they're trained with self-supervision. I think that that kind of can induce a lot of modular structure that would look like symbols. And so there's nothing inherently broken about the model we're working with or the kind of practices we have, except that we need to bring in more relevant data, because evidently the kind of data they get now is not inducing the modular structure that I'd be happy to call symbolic structure. And so you could have two responses: you could train on better data or you could start to introduce rule-like things, to a priori induce the symbolic structure that you want. And then a language model could do anything that you want behaviorally, let's just say.

Marie de Marneffe:Yeah. Or maybe... And I don't know the answer to that question, but maybe to get to that, we actually really need grounding, because when you think about how is a human learning data and language, there is a lot of grounding. I think that's the other part that is slightly difficult to grasp: we have systems that are really good, but they are trained with lots of data, more data than you would read in a lifetime. So we are not yet building systems that actually mimic how we, as humans, are learning. Right?

Chris Potts:I totally agree, yeah.

Marie de Marneffe:And so I think when you say we need to bring the relevant data, I think I fully agree, but I don't think it's clear what exactly the relevant data is.

Chris Potts:In a way, it is though, right? Because what you mean by grounding is a full sensory experience with interaction.

Marie de Marneffe:Yes.

Chris Potts:And as you say that, it's like, "Okay, well, then we need data sets that have that very rich symbol stream. And then language models will be incredible and appear to understand..."

The thing about people and the amount of data &mdash that's another angle on this for me, because I'm not so worried if these models need to get vastly more data than humans in order to achieve human-like behaviors, because these models aren't initialized by the universe the way we are. And clearly, we don't come into the world as blank slates. We have a lot of cognitive capacity right from the start. And so unless we can figure out exactly what that structure is and initialize these systems this way, the only alternative might be vast amounts of data. But if that's all it is, vast amounts of data, well, that's certainly an interesting finding.

Marie de Marneffe:Yeah, sure. Sure. Yeah. No, no, that's true. I think it's more when the press, I guess, says that they learn like humans, I'm not sure that's the right characterization of the system.

Chris Potts:I am interested in them not for a cognitive claim, but just as devices for processing enormous amounts of a particular kind of data, and then studying them. So it's kind of a very sophisticated technique for taking a reading from a corpus.

Marie de Marneffe:Yeah, I do agree about that. And I have to say that, when my student, Nan-Jiang Jiang, came back with the results of those — it was BERT, at the time, run on the CommitmentBank, which uses a really wide scale. We have annotations from minus three to plus three. Minus three, this is false. Plus three, this is true. And then we had an average of the annotations. And so you have like -2.54. And the system comes back with -2.47, and you are like, "My God, how is it possible that it does that?" It is very impressive. I don't want to say that I don't think they are interesting, but I think maybe now we are past the time of thinking like, "Oh, I have this task, I apply neural nets, and I get great results."
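
The scoring setup described here can be sketched roughly as follows; the annotator ratings below are made up for illustration, and only the [-3, +3] scale and the averaging come from the conversation.

```python
def mean_rating(ratings):
    """Average annotator ratings on the CommitmentBank scale, where
    -3 means "certainly false" and +3 means "certainly true"."""
    assert all(-3 <= r <= 3 for r in ratings)
    return sum(ratings) / len(ratings)

def prediction_error(predicted, ratings):
    """Distance between a model's predicted score and the mean annotation."""
    return abs(predicted - mean_rating(ratings))

# Hypothetical ratings for one item; the mean is -2.6.
ratings = [-3, -3, -2, -3, -2]
# A model prediction close to the mean, as in the BERT example above.
print(round(prediction_error(-2.47, ratings), 2))  # → 0.13
```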

Chris Potts:Right. But are we in the era of, "Okay, I have this task and I do the in-context learning you just described, and I get pretty good results from a fixed artifact, a fixed model."

Marie de Marneffe:Maybe not.

Chris Potts:I have one more research question. I'll kick myself if I don't get to ask this. Pointed question. You're one of the original people on the Stanford Dependencies project. What are Stanford dependency representations, syntax or semantics?

Marie de Marneffe:Yeah, I knew you were going to ask that. I guess, it's really at the interface of syntax and semantics. I know some syntacticians out there really don't like when we say that it is a syntactic representation. But I don't know if it's a satisfying answer enough to say that.

Chris Potts:But the original aspiration for the project was just to convert Penn Treebank style trees into dependency parses. Is that right?

Marie de Marneffe:Yes, yes, with the downstream task of RTE (recognizing textual entailment) in mind.

Chris Potts:Oh, so that's why it got corrupted by semantics! I see!

Marie de Marneffe:Exactly, yeah.

Chris Potts:So as you did this pure syntax thing and you were like, "Well, wait, but if we just add this dependency connection here, we'll be able to have feature extraction that's better for RTE." And then pretty soon you had this amalgam of syntax and semantics.

Marie de Marneffe:Yes, that's where it originally started from.

Chris Potts:Do you remain heavily involved in the Universal Dependencies thing? That's a thriving project.

Marie de Marneffe:Yes, yes, yes. We have a meeting with the core group trying to refine the annotations and the guidelines for the Universal Dependencies project.

Chris Potts:Yeah, it's a wonderful project and incredibly ambitious. The idea is to have dependency treebanks for all the languages of the world, I suppose, and have some kind of harmonized way of reasoning across them and thinking about the underlying grammars. Is that right?

Marie de Marneffe:Yes, that's the goal. And what's been really extremely rewarding is to see so many people just offering their time annotating data, because all of this is done with almost no money. And so you have this huge collection of corpora annotated with the Universal Dependency scheme. So all of that would not have been possible without all the annotators out there and treebank creators.

Chris Potts:Oh, that's fascinating. So you just mean linguists, or just people who want their language represented in the treebank, in a kind of participatory design way, they just say, "Sign me up to do a bunch of annotation." And then you try to assimilate their work into the broader project. Is that how it works?

Marie de Marneffe:They can have a branch on the GitHub for their language and they upload their data when they have an annotation. You need to have lemmatization, part-of-speech tagging, and dependencies, I think. And there is a validator that checks that there are no big deviations from the guidelines. And so your annotations need to pass that validator, but then your data can be uploaded and released. All of that is thanks to Dan Zeman, who is doing a tremendous amount of work in releasing the UD data.
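
As a drastically simplified sketch of the kind of check such a validator performs (the real `validate.py` in the UD tools repository checks far more, and this helper is hypothetical): every token line in a CoNLL-U file has ten tab-separated columns, and the lemma, UPOS, and dependency columns should not be left unspecified for ordinary word lines.

```python
def check_conllu_line(line):
    """Return a list of problems found on one CoNLL-U token line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return [f"expected 10 columns, got {len(cols)}"]
    tok_id, form, lemma, upos, _, _, head, deprel, _, _ = cols
    problems = []
    if "-" in tok_id or "." in tok_id:  # multiword-token and empty-node
        return problems                 # lines are exempt from these checks
    if lemma == "_":
        problems.append(f"token {tok_id} ({form}): missing lemma")
    if upos == "_":
        problems.append(f"token {tok_id} ({form}): missing UPOS tag")
    if head == "_" or deprel == "_":
        problems.append(f"token {tok_id} ({form}): missing dependency")
    return problems

# A well-formed French token line passes with no problems.
line = "1\tchats\tchat\tNOUN\t_\t_\t2\tnsubj\t_\t_"
print(check_conllu_line(line))  # → []
```

The repeated underscore in the unpacking line simply discards the XPOS, FEATS, DEPS, and MISC columns, which this sketch does not inspect.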

Chris Potts:But for the people working on their own languages, they call the shots when it comes to saying like, "This is a particle, this is a preposition?"

Marie de Marneffe:Yes.

Chris Potts:And you just check for consistency at a global level?

Marie de Marneffe:Yes, exactly.

Chris Potts:Oh, that's really cool. So that's like embracing the fact that different linguists and native speakers of these languages, they might have different views. But as long as there's consistency in the overall system, you've got something you can think about.

Marie de Marneffe:Yes, yes. And we are striving for more consistency, because that's also where, when you have so many people annotating, there might be some discrepancies from the guidelines that are hard to find. But we are trying the best we can to have this harmonized enough for cross-lingual applications.

Chris Potts:Back to Belgium: now that you're there, are there some new things that you'll get to experience in terms of research and collaboration that are easier now that you're there as opposed to being in Ohio?

Marie de Marneffe:Yes, probably for working on French, that will be easier. But right now I still have tons of collaborations with my U.S. colleagues, and, as you mentioned, Judith Tonhauser. But little by little, I guess, I will have also more collaborations in Belgium. I'm working right now with two students, one in CS and another one more in linguistics, on a project about opinion &mdash how do you state opinion and facts in French? And so that's pretty cool and interesting.

Chris Potts:Oh, sure. Very timely as well.

Marie de Marneffe:Yes.

Chris Potts:What is the mission of the FNRS and what does it mean to be an associate researcher there?

Marie de Marneffe:So, the FNRS, that's part of the Belgian government, the French-speaking part — you have the equivalent for the Flemish one, because Belgium is a complicated country. And it funds researchers. So their motto is, freedom for research. If your project is accepted and funded, you have all the time to do your research. And so they fund research at all levels, in medicine, engineering, literature. It's kind of like the NSF in the U.S., in a sense, trying to fund fundamental research.

Chris Potts:Fundamental research, that's cool. So is that the dream job where you mainly just get to be a researcher working on the problems that fascinate you?

Marie de Marneffe:Yeah, it is.

Chris Potts:But I missed it before — are you also teaching classes?

Marie de Marneffe:Yeah. Since you are with a university, of course they like you to also teach. And that's a way of interacting with students. But I think, technically, you don't have to. It's more asking the people to be nice.

Chris Potts:But that reminds me also — I wanted to ask you, because you've always had these groups of students and researchers who are from both linguistics and CS, and I'm wondering how you manage that. It looks like one thing you've done is teach them courses that are kind of oriented toward bringing the linguists on board and maybe also bringing the CS people around to thinking linguistically. How have you managed that? And do you have a vision for how to do this effectively, to bring those two fields and groups of students together?

Marie de Marneffe:To be honest, I don't know if I really always managed to do it very successfully, but I do think that sometimes one of the things that was missing from the CS side was really looking at the data. And also, even the idea of what's a research question was not something they had really encountered in their curriculum. And that was really eye opening for them, thinking, "Okay, I can couch this as a research question and then look at the data rather than only caring about performance and numbers." And then people more from the linguistic side, or arts side, learn to program and to use AI systems, machine learning systems. But I think pairing them in class was one way of bridging both fields. I don't know if that answers your question.

Chris Potts:Yes. Looking at the data part, that's so important to me. I feel like things have gotten much better. There was an era where it was very hard to get NLPers to even think about what dataset they were analyzing. The whole turn toward adversarial data sets, I feel like that shifted, and people just being aware that they could craft systematic generalization problems that would expose a problem for someone's model, that led to actually a lot of thinking about data. And that's been exciting for me.

Marie de Marneffe:Yes, for me too. That's been really nice. And then seeing more error analysis in papers, because, as you said, there was a time where there was no data in computational linguistics papers, zero examples of what the system was getting wrong. But that is shifting.

Chris Potts:Yeah, it really is. There's more linguistics in papers these days than there was 15 years ago, for sure. And I also feel like NLP has gotten more interesting for linguists than it was before. Some of that for me started with vector representations of words, which are intrinsically more interesting than sparse feature representations where you hand designed all the feature functions. And now with contextual representations, I just feel like linguists ought to be getting very involved.

Marie de Marneffe:Yes. Yes, I think sometimes the step there is to understand the system. For some, it's not that easy.

Chris Potts:Right, right. Well, that's why we need interdisciplinary teams of the sort you're creating. That's my feeling.

Marie de Marneffe:Exactly. Exactly, yeah. But if you remember our best paper with Marta, I think it was Yoav Goldberg who then tweeted like, "First time it's a team of linguists who get a best paper award at the Computational Linguistics Conference."

Chris Potts:Yeah, I wanted to ask you about that too. You seem to have some kind of special talent for short papers. It's a very difficult form. And in a way, it's very hard to get them accepted, but you've got multiple best paper awards for these short papers. What's your trick for doing an excellent short paper?

Marie de Marneffe:I think having a clear message, probably. And I don't remember who told me that, but I should, right? Because it's really not only me who came up with that, but the key for writing a short paper was to have one clear message. Maybe it was Chris Manning, actually, or Dan Jurafsky, who had said that. Just keep that thread. Not too many things on the side, but I don't know.

Chris Potts:Right.

Marie de Marneffe:Thanks for the compliment.

Chris Potts:It's so hard to get reviewers to stop saying, "It should have been a long paper." I feel like it helps also if it's a little bit fun or surprising so that they're kind of like, "Well, it's a short paper so I'm not going to insist it be long." And then of course, it needs to be outstanding, because there's no reduction in expectations around the quality for short papers. That's why it's such a hard form.

Marie de Marneffe:Yeah, it's not easy.

Chris Potts:And your key to this has clearly been to have it be surprisingly linguistic-y.

Marie de Marneffe:Yeah. Thanks.

Chris Potts:And this relates — you've alluded to this a few times, but I just wanted to ask it as a direct question — how do you feel like having your PhD in linguistics, as opposed to CS, has really shaped who you are as a scholar?

Marie de Marneffe:I think it is the fact that I really learned about the structure of language data in a linguistic way, not in a CS way. And then meeting people like you. I think I really learned a lot from you and all the pragmatics that you taught me. Then also, as you know, I did a little bit of language acquisition during my linguistics PhD, and I think that also really helped me think about how we acquire language. What do we pay attention to? How does context matter? And having kids, right? Seeing how their language was emerging was also very fascinating. And so I think it really taught me the value of knowing some theory about language, and then looking at the data, doing error analysis and things like that. I did take some CS classes also. But there, you are slightly disconnected from the data sometimes.

Chris Potts:I was thinking also about your research — that something about the empirical question that you ask is different from the standard fare in NLP. Kind of like: what are the inferences of indirect answers to questions? Or what is the source of uncertainty in labeling? These are questions that are really from a cognitive/linguistic angle that turn out to have really important implications for NLP systems. But your starting point, I would like to claim, is really a linguistics PhD starting point and not a CS one. I think that's cool.

Marie de Marneffe:Yeah, and I think that's because of my trajectory, and how I came to NLU in a way.

Chris Potts:But why didn't you then just become a pragmaticist or a language acquisition specialist? Why did you do NLP?

Marie de Marneffe:Yeah, good question. It all started back in Belgium, where I did classical languages.

Chris Potts:Oh.

Marie de Marneffe:That was my bachelors, in classical languages. And then the exercise that I really loved was translation: how do you translate Latin to French, or ancient Greek, and vice versa, trying to really keep the spirit of the text? If it's a rare structure in Latin, it should be a rare structure in French. And then really I was like, "It could be really cool to do that automatically, right?" At the time, my question was, "Can we replace translators?" Which I don't think we can, or not totally. And so that's when I went then to CS to learn about machine learning.

At first, when I came to Stanford, I wanted to do automatic machine translation. But as I told you at the beginning of the podcast, somehow I came into NLU, a bit by accident, and I started thinking about all those questions where, if you ask me a yes/no question, and I don't answer by "yes" or by "no", you are still going to infer a "yes" or a "no". How do we do that? And then can we get machines to do that? And I think that's where I thought it was really fascinating to combine both.

I should say that, prior to doing classical languages, I had passed the exam to start engineering school. In Belgium, you have to pass an exam. And so I guess, I was going back to that side of me, more mathematics and CS-y.

Chris Potts:There's something so fitting about this &mdash that you started with machine translation, but instead of thinking, "Oh, wouldn't it be nice to have systems that could help translate a simple newspaper article or help a tourist in some foreign land", you thought, "What I want is to have automatic systems that will capture the spirit of the original text." Absolutely the hardest thing about this is what you wanted machines to do.

Marie de Marneffe:Maybe. Maybe that explains why my place was in linguistics.

Chris Potts:But also doing NLU, right? Yeah. And not just NLU in the simple cases, but the hardest cases for NLU. I love that. That's inspiring.

Final two questions, can't resist asking, and then we could wrap up, just about you in particular. Are you still skiing?

Marie de Marneffe:Ah, no.

Chris Potts:No?

Marie de Marneffe:No, not really. I tried a little bit, but each time it didn't end well.

Chris Potts:Oh, not with more broken legs, I hope!

Marie de Marneffe:No. No, no, not broken, but hard fall. And I was like, "Okay." And then I cannot keep up, the kids are going too fast. They always have to wait for mommy. So I decided that reading a paper or a book on the patio was maybe more for me.

Chris Potts:I can relate, I've tried to do some downhill skiing with my niece and nephew, and I fall much harder than they do. And so I think I prefer to just hear about their experiences at the end of the day. I do like cross country skiing still though.

Marie de Marneffe:Yeah. But so last week we were actually on vacation, and my brother coerced me into learning to surf.

Chris Potts:Surf?

Marie de Marneffe:On the water, right?

Chris Potts:Sure.

Marie de Marneffe:And so that was actually pretty fun. We took a surfing class with the kids.

Chris Potts:Did you spend your whole time in California and never surf?

Marie de Marneffe:Yeah, I did. I passed my boating permit on the bay, that I did. Sailing, sailing permit. But no, I never surfed.

Chris Potts:So now you're a surfer, you and the family?

Marie de Marneffe:I don't know if we can really say I'm a surfer. But I guess, I sometimes managed to get up.

Chris Potts:You did get up, so I think then you're a surfer.

Marie de Marneffe:I did get up and I managed to turn once.

Chris Potts:Oh, okay. Then clearly a surfer. That's the vague boundary. You are now far past this vague boundary. If you catch a wave, I feel, and also carve a little bit, you're a surfer. So final question for you then, are you going to be back in California for a visit anytime soon?

Marie de Marneffe:Yeah, I would love to come back. Yeah.

Chris Potts:We have to find some kind of event or excuse now that COVID seems for now to be receding and things are opening up. Surely there's a way that you could come back and visit the NLP group and everything, tell all your war stories.

Marie de Marneffe:Yeah, a stroll down memory lane, then.

Chris Potts:Wonderful. Well, thank you so much for doing this, Marie. This was a wonderful conversation.

Marie de Marneffe:Yeah, it was really fun for me too. Thank you, Chris.