CS224U Winter 2012 Homework 6

In this assignment, we'll explore relation extraction from the web. In particular, we'll experiment with a bootstrapping approach to relation extraction, as depicted in Figure 22.16 of Jurafsky & Martin.

Let's suppose we have a brilliant idea for a startup which will be to literature what IMDb is to movies. (OMG, genius, we're gonna be rich!) One table in our database will relate authors to the novels they have written:

Author             Novel
-----------------  -----------------
Ayn Rand           Atlas Shrugged
Joseph Heller      Catch-22
Vladimir Nabokov   Lolita
James Joyce        Ulysses
(etc.)             (etc.)

In order to populate this table, we're going to try to extract (author, novel) tuples from natural language text on the web, using phrasal patterns containing wildcards, as described in section 22.2.2 of Jurafsky & Martin.

Google makes it easy to search for phrasal patterns on the web using a feature called "fill-in-the-blank" search, in which you use the * character to represent one or more wildcard tokens. As a warm-up, try these queries on Google:

"* is a novel by *"
"* wrote the novel *"
"the novel * was written by *"

(Make sure you include the quotation marks in your query.) Spend some time looking through the results. Note how a portion of each returned snippet appears in bold. Google is trying to guess at the appropriate boundaries for the matching phrase, and it doesn't always get it right. Note also that Google sometimes allows minor variations in the match. For example, the query "* is a novel by *" yields as a match "David's Buick is a new novel by Bob Saar". Let's not get hung up on these quirks here — they need not distract us from the main thrust of this exercise.

OK, now that you've warmed up: suppose we're going to begin populating our database using the first pattern above, namely "* is a novel by *". In our initial attempt, we'll retrieve all search results for that query, take the matching text from each snippet, and then construct a tuple from the subphrases corresponding to each wildcard (again, without fretting too much about how we find the right phrase boundaries — let's just assume we can do that reasonably well).
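
To make that procedure concrete, here is a minimal sketch in Python of how a fill-in-the-blank pattern might be turned into a tuple extractor, assuming we already have the matching snippet text as plain strings. The function names and the punctuation-based heuristic for guessing wildcard boundaries are illustrative choices, not a required implementation:

    import re

    def pattern_to_regex(pattern):
        # Turn a fill-in-the-blank pattern like "* is a novel by *" into a
        # regex with one capturing group per wildcard. Here a wildcard may
        # match any span of text up to clause-level punctuation -- a crude
        # stand-in for guessing phrase boundaries.
        parts = [re.escape(p) for p in pattern.split("*")]
        return re.compile(r"([^,.;]+)".join(parts), re.IGNORECASE)

    def extract_tuples(pattern, snippets):
        # Apply the pattern to each snippet and collect (author, novel) pairs.
        # For "* is a novel by *", the first group is the title and the
        # second is the author.
        regex = pattern_to_regex(pattern)
        tuples = []
        for snippet in snippets:
            match = regex.search(snippet)
            if match:
                novel, author = (g.strip(" .,\"'") for g in match.groups())
                tuples.append((author, novel))
        return tuples

    # Two snippets of the kind a search engine might return:
    snippets = [
        "Atlas Shrugged is a novel by Ayn Rand, first published in 1957.",
        "Catch-22 is a novel by Joseph Heller.",
    ]
    print(extract_tuples("* is a novel by *", snippets))
    # [('Ayn Rand', 'Atlas Shrugged'), ('Joseph Heller', 'Catch-22')]

Of course, this naive extractor inherits all the boundary-guessing problems noted above, which is exactly what the next questions explore.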

Answer each of the following questions in a few sentences — no need to write a treatise here.

  1. [1.5 points] Show one or two examples where the resulting tuple is close to something we'd want to put in our database, but not quite right. Can you suggest some simple strategies or heuristics to fix such examples up?
  2. [1.5 points] Show one or two examples where the extracted tuple is definitely not something we'd want to put in our database, and there's no easy way to repair it. What strategies might we use to identify and filter out such tuples?
  3. [1 point] Show one or two examples where our strategy works like a charm: we discover a new (author, novel) tuple to add to our database, and both names are clean and correct as is.

OK, suppose that we've found effective (if not perfect) solutions to the problems highlighted in questions 1 and 2, and we've used our first pattern to find several thousand (author, novel) tuples. But we know we're still missing many thousands more, so we're going to use the tuples we already have to identify additional extraction patterns. This time, enter the following search query:

"Ayn Rand" "Atlas Shrugged"

Again, spend some time perusing the results. Do the results suggest any good patterns that might be used to identify additional tuples?
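
If it helps to see the induction step spelled out, here is one minimal sketch, again in Python with made-up snippet strings, of how a known tuple can be generalized into candidate patterns by replacing the author and title with wildcards. The helper name induce_patterns and the word-length cutoff are illustrative assumptions:

    import re

    def induce_patterns(known_tuples, snippets, max_words=6):
        # For each snippet that mentions both the author and the novel of a
        # known tuple, replace the two names with wildcards and keep the text
        # between them as a candidate pattern. A real bootstrapping system
        # would also score each candidate, e.g. by how many distinct known
        # tuples it matches.
        patterns = set()
        for author, novel in known_tuples:
            for snippet in snippets:
                if author in snippet and novel in snippet:
                    generalized = snippet.replace(author, "*").replace(novel, "*")
                    # Keep only the context between the two wildcards, and
                    # discard candidates whose middle context is too long.
                    middle = re.search(r"\*(.*?)\*", generalized)
                    if middle and len(middle.group(1).split()) <= max_words:
                        patterns.add("*" + middle.group(1) + "*")
        return patterns

    known = [("Ayn Rand", "Atlas Shrugged")]
    snippets = [
        "Atlas Shrugged, the 1957 novel by Ayn Rand, remains controversial.",
        "Ayn Rand's Atlas Shrugged tops many reader polls.",
    ]
    print(induce_patterns(known, snippets))
    # induces two candidate patterns from the example snippets

Candidates induced this way vary widely in quality, which is what the following questions ask you to examine.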

  1. [1.5 points] Find an example of a pattern that is likely to be precise (i.e., matches to the pattern are likely to be valid instances of the relation) but unproductive (i.e., we're unlikely to find many other matches). What strategies might we use to eliminate (or improve) such patterns?
  2. [1.5 points] Find an example of a pattern that is likely to be productive (i.e., we expect to find lots of matches to the pattern) but imprecise (i.e., many or most of the matches will not be valid instances of the relation). What strategies might we use to eliminate (or improve) such patterns?
  3. [1.5 points] Are there other characteristic ways that a pattern can go wrong? Show one or two examples.
  4. [1.5 points] Find an example of a pattern that looks likely to succeed. Then, use this pattern to search for some new tuples! Did it work as well as you expected?