HOMEWORK 2

  1. We're going to build a supervised word sense disambiguation (WSD) system to disambiguate two senses of the made-up word vek, as described in Section 20.2 of Jurafsky & Martin. The annotated training corpus consists of just five sentences:

              foo vek1 bar baz
              foo foo vek1 zub
              zip vek1 zip zub
              foo vek2 zip zip
              foo zub vek2 bar 

    Our feature representation will be exceedingly simple: we'll use a bag of words containing the one word before and one word after the target word. That is, for every word w_i in our vocabulary, we'll have a boolean feature f_i indicating whether w_i appears as one of these two context words. Assume that the vocabulary includes every word observed in the training data.

    Using this feature representation, we will train a standard Naive Bayes classifier on our small training corpus, and then use the classifier to make disambiguation predictions on test examples.
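    To make the setup concrete, here is a minimal sketch in Python of the training procedure just described (our illustration, not part of the assignment): it extracts the boolean bag-of-words features from the one word before and one word after the target, and collects unsmoothed maximum-likelihood counts. Note that the score function multiplies in only the context words that are actually observed; a full Bernoulli model would also include factors for absent features.

        from collections import Counter

        # The five annotated training sentences; the target token carries its sense label.
        corpus = [
            "foo vek1 bar baz",
            "foo foo vek1 zub",
            "zip vek1 zip zub",
            "foo vek2 zip zip",
            "foo zub vek2 bar",
        ]

        def context_words(tokens, i):
            """The set of context words: the one word before and the one word after position i."""
            return {tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)}

        sense_counts = Counter()    # number of training examples per sense
        feature_counts = Counter()  # number of examples of a sense with word w in context

        for sentence in corpus:
            tokens = sentence.split()
            i = next(j for j, t in enumerate(tokens) if t.startswith("vek"))
            sense_counts[tokens[i]] += 1
            for w in context_words(tokens, i):
                feature_counts[(tokens[i], w)] += 1

        def score(sense, context):
            """P(sense) times the product of P(w | sense) for each observed context word."""
            p = sense_counts[sense] / sum(sense_counts.values())
            for w in context:
                p *= feature_counts[(sense, w)] / sense_counts[sense]
            return p

    For the test example in part 4 below, for instance, the context set would be {'zip', 'zub'}, and the classifier compares score('vek1', ...) against score('vek2', ...).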

    1. Write down the equation that our Naive Bayes classifier will use to make predictions.

    2. What is the prior probability of sense 1?

    3. What is the conditional probability of the feature foo, given sense 1?

    4. Suppose we apply our trained classifier to the following test example.
                    zip vek? zub 
      Which sense will our classifier predict to be more likely?

    5. What specific probabilities will our classifier assign to each of the two senses?

    6. [Extra credit] Now suppose we apply our trained classifier to this test example.
                    foo vek? baz 
      What difficulties might we encounter, and how can we address them?

  2. We've put a 1GB fragment of Wikipedia here. Please download it to some convenient location and unzip it using the unzip command.

    By grepping through the unzipped XML file, you can find occurrences of (Wikipedia-style) hyperlinks like the ones that [Mihalcea 2007] used to create her sense-annotated corpus. For example, try the following grep command to find hyperlinks with anchor text "bar":

              egrep -o "\[{2}[a-z() _-]+\|bar\]{2}" enwiki-20081008-pages-articles-first-1gb.xml 

    Play around with this command and some variations to get familiar with the possibilities. (Pro tip: add "| sort | uniq -c" to the end of your command line to combine and count duplicates.)
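    If you'd rather explore programmatically, here's a rough Python equivalent of the grep-sort-uniq pipeline above (our sketch; the file name matches the grep example): it scans the dump for [[target|anchor]] links with a fixed anchor text and tallies how often each link target occurs.

        import re
        from collections import Counter

        anchor = "bar"
        # Same pattern as the egrep command above, with the link target captured.
        pattern = re.compile(r"\[\[([a-z() _-]+)\|" + re.escape(anchor) + r"\]\]")

        counts = Counter()
        with open("enwiki-20081008-pages-articles-first-1gb.xml",
                  encoding="utf-8", errors="ignore") as f:
            for line in f:
                counts.update(pattern.findall(line))

        # Print targets from most to least frequent, like sort | uniq -c | sort -rn.
        for target, n in counts.most_common():
            print(n, target)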

    Then, please find a noun having the following properties:

    Show us how you'd map the Wikipedia senses you've discovered into WordNet senses. We're looking for mappings that look like this:

              [[bar (law)|bar]] --> 'legal_profession.n.01' (the canonical name for the synset)

    or

              [[bar (law)|bar]] --> 'the body of individuals qualified to practice law in a particular jurisdiction' (the synset definition of the sort one gets from the Web interface)

    Be sure to list any Wikipedia senses that don't map to any WordNet sense, and conversely, any WordNet senses to which no Wikipedia senses are mapped.
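    To enumerate the WordNet senses on the other side of the mapping, one convenient option (our suggestion; any WordNet interface will do) is NLTK, which gives both the canonical synset names and the definitions shown in the two example formats above:

        from nltk.corpus import wordnet as wn  # requires a one-time nltk.download('wordnet')

        # List every noun synset for a word, with its canonical name and definition.
        for synset in wn.synsets("bar", pos=wn.NOUN):
            print(synset.name(), "--", synset.definition())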