CS224U Homework 5 (Spring 2014)

CS224U Homework 5

This homework is designed to familiarize you with Combinatory Categorial Grammar (CCG), which is an important building block for Zettlemoyer & Collins 2005 and many other papers in semantic parsing.

For this assignment, we'll be playing around with the ccg package in NLTK, a popular and highly useful platform for doing NLP in Python. However, we don't assume any expertise in Python. Python and NLTK are already installed on the FarmShare machines (e.g. the corn machines). If you'd like to install NLTK on your own machine, you can find instructions here.

The ccg package of NLTK supports defining a CCG lexicon and parsing simple sentences. Unfortunately, it does not provide a built-in mechanism for associating semantics (such as lambda calculus representations) with items in the lexicon, so this exercise will focus solely on the operation of the syntactic categories. If you'd like to get more background on the ccg package, this page may be helpful, although it should not be necessary.

First, let's start Python and import some key modules:

$ python
Python 2.7.4 (default, Sep 26 2013, 03:20:26)
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
from nltk.ccg import chart, lexicon

Now let's create a very simple lexicon. You can just copy and paste this into your Python session:

lex = lexicon.parseLexicon('''
:- S, NP, N
Idaho => NP
Utah => NP
borders => (S\\NP)/NP
''')

In Python, the triple tick (''') indicates a multi-line string, so the next four lines constitute the lexicon. The line beginning with :- indicates that our CCG lexicon will contain S, NP, and N as basic categories. The following lines indicate that Idaho and Utah will have category NP, while the transitive verb borders will have the compound category (S\NP)/NP. (Note that the backslash has to be written twice, because of escaping issues.)

Now we're ready to parse a simple sentence.

parser = chart.CCGChartParser(lex, chart.DefaultRuleSet)
for parse in parser.nbest_parse("Idaho borders Utah".split(), 1):
chart.printCCGDerivation(parse)
Idaho borders Utah
NP ((S\NP)/NP) NP
------------------->
(S\NP)
--------------------------<
S

Hot diggity. We're parsing with CCG.

Take some time to understand the derivation that was printed. There are two combinatory steps. First, category (S\NP)/NP combined with category NP via forward application (marked with -->) to yield category S\NP. Then, category S\NP combined with category NP via backward application (marked with --<) to yield category S.

At the end of this page is a brief appendix containing a few Python snippets which might be helpful for this exercise.

Now you're reading for some questions. Each question is worth 1 point. (10 points total.)

Question 1

Note that the second argument to the method nbest_parse() specifies how many parses to return. In the invocation above, we specified 1. How many possible parses are there altogether for the sentence Idaho borders Utah?

Question 2

One of the parses for Idaho borders Utah associates a category with the span Idaho borders, which is surprising, because the categories for Idaho and borders do not appear to be directly combinable. Which combinatory rule(s) make this possible?

Question 3

Suppose we want to parse the sentence what states border Texas. What CCG lexicon items do we need to make this work?

Question 4

Suppose we want to parse the sentence what is Texas. What CCG lexicon items do we need to make this work?

Question 5

Suppose we want to parse the sentence what is a state. What CCG lexicon items do we need to make this work?

Question 6

Suppose we want to parse the sentence what is a state that borders Texas. What CCG lexicon items do we need to make this work?

Question 7

Using the lexicon items you proposed in Question 6, how many possible parses are there for the sentence what is a state that borders Texas?

Question 8

Note that when we constructed the parser using chart.CCGChartParser(), we also specified the set of combinatory rules as chart.DefaultRuleSet. The default rule set includes four types of combinatory rules: it is equivalent to chart.ApplicationRuleSet + chart.CompositionRuleSet + chart.SubstitutionRuleSet + chart.TypeRaiseRuleSet. Examine the impact of using different combinations of rule sets on the number of possible parses for the sentence what is a state that borders Texas. Write a few sentences summarizing your findings.

Question 9

Now let's consider a slight variant on the example from Question 6: the sentence what is a state that Texas borders. Using the same set of lexicon items, and the default rule set, how many parses do you obtain?

Question 10

As in Question 8, examine the impact of using different combinations of rule sets on the number of possible parses for the sentence what is a state that Texas borders. How does the result different from your answer in Question 8? Write a few sentences explaining why it differs.

Optional bonus question!

Next Wednesday's lecture will be on the topic of "Semantic Parsing at Google". My aim is to give you a sense of how NLU research is conducted in "the real world" outside of academia, and I'll leave ample time for Q&A.

I encourage you to formulate a question related to doing NLU in industry, and to submit it along with your homework. I'll try to address common themes as part of my talk. (Of course, if you submit at the last minute, I won't see your question in advance!)

Appendix: Python snippets

Here are a few Python snippets which might be helpful for this exercise:

def print_parses(sentence, parser):
parses = parser.nbest_parse(sentence.split())
for i, parse in enumerate(parses):
print "\nParse %d:\n" % (i + 1)
chart.printCCGDerivation(parse)
def test(lex_items, sentence):
lex = lexicon.parseLexicon(":- S, NP, N\n" + lex_items)
parser = chart.CCGChartParser(lex, chart.DefaultRuleSet)
print_parses(sentence, parser)
lex_items = ''' # example usage
Idaho => NP
Utah => NP
borders => (S\\NP)/NP
'''
test(lex_items, "Idaho borders Utah")