HOMEWORK 7

CS 224U / LING 188/288, Spring 2014, Stanford

All of these problems concern sentiment analysis, but the underlying issues arise wherever one works with naturalistic corpora.

Question 1 [2 points]

Below are the distributions of reviews in corpora derived from Amazon.com. Each text is associated with a star rating from 1 to 5 (1 most negative, 5 most positive).

Your task: identify one problem that the nature of these distributions might cause for a classifier predicting the rating attached to a given text. (1-3 sentence response.)

Amazon reviews

          Chinese   English   German   Japanese
1-star     29,642    39,383    2,984      3,973
2-star     32,602    48,455    1,880      4,166
3-star    100,272    90,528    2,646      8,708
4-star    160,817   148,260    4,427     18,960
5-star    204,461   237,839   15,774     43,331

A CSV version of the above table and the table in question 3 is also available.
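
To see the problem concretely, here is a minimal sketch (counts copied from the table above) of how well a degenerate classifier that always predicts the most frequent rating would do:

    # Counts from the Amazon table above, ordered 1-star through 5-star.
    counts = {
        "Chinese":  [29642, 32602, 100272, 160817, 204461],
        "English":  [39383, 48455, 90528, 148260, 237839],
        "German":   [2984, 1880, 2646, 4427, 15774],
        "Japanese": [3973, 4166, 8708, 18960, 43331],
    }

    for lang, stars in counts.items():
        total = sum(stars)
        # Accuracy of always guessing the most frequent class (5 stars):
        print(f"{lang}: majority-class accuracy = {max(stars) / total:.2f}")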

Question 2 [2 points]

One common response to major class imbalances like the above is to artificially balance the training/testing data by sampling from the categories.
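
To make the recipe concrete, here is one minimal sketch of downsampling, in which every rating class is cut back to the size of the rarest one; the function name and the (text, rating) representation are illustrative choices, not part of the assignment:

    import random
    from collections import defaultdict

    def downsample(examples, seed=0):
        """Balance (text, rating) pairs by downsampling every rating
        class to the size of the rarest one."""
        by_rating = defaultdict(list)
        for text, rating in examples:
            by_rating[rating].append((text, rating))
        n = min(len(group) for group in by_rating.values())
        rng = random.Random(seed)
        balanced = []
        for group in by_rating.values():
            balanced.extend(rng.sample(group, n))
        rng.shuffle(balanced)
        return balanced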

Your task: identify two problems that this might pose for training and/or testing on real data. (2-4 sentence response.)

Question 3 [2 points]

Here is the distribution of reviews in a corpus derived from the website RateBeer. Whereas Amazon.com is an online store, RateBeer is essentially a social-networking site on which beer enthusiasts share information about beers via short reviews and ratings. The site's members vary greatly in their beer expertise, interact with one another frequently, tend to write many reviews, and have tastes that evolve over time.

RateBeer

1-star     74,508
2-star    196,397
3-star    722,797
4-star  1,547,707
5-star    382,754
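
Before conjecturing, it may help to put the two sites on the same scale; a minimal sketch that normalizes the counts into proportions (counts copied from the tables in questions 1 and 3):

    # Ordered 1-star through 5-star.
    amazon_english = [39383, 48455, 90528, 148260, 237839]
    ratebeer = [74508, 196397, 722797, 1547707, 382754]

    for name, counts in [("Amazon (English)", amazon_english),
                         ("RateBeer", ratebeer)]:
        total = sum(counts)
        shares = "  ".join(f"{i + 1}-star {c / total:.2f}"
                           for i, c in enumerate(counts))
        print(f"{name}: {shares}")

Note, for instance, that RateBeer's modal rating is 4 stars, whereas every Amazon distribution peaks at 5 stars.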

Your task: given what you know about Amazon.com and RateBeer.com, offer two conjectures about why the overall rating distributions might be so different (comparing the tables in question 1 with the above table). Assume that the differences are not due to problematic sampling from the sites' overall data. (1-3 sentence response.)

Question 4 [2 points]

The file imdb-advadj-with-ratings.csv.zip contains 91,713 adverb–adjective pairs from the short summaries attached to user-supplied movie reviews at IMDB.com. The file can be read with any program that reads tabular data (Excel, OpenOffice, R, SPSS, ...). All values are comma-separated, and no value itself contains a comma. Here's a sample of the format:

Rating,Polarity,Adverb,Adjective,AdjPolarity
8,Pos,really,knowing,Objectiv
5,Neutral,nearly,identical,Objectiv
3,Neg,amazingly,dull,Negativ
7,Neutral,not,bad,Negativ
8,Pos,especially,great,Positiv
10,Pos,utterly,charming,Objectiv

Your task: The feature AdjPolarity alone does modestly well at predicting Polarity (micro-averaged precision/recall/F1 = 0.45). Propose two additional features that you think could improve performance. Exceptional answers will actually test out the proposed features (a starter sketch follows the notes below).

Note: The raw star ratings are off-limits as features, since Polarity is derived directly from them! Feel free to ignore them; we kept them in case you are feeling particularly ambitious.

Note: This is a real-world data set. There might be some junk adverb–adjective pairs in the mix, and the file inherits the rating-scale imbalances of the larger corpus (similar to what we saw in question 1).
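
If you do want to test candidate features, a scikit-learn sketch along the following lines might serve as a starting point. The filename (the unzipped CSV) and the choice of the adverb string as the added feature are illustrative assumptions, not part of the assignment:

    import csv

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    with open("imdb-advadj-with-ratings.csv") as f:
        rows = list(csv.DictReader(f))

    def featurize(row):
        # Baseline feature plus one candidate feature (the adverb itself).
        # Rating is deliberately excluded, per the note above.
        return {"adjpol=" + row["AdjPolarity"]: 1.0,
                "adverb=" + row["Adverb"].lower(): 1.0}

    X = [featurize(row) for row in rows]
    y = [row["Polarity"] for row in rows]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)
    preds = clf.predict(vec.transform(X_test))
    print("micro-averaged F1:", f1_score(y_test, preds, average="micro"))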

Question 5 [2 points]

Many algorithms for building large sentiment lexicons classify simple strings, ignoring grammatical information like part of speech as well as contextual information about where and how the strings were used. Thus, they will likely miss the reliable sentiment contrast between the adjective gross (as in yucky) and the noun gross (as in profits), and the fact that sensitive is likely positive when it describes a novelist but negative when it describes a bruise.
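
As a quick illustration of how grammatical information can be brought in, here is a sketch using NLTK's off-the-shelf tagger; the example sentences are my own, and the tagger's output is typical rather than guaranteed:

    # Requires the NLTK models:
    #   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    import nltk

    sentences = [
        "That scene was really gross.",               # adjective sense (yucky)
        "The studio's gross exceeded expectations.",  # noun sense (profits)
    ]
    for sent in sentences:
        # A decent tagger typically returns JJ for the first "gross" and
        # NN for the second, recovering the contrast that a string-only
        # sentiment lexicon would miss.
        print(nltk.pos_tag(nltk.word_tokenize(sent)))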

Your task: Think up four examples in which grammatical information or contextual information is important for sentiment classification. For each one, include an example sentence that highlights the contrast you identified. If your examples are from a language other than English, please provide English glosses. (Non-English data are encouraged!)