Analyzing linguistic data: project
Steps for doing your project
Possible topics
Ideally, you pick a topic of your own that you are curious about. But to give you an idea of possible topics, here are a few pointers:
Some ideas from Language Log's breakfast experiments:
Which words are used to describe white and black NFL prospects? Links here and here (the second link has data for download)
State of the Union: what are signature words of Obama, of earlier presidents? (And why?)
The statistics of real estate listings: linking real estate price to the language in the descriptions
Analyzing Twitter data: Which words are used proportionally much more often by women than by men, and vice versa? (See this Language Log post.) Which words tend to occur with :-) and which with :-( ? In Twitter comments on a single issue (a proposed bill, a soccer tournament), which words tend to occur with which opinion? Is it possible to detect sarcasm in tweets? See also this Language Log post.
Author analysis: analyzing poems to detect who may have written them, and what characteristics they have
Online recipe collections: You could ask, for example, whether it is possible to predict the number of stars that a recipe will get from the recipe ingredients. See also Dan Jurafsky's language of food papers.
Noah Smith has a few nice datasets to analyze:
Movie corpus: predicting movie revenue from review texts
Congressional bill corpus: predicting whether a bill will survive from the text in the bill
Corporate reports corpus: predicting how well a company will do from the annual reports that it issues
Please discuss your topic with the instructor to make sure that it is both substantial and feasible.
For your course project, you will need to apply statistical analyses yourself. Google Books n-gram charts, while pretty, do not count as your own statistical analysis.
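To give a rough idea of what running an analysis yourself can look like for the word-comparison topics above, here is a minimal sketch in Python. It computes a Pearson chi-square statistic for how often one word occurs in two groups of texts. The function names, the whitespace tokenizer, and the toy texts are all assumptions made for illustration; they are not part of the course materials, and the choice of test is only one option among many.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (hypothetical helper).

    A real project would likely need a more careful tokenizer.
    """
    return text.lower().split()


def chi_square_2x2(k_a, n_a, k_b, n_b):
    """Pearson chi-square statistic (no continuity correction).

    k_a, k_b: occurrences of the target word in group A and group B.
    n_a, n_b: total number of tokens in group A and group B.
    """
    total = n_a + n_b
    word_total = k_a + k_b
    if word_total == 0 or word_total == total:
        raise ValueError("the word must occur in some but not all tokens")

    observed = [[k_a, n_a - k_a], [k_b, n_b - k_b]]
    row_totals = [n_a, n_b]
    col_totals = [word_total, total - word_total]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if word use were independent of group.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Toy, made-up texts standing in for two groups of documents.
group_a = tokenize("great game great win what a great night")
group_b = tokenize("bad game bad call what a night")

counts_a, counts_b = Counter(group_a), Counter(group_b)
stat = chi_square_2x2(counts_a["great"], len(group_a),
                      counts_b["great"], len(group_b))
print(f"chi-square for 'great': {stat:.2f}")
```

With one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05. Two cautions apply: the approximation is unreliable when expected counts are as small as in this toy example, and words in running text are not independent draws, so a test like this is best treated as a first pass rather than a final result.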
Course project information
By default, course projects should be done by teams of 2 students; however, projects done by 1 or 3 students are possible with prior approval of the instructor.
Initial project description
This is a 1-2 page document (single-spaced, single column) that describes what your project will be about. It needs to contain the following information:
Motivation: Why should people be interested?
Research questions: What are the main questions that you want to answer? What are your hypotheses about what the answers will be?
Method: What statistical analyses will you use to test your hypotheses? What kind of data will you use to test them?
Data: At this point, you need to have determined that you will be able to get the data you need to run your analyses. Say what data you will use, how large your dataset is (number of words, number of relevant documents, ...), and how you will obtain the data.
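The dataset size asked for under Data can be computed with a few lines of code. The following is a minimal sketch in Python, assuming the documents are already loaded as a non-empty list of strings; the function name `describe_corpus`, the whitespace tokenization, and the example listings are hypothetical and only meant as illustration.

```python
from collections import Counter
from statistics import mean, median


def describe_corpus(documents):
    """Return basic size statistics for a non-empty list of text documents."""
    # Whitespace tokenization is a simplifying assumption.
    token_lists = [doc.lower().split() for doc in documents]
    lengths = [len(tokens) for tokens in token_lists]
    vocabulary = Counter(tok for tokens in token_lists for tok in tokens)
    return {
        "documents": len(documents),
        "tokens": sum(lengths),        # running words
        "types": len(vocabulary),      # distinct word forms
        "mean_doc_length": mean(lengths),
        "median_doc_length": median(lengths),
    }


# Made-up example documents in the style of real estate listings.
docs = [
    "Spacious two bedroom home",
    "Cozy home near the park",
    "Charming home",
]
summary = describe_corpus(docs)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Reporting both tokens and types, along with the number of documents, gives readers a quick sense of whether the dataset is large enough for the planned analyses.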
Intermediate report
This is a 1-2 page document (single-spaced, single column) that describes the current status of your project. It needs to contain the following information:
Research questions: any changes?
Method: any changes?
Status:
Describe the data that you obtained: source, size, and relevant descriptive statistics (if any)
Describe at least one statistical analysis, relevant to your research questions, that you have already run on the data
You also need to take into account the feedback that you got on the Initial project description.
Short presentation
This is a short presentation to the class. You should discuss:
Research questions and hypotheses
Why is this relevant? (Spend a lot of time on the research questions and their relevance. Describing the big picture is important!)
Data (briefly, but do talk about size and other statistics)
Results
You will need to prepare slides for this, which you submit to the instructor ahead of time.
Final report
This is a 4-5 page document (single-spaced, single column) that describes the results of your project. It needs to contain the following information:
Brief recap: research questions and hypotheses
Data: source, size, other relevant statistics
Method: statistical analyses that you used
Findings
If you build on previous work, you need to discuss it and give references. Published papers (at conferences, in journals) go into the references list at the end of the paper. Links to blog posts and the like go in footnotes, as do links to websites containing data; neither belongs in the references list.
You need to take into account the feedback that you got on the Initial project description and Intermediate report.