Analyzing linguistic data: project
Steps for doing your project
Ideally, you pick a topic of your own that you are curious about. But to give you an idea of possible topics, here are a few pointers:
Some ideas from Language Log's breakfast experiments:
The statistics of real estate listings: linking real estate price to the language in the descriptions
Analyzing Twitter data: Which words are used proportionally much more often by women than by men, and vice versa? (See this Language Log post.) Which words tend to occur with :-) and which with :-( ? In Twitter comments on a single issue (a proposed bill, a soccer tournament), which words tend to occur with which opinion? Is it possible to detect sarcasm in tweets? See also this Language Log post.
Author analysis: analyzing poems to detect who may have written them, and what characteristics they have
Online recipe collections: You could ask, for example, whether it is possible to predict the number of stars that a recipe will get from the recipe ingredients. See also Dan Jurafsky's language of food papers.
Noah Smith has a few nice datasets to analyze:
Please discuss your topic with the instructor to make sure that it is both substantial and feasible.
For your course project, you will need to apply statistical analyses yourself. Google Books n-gram charts, while pretty, do not count.
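To give a sense of what a self-made analysis could look like, here is a minimal sketch for the Twitter idea above: scoring which words one group uses proportionally more often than another. The function name `log_frequency_ratios`, the add-0.5 smoothing, and the toy token lists are all assumptions made for illustration; a real project would use real data and would probably add a significance test.

```python
import math
from collections import Counter

def log_frequency_ratios(tokens_a, tokens_b, smoothing=0.5):
    """Return {word: log ratio of smoothed relative frequencies, A vs. B}.

    Positive scores mean the word is relatively more frequent in group A,
    negative scores mean it is relatively more frequent in group B.
    Smoothing keeps words that are missing from one group from causing
    a division by zero.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = len(tokens_a) + smoothing * len(vocab)
    total_b = len(tokens_b) + smoothing * len(vocab)
    scores = {}
    for word in vocab:
        p_a = (counts_a[word] + smoothing) / total_a
        p_b = (counts_b[word] + smoothing) / total_b
        scores[word] = math.log(p_a / p_b)
    return scores

# Toy, made-up token lists purely for illustration.
group_a = "great game great goal what a match".split()
group_b = "boring game bad referee what a mess".split()

scores = log_frequency_ratios(group_a, group_b)
top_a = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_a)
```

Words shared equally by both groups score close to zero, so sorting by score in either direction surfaces the words most characteristic of each group.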
Course project information
By default, course projects should be done by teams of 2 students; however, projects done by 1 or 3 students are possible with prior approval of the instructor.
Initial project description
This is a 1-2 page document (single-spaced, single column) that describes what your project will be about. It needs to contain the following information:
Motivation: Why should people be interested?
Research questions: What are the main questions that you want to answer? What are your hypotheses about what the answers will be?
Method: What statistical analyses will you use to test your hypotheses? What kind of data will you use to test them?
Data: At this point, you need to have determined that you will be able to get the data you need to run your analyses. Say what data you will use, how large your dataset is (number of words, number of relevant documents, ...), and how you will obtain the data.
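The dataset size asked for above can be computed with a few lines of code. The following is a rough sketch, assuming the data is already loaded as a list of strings, one per document; the helper name `corpus_size` and the very crude `\w+` tokenizer are illustrative choices, not a required method.

```python
import re
from collections import Counter

def corpus_size(documents):
    """Summarize a list of document strings: documents, tokens, types."""
    token_pattern = re.compile(r"\w+")  # crude tokenizer (an assumption)
    counts = Counter()
    for text in documents:
        counts.update(token_pattern.findall(text.lower()))
    return {
        "documents": len(documents),
        "tokens": sum(counts.values()),   # running words
        "types": len(counts),             # distinct word forms
    }

# Two made-up listings standing in for a real dataset.
docs = [
    "Sunny two-bedroom flat near the park.",
    "Charming flat, needs some work.",
]
summary = corpus_size(docs)
print(summary)
```

Note that the tokenizer splits "two-bedroom" into two tokens; whatever tokenization you choose, state it when you report the number of words.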
Intermediate report
This is a 1-2 page document (single-spaced, single column) that describes the status of your project at this point. It needs to contain the following information:
Research questions: any changes?
Method: any changes?
Describe the data that was obtained: source, size, and relevant descriptive statistics (if any)
Describe at least one statistical analysis of the data relevant to your research questions that you have already done
You also need to take into account the feedback that you got on the Initial project description.
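As one example of the kind of statistical analysis that could be reported at this stage, the sketch below runs a Pearson chi-square test of independence on a 2x2 table, for instance whether a word occurs more often in tweets with :-) than in tweets with :-(. The function name `chi_square_2x2` and all counts are made up for illustration, and no continuity correction is applied; whether this test suits your data is something to check against your own research questions.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom,
    without continuity correction.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one observation")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Made-up counts of tweets for a hypothetical word of interest:
#                 with :-)   with :-(
# word present       30         10
# word absent        70         90
stat, p = chi_square_2x2(30, 10, 70, 90)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

A small p-value here would suggest that the word's presence and the emoticon are not independent in these counts; it says nothing yet about why.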
Project presentation
This is a short presentation to the class. You should discuss:
Research questions and hypotheses
Why is this relevant? (Spend a lot of time on the research questions and their relevance. Describing the big picture is important!)
Data (briefly, but do talk about size and other statistics)
You will need to prepare slides for this, which you submit to the instructor ahead of time.
Final report
This is a 4-5 page document (single-spaced, single column) that describes the results of your project. It needs to contain the following information:
Brief recap: research questions and hypotheses
Data: source, size, other relevant statistics
Method: statistical analyses that you used
If you build on previous work, you need to discuss it, and give references. Published papers (at conferences, in journals) go into the references list at the end of the paper. Links to blog posts and the like go in a footnote. Also, links to websites containing data go in a footnote, not in the references list.
You need to take into account the feedback that you got on the Initial project description and the Intermediate report.