Schedule: words in a haystack
This schedule is subject to change.
Assignments are due by midnight at the end of their due date.
Readings can be done either before or after class; they are chosen to support the material covered in class.
Week 1
Aug 27: Introduction
Links:
Counting politicians' words: SOTU addresses over the years, SOTU pronouns, more presidential pronouns, presidential pronouns again (There was also at some point a nice tally of the rate at which different candidates used particular words and parts of speech, but it has disappeared. Tell me if you find it.)
Week 2
Week 3
Sep 9: Descriptive statistics: central tendency and spread.
Readings: SE ch. 2
For computing the mode of a sample, see the R code snippets page
Data: Fisher telephone conversation corpus: table 1, table 2
Do women talk more than men do? Here are some relevant Language Log posts:
http://itre.cis.upenn.edu/~myl/languagelog/archives/003420.html
http://itre.cis.upenn.edu/~myl/languagelog/archives/003607.html
Sep 11: Descriptive statistics continued
Week 4
Sep 15: Invited talk: Statistics in practice
Sep 17: Invited talk: Statistics in practice
Homework 1 due
Week 5
Sep 22: Obtaining your data: Getting text into R
Sep 24: Obtaining your data, continued
Week 6
Sep 29: Probability distributions.
Readings: SE ch. 3, ALD ch. 3 pp. 44-63
We will discuss your project ideas in class
Oct 1: Probability distributions continued. Samples and populations.
Week 7
Oct 6: Statistical tests: the basic principle; significance thresholds. Also: starting on the t-test
Readings: SE ch. 4, 5, 6
Homework 2 due
Oct 8: The t-test, continued. Also: confidence intervals
Readings: SE ch. 7, 8, 9
Data: Fisher word counts; Fisher telephone corpus, analyzed using LIWC (warning: big dataset!); meta-data for the Fisher telephone corpus
Initial project descriptions due
Week 8
Oct 13: Comparing more than two data sets: ANOVA
Pitfalls of statistical analysis: The problem of multiple testing https://xkcd.com/882/
Readings: SE ch. 11, 12
Oct 15: Correlation and linear regression
Readings: SE ch. 20, pp. 261-70
Language Log on correlation and causation
And another note on correlation and causation: "Nobel prize winners eat more chocolate"??
Week 9
Oct 20: More linear regression
R code: more linear regression
Readings: ALD ch. 4 pp. 84 - 10; SE rest of ch. 20, ch. 21
Oct 22: Logistic regression
Week 10
Oct 27: Logistic regression in R
Readings: ALD ch. 6 pp. 195-199, 202-203
Oct 29: Model comparison
Readings: ALD ch. 6 pp. 174-188
Homework 3 due
Week 11
Nov 3: Practising linear and logistic regression
Nov 5: Statistical testing: chi-squared
Readings: SE ch. 19
Intermediate project reports due
Week 12
Nov 10: Classification
Nov 12: Classification continued
Data: Wine reviews and prices
Week 13
Nov 17: Clustering for data exploration
Readings: ALD ch. 5 pp. 138-148
Nov 19: Topic modeling
Homework 4 due
Week 14
Nov 24: Statistical pitfalls to avoid
Nov 26: Thanksgiving
Week 15
Dec 1: Project presentations
14:00 Weisz
14:09 Cavasso/Ebmeier
14:18 Reinhard/Idonor
14:27 Donohoe
14:36 Bortner
14:45 Chew/Nelson
14:54 Dang/Zuniga
15:03 Barajas/Kuhn
Dec 3: Project presentations
14:00 Moran
14:09 Radpour
14:18 Charles
14:27 Fofliger
14:36 Broomfield/Gavino
14:45 Adams/Walker
14:54 Paulter
Final report due: Monday December 14, 2015, end of day