# Schedule: words in a haystack

This schedule is subject to change.

Assignments are due at the end of the day (midnight) on their due date.

Readings can be done either before or after class; they are chosen to support the material covered in class.

### Week 1

Aug 27: Introduction

Links:

Counting politicians' words: SOTU addresses over the years, SOTU pronouns, more presidential pronouns, presidential pronouns again (There was also at some point a nice tally of the rate at which different candidates used particular words and parts of speech, but it has disappeared. Tell me if you find it.)

### Week 2

### Week 3

Sep 9: Descriptive statistics: central tendency and spread.

Readings: SE ch. 2

For computing the mode of a sample, see the R code snippets page.

Data: Fisher telephone conversation corpus: table 1, table 2

Do women talk more than men do? Here are some relevant Language Log posts:

http://itre.cis.upenn.edu/~myl/languagelog/archives/003420.html

http://itre.cis.upenn.edu/~myl/languagelog/archives/003607.html

Sep 11: Descriptive statistics continued

### Week 4

Sep 15: *Invited talk: Statistics in practice*

Sep 17: *Invited talk: Statistics in practice*

**Homework 1 due**

### Week 5

Sep 22: Obtaining your data: Getting text into R

Sep 24: Obtaining your data, continued

### Week 6

Sep 29: Probability distributions.

Readings: SE ch. 3, ALD ch. 3 pp. 44-63

**We talk in class about your project ideas**

Oct 1: Probability distributions continued. Samples and populations.

### Week 7

Oct 6: Statistical tests: the basic principle; significance thresholds. Also: starting on the t-test

Readings: SE ch. 4, 5, 6

**Homework 2 due**

Oct 8: The t-test, continued. Also: confidence intervals

Readings: SE ch. 7, 8, 9

Data: Fisher word counts, Fisher telephone corpus, analyzed using LIWC (warning: big dataset!), meta-data for the Fisher telephone corpus

**Initial project descriptions due**

### Week 8

Oct 13: Comparing more than two data sets: ANOVA

Pitfalls of statistical analysis: The problem of multiple testing https://xkcd.com/882/

Readings: SE ch. 11, 12

Oct 15: Correlation and linear regression

Readings: SE ch. 20 pp. 261-70

Language Log on correlation and causation

And another note on correlation and causation: "Nobel prize winners eat more chocolate"??

### Week 9

Oct 20: More linear regression

R code: more linear regression

Readings: ALD ch. 4 pp. 84 - 10; SE rest of ch. 20, ch. 21

Oct 22: Logistic regression

### Week 10

Oct 27: Logistic regression in R

Readings: ALD ch. 6 pp. 195-199, 202-203

Oct 29: Model comparison

Readings: ALD ch. 6 pp. 174-188

**Homework 3 due**

### Week 11

Nov 3: Practising linear and logistic regression

Nov 5: Statistical testing: chi-squared

Readings: SE ch. 19

**Intermediate project reports due**

### Week 12

Nov 10: Classification

Nov 12: Classification continued

Data: Wine reviews and prices

### Week 13

Nov 17: Clustering for data exploration

Readings: ALD ch. 5 pp. 138-148

Nov 19: Topic modeling

**Homework 4 due**

### Week 14

Nov 24: Statistical pitfalls to avoid

Nov 26: *Thanksgiving*

### Week 15

Dec 1:

**Final report due: Monday December 14, 2015, end of day**