Analyzing linguistic data: schedule
This schedule is subject to change.
Assignments are due at class time (11am) on their due date. Please submit assignments online on Canvas unless the assignment tells you otherwise.
Readings can be done either before or after class (unless noted otherwise); they are chosen to support the material covered in class.
Week 1
Jan 19 Introduction
Analyzing text to gain insight about language, and about the people who use it:
Foundations of programming
Jan 21 Programming first steps
Week 2
Jan 26 Exploring and visualizing data
We use the inaugural speeches dataset available on Canvas as inaugural.csv. Please put it in the same directory as your Jupyter notebooks.
Jan 28 Exploring and visualizing data, continued
Code: same worksheet as Jan 26.
Week 3
Feb 2 Programming basics: conditions, lists, and loops.
Feb 4 Programming basics: conditions, lists, and loops, continued
Week 4
Text processing
Feb 9 Programming basics: Dictionaries/maps
Homework 1 due
Feb 11 Counting words, and tools for text preprocessing
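To give a rough idea of what counting words involves, here is a minimal sketch that uses only Python's standard library. The sample sentence, the tokenize helper, and the choice to lowercase and split on a simple regular expression are assumptions made for illustration; they are not the code or data used in class.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())


# Made-up sample text, for illustration only.
sample = "The cat saw the dog. The dog saw the cat, too."

# Counter maps each distinct token to the number of times it occurs.
counts = Counter(tokenize(sample))

# Show the three most frequent words with their counts.
for word, n in counts.most_common(3):
    print(word, n)
```

Lowercasing before counting merges "The" and "the" into one entry; whether that is desirable depends on the question being asked of the text.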
Week 5
Feb 16 Class canceled due to inclement weather
Feb 18 Class canceled due to inclement weather
Week 6
Feb 23 Class canceled due to inclement weather
Material we did not cover synchronously in class
Project ideas: We discuss your course project ideas on the "course-project-planning" channel on Slack. See you there!
Jupyter notebooks:
Summarizing and exploring data
Feb 25 Descriptive statistics: central tendency and spread
Readings: SE chapter 2
Do women talk more than men do? Here are some relevant Language Log posts:
http://itre.cis.upenn.edu/~myl/languagelog/archives/003420.html
http://itre.cis.upenn.edu/~myl/languagelog/archives/003607.html
Code for download: central tendency and spread
We use the Fisher telephone data available on Canvas as fisher1.tbl
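As a small illustration of central tendency and spread, the sketch below computes the mean, median, sample standard deviation, and range with Python's statistics module. The numbers are made up for illustration and are not taken from fisher1.tbl; the variable names are likewise assumptions.

```python
import statistics

# Made-up word counts per speaker, for illustration only.
word_counts = [850, 920, 1010, 780, 1500, 940, 990]

# Central tendency: where the "middle" of the data lies.
mean = statistics.mean(word_counts)
median = statistics.median(word_counts)

# Spread: how far the values scatter around the middle.
stdev = statistics.stdev(word_counts)  # sample standard deviation
value_range = max(word_counts) - min(word_counts)

print(f"mean={mean:.1f} median={median} stdev={stdev:.1f} range={value_range}")
```

The single large value (1500) pulls the mean above the median, which is one reason to report both.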
Week 7
Mar 2 Separating the wheat from the chaff: Identifying important words
Mar 4 Clustering: Automatically grouping data for exploratory data analysis
Week 8
Mar 9 More clustering techniques
Initial project description due.
Mar 11 Discussion session: What can we find out about people from what they say? And what should we try to find out?
Readings we'll use in class:
Homework 2 due.
Week 9
Spring break
Week 10
Risky conclusions: Hypothesis testing
Mar 23 Samples and populations, probability distributions, and the central limit theorem
Readings: SE ch. 3, 4, 5
Cartoon Introduction to Statistics, available on Canvas:
Samples and populations: 1-cartoon-sampling
Confounder variables: 2-cartoon-two-variables
The central limit theorem: 3-cartoon-central-limits
Applying the central limit theorem: 4-cartoon-applying-central-limits
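The central limit theorem can also be explored by simulation. The sketch below is only an illustration under stated assumptions: the population is a made-up exponential distribution, and the sample_means helper, the sample sizes, and the number of repetitions are hypothetical choices, not material from the readings.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# A skewed, clearly non-normal population: exponential with mean 1.
population = [random.expovariate(1.0) for _ in range(100_000)]


def sample_means(sample_size, n_samples=2_000):
    """Draw n_samples random samples and return the mean of each one."""
    return [statistics.mean(random.sample(population, sample_size))
            for _ in range(n_samples)]


# Compare the distribution of sample means for a small and a larger sample size.
for size in (5, 50):
    means = sample_means(size)
    print(size, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
```

In both cases the sample means center on the population mean, and their spread shrinks as the sample size grows, even though the population itself is skewed.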
Mar 25 Hypothesis testing
Readings: SE ch. 6, 7, 8
Pitfalls of statistical analysis: The problem of multiple testing https://xkcd.com/882/
Cartoon Introduction to Statistics, available on Canvas: Hypothesis testing: 5_cartoon_hypothesistesting
Week 11
Mar 30 A hypothesis test that is particularly useful for text data: chi-squared
Reading: SE ch. 19
Code for download: the chi-squared test
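For orientation, here is a minimal sketch of a chi-squared test on a 2x2 table of word counts, written with the standard library only. The counts are made up, the variable names are assumptions, and this is not the worksheet code. No continuity correction is applied.

```python
import math

# Made-up counts: a target word vs. all other words in two corpora.
observed = [[40, 960],   # corpus A: target word, other words
            [20, 980]]   # corpus B: target word, other words

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Pearson's statistic: sum over cells of (observed - expected)^2 / expected,
# where expected counts assume the word is equally frequent in both corpora.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

# A 2x2 table has 1 degree of freedom; for 1 degree of freedom the
# upper-tail probability of the chi-squared distribution is erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2={chi2:.3f} p={p_value:.4f}")
```

In practice a library routine such as scipy.stats.chi2_contingency is the usual choice; note that it applies Yates' continuity correction to 2x2 tables by default, so its result will differ somewhat from this uncorrected sketch.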
More programming
April 1 Python list comprehensions, and how to use them with pandas
Intermediate project report due.
Week 12
April 6 Python: defining your own functions and structuring your programs; Python exceptions
Correlation
April 8 Correlation and linear regression
Code for download: correlation.
Readings: SE ch. 20, pp. 261-70
Language Log on correlation and causation
And another note on correlation and causation: "Nobel prize winners eat more chocolate"??
While we are on the topic of how hard it is to draw good risky conclusions: "p-value hacking"
Week 13
April 13 More linear regression
Code for download: linear regression with Python statsmodels
Readings: SE, rest of ch. 20, ch. 21
Homework 3 due
April 15 Multiple regression, and practicing regression
Week 14
April 20 Logistic regression
April 22 What is the best characterization of my data? Model comparison
Week 15
April 27 Practicing regression
Homework 4 due
April 29 What is the connection between ANOVA and regression?
Reading: SE ch. 11
Week 16
Project presentations
May 4: In-person meeting: Project presentations
You can also present your project via Zoom. Do what is safe for you.
We have 7 minutes for each group.
Schedule (made using Python's random.shuffle()):
11:00 Richard McCanlies
11:07 Teddy Mutiga
11:14 Vittoria Byland
11:21 Gonsala Chavez
11:28 Prachi Shah
11:35 Hayden Shaw
11:42 Francesco Leone
11:49 Sunny Ananthanarayan
11:56 Eliza Anzualda and Kristen Shotton
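A presentation order like the one above can be drawn with random.shuffle(). The sketch below shows one way to do it under stated assumptions: the group names are placeholders rather than the class roster, and the 11:00 start and 7-minute slots are taken from this schedule. It is not necessarily how the actual order was produced.

```python
import random
from datetime import datetime, timedelta

# Placeholder group names, not the actual class roster.
groups = ["Group A", "Group B", "Group C", "Group D"]

random.shuffle(groups)  # shuffles the list in place, returns None

start = datetime(2000, 1, 1, 11, 0)  # arbitrary date; only the time of day matters
slot = timedelta(minutes=7)

# Pair each group with its start time: 11:00, 11:07, 11:14, ...
schedule = [((start + i * slot).strftime("%H:%M"), name)
            for i, name in enumerate(groups)]

for time, name in schedule:
    print(time, name)
```

Because random.shuffle() modifies the list in place, the original order is lost; shuffle a copy if the original order is still needed.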
May 6: In-person meeting: Project presentations
You can also present your project via Zoom. Do what is safe for you.
Schedule:
11:00 Yuhao Dai
11:07 Angelo Ganichaux
11:14 Grace Kim
11:21 Grey Sandstrum and Galina Bouyer
11:28 Kinda Nahas
11:35 Eddie Castillo and Austin Rinn
11:42 McGhee and Autumn Spalding
11:49 Michael Sullivan
Final report due: Sunday May 16, end of day.