LIN 392 Working with Corpora
Corpus linguistics is the use of text corpora to explore, document, and model linguistic phenomena. This course provides a practical introduction to working with corpora.
The purpose of this course is to provide the student with a basic toolbox for working with corpora. The student will get to know current best practice in the construction and annotation of corpora, get to know search tools for locating occurrences of relevant phenomena in a corpus, and learn to use Python, a high-level programming language, to process text corpora. We will discuss examples of corpus-creation projects and formats for annotating corpora.
This course is designed for students with no prior experience in programming. Its aim is to enable students to perform their own corpus-based studies.
Graduate students from departments other than Linguistics are welcome to take this class.
Adapting the class format to deal with the ongoing pandemic
Week 1: Jan 18, 20: This week fully online
Tuesday: Introduction and course overview
Thursday: Working with corpora: a lightning tour; corpora and resources
Week 2: Jan 25, 27: This week fully online
Tuesday: An overview of existing corpora and resources
Part 1: Introduction to programming
Thursday: Python programming: first steps
Code for download: First steps in Python
Note: This file is a Jupyter notebook (extension .ipynb). If you download it on Windows, you may get an error message saying that Windows does not know which program to use to open the file. You can ignore this message: the file is still saved to your Downloads folder. Open Anaconda and, from within Anaconda, launch Jupyter Notebook. Navigate to your Downloads folder and open the notebook from there.
Code for download: Working with Pandas
Week 3: Feb 1, 3: Tuesday session this week in person
Tuesday: Python programming: data in data frames, and exploring data by plotting it
Thursday: No class, inclement weather
Week 4: Feb 8, 10: This week in person.
Tuesday: Python programming: core program structure
Code for download: Conditions, lists, and loops
Thursday: Python programming: Conditions, lists, and loops, continued. Then: Counting words
Week 5: Feb 15, 17: This week in person.
Tuesday: Python programming: Counting words
Thursday: Accessing data, and text encodings for different writing systems
Code for download: Accessing text data
Week 6: Feb 22, 24:
Tuesday: We discuss your course project plans in class
Part 2: Statistical analyses
Thursday: Finishing up our Python programming unit:
Week 7: Mar 1, 3:
Tuesday: Hypothesis testing
Code for download: Hypothesis testing: the IQ example illustrated
Thursday: Hypothesis testing in practice with Python
Week 8: Mar 8, 10:
Tuesday: Hypothesis testing, continued. Then: Some core ideas in frequentist statistics: correlation and regression
Thursday: Statistical analyses in practice with Python
Week 9: Spring Break
Week 10: Mar 22, 24:
Tuesday: Regression continued: Linear regression with multiple predictors, and logistic regression
Thursday: Regression wrap-up: logistic regression and model comparison
Week 11: Mar 29, 31:
Part 3: Annotation
Tuesday: Annotation formats
Thursday: Crowdsourcing and annotation quality control
Homework 3 due
Links about crowdsourcing:
The first study on the quality of crowdsourced linguistic annotation: Snow et al 2008, "Cheap and Fast — But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks"
The demographics of Amazon Mechanical Turk: Difallah et al, "Demographics and Dynamics of Mechanical Turk Workers"
Crowdsourcing and ethics: an Atlantic article about the misery of crowdsourcing work
Week 12: Apr 5, 7:
Part 4: Search
Tuesday: Regular expressions for pattern-based search in text
Thursday: Advanced regular expressions for search over part of speech and syntax annotation
Week 13: Apr 12, 14:
Tuesday: More Python: making your own functions, and structuring your programs
Thursday: Using textual context as a proxy for meaning. This session is available as a Panopto recording.
Panopto recordings on Canvas: distributional models parts 1 through 3
Slides on Canvas
Week 14: Apr 19, 21:
Part 5: Automatic linguistic analysis
Tuesday: Textual context as a proxy for meaning in the digital humanities. This session is available as a Panopto recording.
Thursday: More Python: manipulating data frames, and list comprehensions
Code for download: topic modeling
Week 15: Apr 26, 28:
Tuesday: Python: complex objects, plus manipulating data frames
Homework 4 due
Week 16: May 3, 5: Project presentations
12:30 Ellis Davenport
12:55 Sarah Ransom-Laud
1:20 Ethan Warren
12:30 Gabriela O'Connor
12:55 Sooyong Lee
1:20 Haleigh Wallace
Final report due: May 11, end of day.
Links and additional readings
Tips and tricks: