LIN 392 Working with Corpora
Spring 2022 | Instructor: Katrin Erk | Tues/Thurs 12:30-2 | On zoom, and RLP 4.422 | Canvas
Course overview
Corpus linguistics is the use of text corpora for exploring, documenting and modeling linguistic phenomena. This course provides a practical introduction to working with corpora.
The purpose of this course is to provide the student with a basic toolbox for working with corpora. The student will get to know current best practice in the construction and annotation of corpora, get to know search tools for locating occurrences of relevant phenomena in a corpus, and learn to use Python, a high-level programming language, to process text corpora. We will discuss examples of corpus-creation projects and formats for annotating corpora.
This course is designed for students with no prior experience in programming. Its aim is to enable students to perform their own corpus-based studies.
Graduate students from departments other than Linguistics are welcome to take this class.
Syllabus
Course Information
Course: Working with Corpora, LIN392, unique number 39705
Semester: Spring 2022
Course page: https://www.katrinerk.com/courses/working-with-corpora
Course times: Tuesday, Thursday 12:30-2
Course location: online via Zoom, with links on Canvas. On in-person days, RLP 4.422
Course on Canvas: https://utexas.instructure.com/courses/1331761
Instructor Contact Information
Office hours: Tuesdays 2-4, Fridays 10-11, via Zoom. The Zoom link is on Canvas.
Office: RLP 4.734
Contact: email katrin dot erk at utexas dot edu, or through Canvas, or on Slack
Prerequisites
Graduate standing.
Syllabus and text
This page serves as the syllabus for this course.
There is no course textbook. Readings will be made available through links in the course schedule below.
Content overview
This course provides an in-depth introduction to the construction and use of corpora for linguistic analyses, and provides the student with a collection of tools for automatic analysis.
By the end of this course, you will
be able to make an informed choice of data for a corpus study
know where to look to find data you need for your study
know how to define an appropriate annotation format for an annotation project
understand how to evaluate annotator performance
be able to write programs for extracting and interpreting corpus data using the Python programming language
be familiar with tools for automatic part-of-speech tagging, lemmatization and syntactic analysis, know how to use them in practice, and know about the state of the art in automatic processing for different languages of the world
be familiar with tools for automatically inducing word meaning representations from text, their uses and limitations
Course requirements and grading policy
Assignments (15% each): A series of 4 assignments will be given during the semester. Their purpose is to give you direct experience with the tools and techniques covered in class and the readings. Assignments will be done individually.
Project proposal draft (5%): Midway through the semester, you will propose a topic for your final project. There will be an opportunity to discuss your topic in advance during class. The proposal will be in written form and should be roughly 2 single-spaced pages.
Project progress report (5%): The progress report is a revision of the proposal, extended by at least 2 items of preliminary results. It should take into account comments given on the proposal. Expect it to require significant rewriting, as opposed to just editing the proposal. In addition, it should include an update on progress to date. It should be roughly 3 single-spaced pages.
Project final report (20%): The final report builds on the progress report and presents the project results and conclusions. It should be roughly 8 single-spaced pages in length.
Project presentation (10%): Each student will give a presentation on their project.
Attendance is not required, but it will be very helpful in achieving the course goals, in particular because we will do extensive practical exercises in class.
The course will use plus-minus grading, using the following scale (showing Grade and Percentage):
A >= 93%
A- >= 90%
B+ >= 87%
B >= 83%
B- >= 80%
C+ >= 77%
C >= 73%
C- >= 70%
D+ >= 67%
D >= 63%
D- >= 60%
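To see how the component weights and the scale above combine, here is a minimal sketch in Python. The names WEIGHTS, course_percentage and letter_grade are made up for this illustration, and two things are assumptions rather than stated policy: that percentages are not rounded before the scale is applied, and that anything below 60% is an F.

```python
# Weights in percent of the final grade, taken from the course requirements above.
# The dictionary keys are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "assignment_1": 15,
    "assignment_2": 15,
    "assignment_3": 15,
    "assignment_4": 15,
    "proposal": 5,
    "progress_report": 5,
    "final_report": 20,
    "presentation": 10,
}

# Plus-minus scale from the syllabus: (minimum percentage, letter grade).
SCALE = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def course_percentage(scores):
    """Weighted average of per-component scores, each given out of 100."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(percentage):
    """Map a percentage to a letter grade using the scale above."""
    for cutoff, letter in SCALE:
        if percentage >= cutoff:
            return letter
    return "F"  # assumption: the syllabus does not list a grade below 60%


# Example: 90 out of 100 on every component.
example_scores = {name: 90 for name in WEIGHTS}
print(letter_grade(course_percentage(example_scores)))
```

The weights are kept as whole percentages and divided by 100 at the end, which avoids small floating-point errors right at a grade boundary.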
Extension Policy
If you turn in your assignment late and we have not agreed on an extension beforehand, expect points to be deducted. Extensions will be considered on a case-by-case basis. I urge you to let me know if you are in need of an extension, so that we can make sure that you get the time necessary to complete the assignments.
If an extension has not been agreed on beforehand, then for assignments, by default, 5 points (out of 100) will be deducted for lateness, plus an additional 1 point for every 24-hour period beyond 2 that the assignment is late.
Note that there are always some points to be had, even if you turn in your assignment late. The last day in the semester on which the class meets is the last day to turn in late assignments for grading. Homework assignments submitted after that date will not be graded.
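The default late deduction can be worked out as in the following sketch. It is one possible reading of the rule, not an official calculator: the function name late_penalty is made up, and it assumes that a 24-hour period counts as soon as it has started and that "beyond 2" means from the third period on. If in doubt, ask the instructor.

```python
import math


def late_penalty(hours_late):
    """Points (out of 100) deducted for an assignment without an agreed extension.

    Assumption: a 24-hour period counts once it has started, so the
    number of periods is rounded up.
    """
    if hours_late <= 0:
        return 0  # on time: no deduction
    periods = math.ceil(hours_late / 24)
    # Flat 5 points for lateness, plus 1 point per period beyond the second.
    return 5 + max(0, periods - 2)


# A few cases: 1 hour late, exactly 2 days late, just over 2 days late, 5 days late.
for hours in (1, 48, 49, 120):
    print(hours, "hours late:", late_penalty(hours), "points deducted")
```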
Classroom safety and Covid-19
To help preserve our in-person learning environment, the university recommends the following.
Adhere to university mask guidance.
Vaccinations are widely available, free and not billed to health insurance. The vaccine will help protect against the transmission of the virus to others and reduce serious symptoms in those who are vaccinated.
Proactive Community Testing remains an important part of the university’s efforts to protect our community. Tests are fast and free.
Visit protect.utexas.edu for more information
Academic Dishonesty Policy
You are encouraged to discuss assignments with classmates, but all written work must be your own. Students caught cheating will automatically fail the course. If in doubt, ask the instructor.
Notice about students with disabilities
The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 471-6259.
Notice about missed work due to religious holy days
A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.
Emergency Evacuation Policy
Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside. Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building. Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class. In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office. Information regarding emergency evacuation routes and emergency procedures can be found at http://www.utexas.edu/emergency.
Senate Bill 212 and Title IX Reporting Requirements
Under Senate Bill 212 (SB 212), the professor and TAs for this course are required to report for further investigation any information concerning incidents of sexual harassment, sexual assault, dating violence, and stalking committed by or against a UT student or employee. Federal law and university policy also require reporting incidents of sex- and gender-based discrimination and sexual misconduct (collectively known as Title IX incidents). This means we cannot keep confidential information about any such incidents that you share with us. If you need to talk with someone who can maintain confidentiality, please contact University Health Services (512-471-4955 or 512-475-6877) or the UT Counseling and Mental Health Center (512-471-3515 or 512-471-2255). We strongly urge you to make use of these services for any needed support and to report any Title IX incidents to the Title IX Office.
Adapting the class format to deal with the ongoing pandemic
The first two weeks of class will be fully online.
After that, we move to the originally planned hybrid format, where some class sessions will be fully online, and others will be offered in person (though see next bullet point).
The schedule will list clearly which days are offered in person and which are fully online. All classes will be streamed on Zoom, and students who prefer to take the class fully online will be able to do so.
Schedule
This schedule is subject to change.
Assignments are due at the end of the day on their due date. Please submit assignments online on Canvas unless the assignment tells you otherwise.
Readings can be done either before or after class (unless noted otherwise); they are chosen to support the material covered in class.
Week 1: Jan 18, 20: This week fully online
Tuesday: Introduction and course overview
Thursday: Working with corpora: a lightning tour; corpora and resources
Week 2: Jan 25, 27: This week fully online
Tuesday: An overview of existing corpora and resources
We continue with the slideset from last week
Links for the resources we discuss today:
NLTK resources: Chapter 2 of the NLTK book
VerbNet: current edition, and new edition
Part 1: Introduction to programming
Thursday: Python programming: first steps
Code for download: First steps in Python
Note: This file is a Jupyter notebook, extension .ipynb. If you download it on Windows, you may get an error message saying that Windows does not know which program to use to open the file. You can ignore the message: the file is still saved to your Downloads folder. Open Anaconda and, within Anaconda, Jupyter notebooks. Navigate to your Downloads folder and open the notebook from within Anaconda.
Code for download: Working with Pandas
Week 3: Feb 1, 3: Tuesday session this week in person
Tuesday: Python programming: data in data frames, and exploring data by plotting it
We continue with the worksheet on Working with Pandas
Thursday: No class, inclement weather
Week 4: Feb 8, 10: This week in person.
Tuesday: Python programming: core program structure
Code for download: Conditions, lists, and loops
Thursday: Python programming: Conditions, lists, and loops, continued. Then: Counting words
Code for download: Conditions, lists, and loops
Code for download: Counting words
Homework 1 due
Week 5: Feb 15, 17: This week in person.
Tuesday: Python programming: Counting words
We finish up the notebook on Counting words
Code for download: Transferring word counts into Pandas dataframes
Thursday: Accessing data, and text encodings for different writing systems
Code for download: Accessing text data
Week 6: Feb 22, 24:
Tuesday: We discuss your course project plans in class
Part 2: Statistical analyses
Thursday: Finishing up our Python programming unit:
Finishing the worksheet on Transferring word counts into Pandas dataframes
Discussing the worksheet about Accessing text data
Some core ideas in frequentist statistics: populations and samples, central tendency and spread, reasoning under uncertainty
Code for download: Central tendency and spread
Week 7: Mar 1, 3:
Tuesday: Hypothesis testing
Code for download: Hypothesis testing: the IQ example illustrated
Thursday: Hypothesis testing in practice with Python
Code for download: probability distributions and Python
Code for download: Hypothesis testing: the t-test
Homework 2 due
Week 8: Mar 8, 10:
Tuesday: Hypothesis testing, continued. Then: Some core ideas in frequentist statistics: correlation and regression
Code for download: Hypothesis testing: the chi-squared test
Code for download: Correlation
Code for download: Linear regression
Project proposal due
Thursday: Statistical analyses in practice with Python
Code for download: More linear regression
Code for download: Logistic regression
Code for download: Model comparison
Week 9: Spring Break
Week 10: Mar 22, 24:
Tuesday: Regression continued: Linear regression with multiple predictors, and logistic regression
Code for download: More linear regression
Code for download: Logistic regression
Thursday: Regression wrap-up: logistic regression and model comparison
Code for download: Logistic regression
Code for download: Model comparison
Week 11: Mar 29, 31:
Part 3: Annotation
Tuesday: Annotation formats
Thursday: Crowdsourcing, Annotation quality control
Homework 3 due
Links about crowdsourcing:
The first study on the quality of crowdsourced linguistic annotation: Snow et al 2008, "Cheap and Fast — But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks"
The demographics of Amazon Mechanical Turk: Difallah et al, "Demographics and Dynamics of Mechanical Turk Workers"
Crowdsourcing and ethics: an Atlantic article about the misery of crowdsourcing work
Week 12: Apr 5, 7:
Part 4: Search
Tuesday: Regular expressions for pattern-based search in text
Thursday: Advanced regular expressions for search over part of speech and syntax annotation
Slides on Canvas
the OPUS collection of online corpora: https://opus.nlpl.eu/
Project progress report due
Week 13: Apr 12, 14:
Tuesday: More Python: making your own functions, and structuring your programs
Code for download: functions in Python
Code for download: Python list comprehensions
Thursday: Using textual context as a proxy for meaning. This session is available as a Panopto recording.
Panopto recordings on Canvas: distributional models parts 1 through 3
Slides on Canvas
Week 14: Apr 19, 21:
Part 5: Automatic linguistic analysis
Tuesday: Textual context as a proxy for meaning in the digital humanities. This session is available as a Panopto recording.
Panopto recording on Canvas: using distributional models
Code for download: using pre-computed semantic spaces
Code for download: computing your own semantic spaces
Thursday: More Python: manipulating data frames, and list comprehensions
Code for download: topic modeling
Week 15: Apr 26, 28:
Tuesday: Python: complex objects. Plus: manipulating data frames
Code for download: Manipulating pandas data frames
In case you need a template for how to access a corpus made of many files in the same directory: code for download
Thursday:
More on complex objects in Python: We inspect NLTK's FreqDist
Whose data? We start the discussion with Jo and Gebru, Lessons from Archives (You are not expected to read this article ahead of class)
Homework 4 due
Week 16: May 3, 5: Project presentations
Tuesday:
12:30 Ellis Davenport
12:55 Sarah Ransom-Laud
1:20 Ethan Warren
Thursday:
12:30 Gabriela O'Connor
12:55 Sooyong Lee
1:20 Haleigh Wallace
Final report due: May 11 end of day.
Links and additional readings
List of software we will use in the class
Python and Python packages:
We strongly recommend installing Anaconda, as that includes Python along with all Python packages we need.
If you install Anaconda, you will have to add gensim. Here is a tutorial on how to add a package to Anaconda: https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/#installing-a-package
Alternatively, you can individually install:
Python: https://www.python.org/downloads/ Any version >= 3.4 should be fine.
pandas: https://pandas.pydata.org/
numpy: https://numpy.org/
matplotlib: https://matplotlib.org/
statsmodels: https://www.statsmodels.org/stable/index.html
NLTK: Installing NLTK itself: http://www.nltk.org/install.html You also need the NLTK data, see http://www.nltk.org/data.html
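Once everything is installed, a quick way to check your setup is the following sketch, which reports which of the packages listed above Python can find. The function name check_packages is made up for this illustration. The check only tests whether a package is present; it does not verify versions or whether the NLTK data has been downloaded.

```python
from importlib.util import find_spec

# Import names of the packages listed above, plus gensim.
COURSE_PACKAGES = ["pandas", "numpy", "matplotlib", "statsmodels", "nltk", "gensim"]


def check_packages(names=COURSE_PACKAGES):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: find_spec(name) is not None for name in names}


# Print one line per package.
for name, found in check_packages().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

You can run this in a Jupyter notebook cell or save it as a .py file. Any package reported as MISSING still needs to be installed.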
Tips and tricks:
Using Python:
Corpus design: