Spring 2022 | Instructor: Katrin Erk | Tues/Thurs 12:30-2 | On zoom, and RLP 4.422 | Canvas
Corpus linguistics is the use of text corpora for exploring, documenting and modeling linguistic phenomena. This course provides a practical introduction to working with corpora.
The purpose of this course is to provide the student with a basic toolbox for working with corpora. The student will get to know current best practice in the construction and annotation of corpora, get to know search tools for locating occurrences of relevant phenomena in a corpus, and learn to use Python, a high-level programming language, to process text corpora. We will discuss examples of corpus-creation projects and formats for annotating corpora.
This course is designed for students with no prior experience in programming. Its aim is to enable students to perform their own corpus-based studies.
Graduate students from departments other than Linguistics are welcome to take this class.
Instructor Contact Information
Office hours: Tuesdays 2-4, Fridays 11-12, via zoom. Zoom link is on Canvas.
office: RLP 4.734
contact: email katrin dot erk at utexas dot edu, or through Canvas, or on Slack
Syllabus and text
This page serves as the syllabus for this course.
There is no course textbook. Readings will be made available through links in the course schedule below.
This course provides an in-depth introduction to the construction and use of corpora for linguistic analyses, and provides the student with a collection of tools for automatic analysis.
By the end of this course, you will
be able to make an informed choice of data for a corpus study,
know where to look to find data you need for your study
know how to define an appropriate annotation format for an annotation project
understand how to evaluate annotator performance
be able to write programs for extracting and interpreting corpus data using the Python programming language
be familiar with tools for automatic part-of-speech tagging, lemmatization and syntactic analysis, know how to use them in practice, and know about the state of the art in automatic processing for different languages of the world
be familiar with tools for automatically inducing word meaning representations from text, their uses and limitations
Course requirements and grading policy
Assignments (15% each): A series of 4 assignments will be given during the semester. Their purpose is to give you direct experience with the tools and techniques covered in class and the readings. Assignments will be done individually.
Project proposal draft (5%): Midway through the semester, you will propose a topic for your final project. There will be an opportunity to discuss your topic in advance during class. The proposal will be in written form and should be roughly 2 single-spaced pages.
Project progress report (5%): The progress report is a revision of the proposal, extended by at least 2 items of preliminary results. It should take into account comments given on the proposal. Expect it to require significant rewriting, as opposed to just editing the proposal. In addition, it should include an update on progress to date. It should be roughly 3 single-spaced pages.
Project final report (20%): The final report builds on the progress report and presents the project results and conclusions. It should be roughly 8 single-spaced pages in length.
Project presentation (10%): Each student will give a presentation on their project.
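As a rough illustration of how the weights above add up, here is a minimal Python sketch. It is not an official grade calculator: the names WEIGHTS and course_percentage are made up for this example, and it assumes that every component is scored from 0 to 100 and that the components are combined as a simple weighted sum.

```python
# Component weights in percent, taken from the list above.
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "assignment_1": 15,
    "assignment_2": 15,
    "assignment_3": 15,
    "assignment_4": 15,
    "proposal_draft": 5,
    "progress_report": 5,
    "final_report": 20,
    "presentation": 10,
}


def course_percentage(scores):
    """Combine component scores (each assumed to be 0-100) into one percentage."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    # Weighted sum; the weights add up to 100, so dividing by 100
    # gives a result on the same 0-100 scale as the inputs.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Example: 90 on every assignment, 80 on every project component.
example_scores = {name: (90 if name.startswith("assignment") else 80)
                  for name in WEIGHTS}
print(course_percentage(example_scores))
```

The four assignments together make up 60% of the grade, and the project components make up the remaining 40%.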
Attendance is not required, but it will be very helpful in achieving the course goals, particularly because we will do extensive practical exercises in class.
The course will use plus-minus grading, using the following scale (showing Grade and Percentage):
A >= 93%
A- >= 90%
B+ >= 87%
B >= 83%
B- >= 80%
C+ >= 77%
C >= 73%
C- >= 70%
D+ >= 67%
D >= 63%
D- >= 60%
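The scale above can be read as a lookup from a percentage to a letter grade. The following Python sketch shows one way to write that lookup. The names GRADE_SCALE and letter_grade are made up for this example. Two things are assumptions, since the scale does not state them: that anything below 60% is an F, and that percentages are compared without rounding.

```python
# (cutoff, grade) pairs from the scale above, highest cutoff first.
GRADE_SCALE = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def letter_grade(percentage):
    """Return the letter grade for a percentage, using the scale above."""
    for cutoff, grade in GRADE_SCALE:
        if percentage >= cutoff:
            return grade
    # Assumption: the scale lists nothing below D-, so treat the rest as F.
    return "F"


print(letter_grade(86))
```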
If you turn in your assignment late and we have not agreed on an extension beforehand, expect points to be deducted. Extensions will be considered on a case-by-case basis. I urge you to let me know if you are in need of an extension, so that we can make sure that you get the time necessary to complete the assignments.
If an extension has not been agreed on beforehand, then for assignments, by default, 5 points (out of 100) will be deducted for lateness, plus an additional 1 point for every 24-hour period beyond 2 that the assignment is late.
Note that there are always some points to be had, even if you turn in your assignment late. The last day in the semester on which the class meets is the last day to turn in late assignments for grading. Homework assignments submitted after that date will not be graded.
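To make the default late deduction described above concrete, here is a minimal Python sketch. The function name late_penalty is made up for this example. It assumes that every started 24-hour period counts as a full period, which the policy does not spell out, and it does not model agreed extensions or the end-of-semester cutoff.

```python
import math


def late_penalty(hours_late):
    """Default deduction (in points out of 100) for a late assignment."""
    if hours_late <= 0:
        return 0  # on time: nothing deducted
    # Assumption: a started 24-hour period counts as a whole period.
    periods_late = math.ceil(hours_late / 24)
    # Flat 5 points for lateness, plus 1 point per period beyond the first 2.
    return 5 + max(0, periods_late - 2)


for hours in (0, 1, 48, 49, 96):
    print(hours, "hours late:", late_penalty(hours), "points deducted")
```

Under these assumptions, an assignment that is up to two days late loses 5 points, and one that is four days late loses 7.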
Classroom safety and Covid-19
To help preserve our in-person learning environment, the university recommends the following.
Academic Dishonesty Policy
You are encouraged to discuss assignments with classmates. But all written work must be your own. Students caught cheating will automatically fail the course. If in doubt, ask the instructor.
Notice about students with disabilities
The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 471-6259.
Notice about missed work due to religious holy days
A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.
Emergency Evacuation Policy
Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside. Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building. Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class. In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office. Information regarding emergency evacuation routes and emergency procedures can be found at http://www.utexas.edu/emergency.
Senate Bill 212 and Title IX Reporting Requirements
Under Senate Bill 212 (SB 212), the professor and TAs for this course are required to report for further investigation any information concerning incidents of sexual harassment, sexual assault, dating violence, and stalking committed by or against a UT student or employee. Federal law and university policy also require reporting incidents of sex- and gender-based discrimination and sexual misconduct (collectively known as Title IX incidents). This means we cannot keep confidential information about any such incidents that you share with us. If you need to talk with someone who can maintain confidentiality, please contact University Health Services (512-471-4955 or 512-475-6877) or the UT Counseling and Mental Health Center (512-471-3515 or 512-471-2255). We strongly urge you to make use of these services for any needed support and to report any Title IX incidents to the Title IX Office.
Adapting the class format to deal with the ongoing pandemic
The first two weeks of class will be fully online.
After that, we move to the originally planned hybrid format, where some class sessions will be fully online, and others will be offered in person (though see next bullet point).
The schedule will list clearly which days are offered in person and which are fully online.
All classes will be streamed on zoom, and students who prefer to take the class fully online will be able to do so.
This schedule is subject to change.
Assignments are due at the end of the day on their due date. Please submit assignments online on Canvas unless the assignment tells you otherwise.
Readings can be done either before or after class (unless noted otherwise); they are chosen to support the material covered in class.
Week 1: Jan 18, 20: This week fully online
Week 2: Jan 25, 27: This week fully online
Part 1: Introduction to programming
Week 3: Feb 1, 3: Both sessions this week in person
Tuesday: Python programming: data in data frames, and exploring data by plotting it
Thursday: Python programming: core program structure
Week 4: Feb 8, 10:
Tuesday: Python programming: core program structure, continued
Thursday: Python programming: Counting words
Week 5: Feb 15, 17:
Tuesday: Python programming: Counting words, continued
Thursday: Accessing data, and text encodings for different writing systems
Week 6: Feb 22, 24:
Part 2: Statistical analyses
Week 7: Mar 1, 3:
Week 8: Mar 8, 10:
Week 9: Spring Break
Week 10: Mar 22, 24:
Part 3: Annotation
Week 11: Mar 29, 31:
Week 12: Apr 5, 7:
Part 4: Search
Week 13: Apr 12, 14:
Tuesday: More Python: making your own functions, and structuring your programs
Thursday: More Python: manipulating data frames, and list comprehensions
Week 14: Apr 19, 21:
Part 5: Automatic linguistic analysis
Tuesday: Tools for automatic linguistic analysis
Thursday: Tools for automatic linguistic analysis, continued
Week 15: Apr 26, 28:
Week 16: May 3, 5: Project presentations
Final report due: tba.
Links and additional readings
List of software we will use in the class
Python and Python packages:
Alternatively, you can individually install:
Tips and tricks: