LIN 392 Working with Corpora

Spring 2022 | Instructor: Katrin Erk | Tues/Thurs 12:30-2 | On Zoom and in RLP 4.422 | Canvas

Course overview

Corpus linguistics is the use of text corpora for exploring, documenting, and modeling linguistic phenomena. This course provides a practical introduction to working with corpora.

The purpose of this course is to provide the student with a basic toolbox for working with corpora. The student will get to know current best practice in the construction and annotation of corpora, get to know search tools for locating occurrences of relevant phenomena in a corpus, and learn to use Python, a high-level programming language, to process text corpora. We will discuss examples of corpus-creation projects and formats for annotating corpora.

This course is designed for students with no prior experience in programming. Its aim is to enable students to perform their own corpus-based studies.

Graduate students from departments other than Linguistics are welcome to take this class.

Syllabus

Course Information

Instructor Contact Information

  • Katrin Erk

  • Office hours: Tuesdays 2-4, Fridays 11-12, via Zoom. The Zoom link is on Canvas.

  • office: RLP 4.734

  • contact: email katrin dot erk at utexas dot edu, or through Canvas, or on Slack

Prerequisites

Graduate standing.

Syllabus and text

This page serves as the syllabus for this course.

There is no course textbook. Readings will be made available through links in the course schedule below.

Content overview

This course provides an in-depth introduction to the construction and use of corpora for linguistic analyses, and provides the student with a collection of tools for automatic analysis.

By the end of this course, you will

  • be able to make an informed choice of data for a corpus study

  • know where to look to find data you need for your study

  • know how to define an appropriate annotation format for an annotation project

  • understand how to evaluate annotator performance

  • be able to write programs for extracting and interpreting corpus data using the Python programming language

  • be familiar with tools for automatic part-of-speech tagging, lemmatization and syntactic analysis, know how to use them in practice, and know about the state of the art in automatic processing for different languages of the world

  • be familiar with tools for automatically inducing word meaning representations from text, their uses and limitations

Course requirements and grading policy

  • Assignments (15% each): A series of 4 assignments will be given during the semester. Their purpose is to give you direct experience with the tools and techniques covered in class and the readings. Assignments will be done individually.

  • Project proposal draft (5%): Midway through the semester, you will propose a topic for your final project. There will be an opportunity to discuss your topic in advance during class. The proposal will be in written form and should be roughly 2 single-spaced pages.

  • Project progress report (5%): The progress report is a revision of the proposal, extended by at least 2 items of preliminary results. It should take into account comments given on the proposal. Expect it to require significant rewriting, as opposed to just editing the proposal. In addition, it should include an update on progress to date. It should be roughly 3 single-spaced pages.

  • Project final report (20%): The final report builds on the progress report and presents the project results and conclusions. It should be roughly 8 single-spaced pages in length.

  • Project presentation (10%): Each student will give a presentation on their project.

Attendance is not required, but it will be very helpful in achieving the course goals, particularly because we will do extensive practical exercises in class.

The course will use plus-minus grading, using the following scale (showing Grade and Percentage):

  • A >= 93%

  • A- >= 90%

  • B+ >= 87%

  • B >= 83%

  • B- >= 80%

  • C+ >= 77%

  • C >= 73%

  • C- >= 70%

  • D+ >= 67%

  • D >= 63%

  • D- >= 60%
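
As an illustration of how the weights and the scale above combine, here is a minimal Python sketch. It is not an official grade calculator. The component names, the function names, the absence of rounding, and the "F" returned below 60% are assumptions made for this example; the syllabus itself only lists the cutoffs from A down to D-.

```python
# Weights in percent, taken from the course requirements above.
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "assignment_1": 15,
    "assignment_2": 15,
    "assignment_3": 15,
    "assignment_4": 15,
    "proposal": 5,
    "progress_report": 5,
    "final_report": 20,
    "presentation": 10,
}

# Cutoffs from the plus-minus scale above, highest first.
SCALE = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def course_percentage(scores):
    """Combine per-component scores (each 0-100) into a course percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(percentage):
    """Map a percentage to a letter using the first cutoff it reaches."""
    for cutoff, letter in SCALE:
        if percentage >= cutoff:
            return letter
    return "F"  # assumption: the syllabus does not list a grade below D-


# Example with made-up scores: 90 on every component.
example_scores = {name: 90 for name in WEIGHTS}
print(course_percentage(example_scores), letter_grade(course_percentage(example_scores)))
```

Storing the weights as whole percentages and dividing once at the end keeps the arithmetic exact for whole-number scores.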

Extension Policy

If you turn in your assignment late and we have not agreed on an extension beforehand, expect points to be deducted. Extensions will be considered on a case-by-case basis. I urge you to let me know if you need an extension, so that we can make sure that you get the time necessary to complete the assignments.

If an extension has not been agreed on beforehand, then for assignments, by default, 5 points (out of 100) will be deducted for lateness, plus an additional 1 point for every 24-hour period beyond 2 that the assignment is late.
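
The default deduction can be sketched in Python as follows. This is one reading of the rule, not an official calculator: it assumes that lateness is measured in hours, that a started 24-hour period counts as a full period, and that the base deduction covers the first 2 periods. The constant and function names are hypothetical.

```python
import math

BASE_PENALTY = 5       # points (out of 100) deducted for any late submission
COVERED_PERIODS = 2    # 24-hour periods covered by the base deduction
EXTRA_PER_PERIOD = 1   # additional points for each later 24-hour period


def late_penalty(hours_late):
    """Return the points deducted for an assignment that is hours_late late."""
    if hours_late <= 0:
        return 0  # on time: nothing is deducted
    # Assumption: a started 24-hour period counts as a full period.
    periods = math.ceil(hours_late / 24)
    extra_periods = max(0, periods - COVERED_PERIODS)
    return BASE_PENALTY + EXTRA_PER_PERIOD * extra_periods


for hours in (0, 10, 48, 49, 100):
    print(hours, "hours late:", late_penalty(hours), "points deducted")
```

Under these assumptions, an assignment turned in 49 hours late falls into the third 24-hour period, so 5 + 1 = 6 points are deducted.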

Note that there are always some points to be had, even if you turn in your assignment late. The last day in the semester on which the class meets is the last day to turn in late assignments for grading. Homework assignments submitted after that date will not be graded.

Classroom safety and Covid-19

To help preserve our in-person learning environment, the university recommends the following.

Academic Dishonesty Policy

You are encouraged to discuss assignments with classmates. But all written work must be your own. Students caught cheating will automatically fail the course. If in doubt, ask the instructor.

Notice about students with disabilities

The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 471-6259.

Notice about missed work due to religious holy days

A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.

Emergency Evacuation Policy

Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside. Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building. Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class. In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office. Information regarding emergency evacuation routes and emergency procedures can be found at http://www.utexas.edu/emergency.

Senate Bill 212 and Title IX Reporting Requirements

Under Senate Bill 212 (SB 212), the professor and TAs for this course are required to report for further investigation any information concerning incidents of sexual harassment, sexual assault, dating violence, and stalking committed by or against a UT student or employee. Federal law and university policy also require reporting incidents of sex- and gender-based discrimination and sexual misconduct (collectively known as Title IX incidents). This means we cannot keep confidential information about any such incidents that you share with us. If you need to talk with someone who can maintain confidentiality, please contact University Health Services (512-471-4955 or 512-475-6877) or the UT Counseling and Mental Health Center (512-471-3515 or 512-471-2255). We strongly urge you to make use of these services for any needed support and to report any Title IX incidents to the Title IX Office.


Adapting the class format to deal with the ongoing pandemic

  • The first two weeks of class will be fully online.

  • After that, we move to the originally planned hybrid format, where some class sessions will be fully online, and others will be offered in person (though see next bullet point).
    The schedule will list clearly which days are offered in person and which are fully online.

  • All classes will be streamed on Zoom, and students who prefer to take the class fully online will be able to do so.

Schedule

This schedule is subject to change.

Assignments are due at the end of the day on their due date. Please submit assignments online on Canvas unless the assignment tells you otherwise.

Readings can be done either before or after class (unless noted otherwise); they are chosen to support the material covered in class.

Week 1: Jan 18, 20: This week fully online

Week 2: Jan 25, 27: This week fully online

Part 1: Introduction to programming

  • Thursday: Python programming: first steps

Week 3: Feb 1, 3: Both sessions this week in person

  • Tuesday: Python programming: data in data frames, and exploring data by plotting it

  • Thursday: Python programming: core program structure

Week 4: Feb 8, 10:

  • Tuesday: Python programming: core program structure, continued

    • Homework 1 due

  • Thursday: Python programming: Counting words

Week 5: Feb 15, 17:

  • Tuesday: Python programming: Counting words, continued

  • Thursday: Accessing data, and text encodings for different writing systems

Week 6: Feb 22, 24:

  • Tuesday: We discuss your course project plans in class

Part 2: Statistical analyses

  • Thursday: Some core ideas in frequentist statistics: populations and samples, reasoning under uncertainty, and hypothesis testing

Week 7: Mar 1, 3:

  • Tuesday: Hypothesis testing in practice with Python

    • Homework 2 due

  • Thursday: Some core ideas in frequentist statistics: correlation and regression

Week 8: Mar 8, 10:

  • Tuesday: Statistical analyses in practice with Python

    • Project proposal due

  • Thursday: Deciphering the output of regression models

Week 9: Spring Break

Week 10: Mar 22, 24:

Part 3: Annotation

  • Tuesday: Annotation formats

  • Thursday: Annotation formats, continued

Week 11: Mar 29, 31:

  • Tuesday: Crowdsourcing, annotation quality control

    • Homework 3 due

  • Thursday: Annotation quality control

Week 12: Apr 5, 7:

Part 4: Search

  • Tuesday: Regular expressions for pattern-based search in text

    • Project progress report due

  • Thursday: Advanced regular expressions for search over part of speech and syntax annotation

Week 13: Apr 12, 14:

  • Tuesday: More Python: making your own functions, and structuring your programs

  • Thursday: More Python: manipulating data frames, and list comprehensions

Week 14: Apr 19, 21:

Part 5: Automatic linguistic analysis

  • Tuesday: Tools for automatic linguistic analysis

  • Thursday: Tools for automatic linguistic analysis, continued

Week 15: Apr 26, 28:

  • Tuesday: Making text-derived word meaning representations

    • Homework 4 due

  • Thursday: Using text-derived word meaning representations

Week 16: May 3, 5: Project presentations

Final report due: tba.

Links and additional readings

List of software we will use in the class

Python and Python packages:

Alternatively, you can individually install:

Tips and tricks:

Using Python:

Corpus design: