Analyzing linguistic data, and programming for linguists
Spring 2022 | Instructor: Katrin Erk | TTH 11-12:30 | Hybrid: zoom, and WAG 208 | Canvas
Course Overview
Today, huge amounts of text are available in electronic form. We can poke these electronic text collections to answer questions about language -- and questions about the people who use it. For example, we can test whether passive constructions are increasingly falling out of favor in English, and we can trace how words change their meaning over time. We can also study a politician's word choices in political debates to find out more about their personality, or we can see how inaugural addresses have changed over time.
This course provides a hands-on introduction to working with text data. It includes an introduction to programming in Python, with a focus on text processing and data exploration, and a "cookbook" of programming examples that will enable you to analyze texts on your own very quickly. Most of the conclusions that we want to draw from text are "risky conclusions": they are trends rather than yes-or-no answers. The course therefore also includes an introduction to statistical techniques for data exploration and for making and assessing "risky conclusions". Finally, there is a course project in which you can test your text analysis skills on a question of your own choice.
Prerequisites: Upper-division standing.
Flags: Quantitative, Independent Inquiry
Readings:
P. R. Hinton (2004): Statistics Explained: A Guide for Social Science Students. Psychology Press; 3rd edition, 2014
Additional readings will be made available for download from the course website.
In Spring 2022, this course is hybrid. More information below!
Syllabus
Course information
Course: Analyzing Linguistic Data, LIN350, unique number 39620
Semester: Spring 2022
Course page: https://www.katrinerk.com/courses/analyzing-linguistic-data-and-programming-for-linguists
Course times: Tuesday, Thursday 11-12:30
Course location: online via zoom, with links on Canvas. On in-person days, WAG 208.
Course on Canvas: https://utexas.instructure.com/courses/1331610
Course Slack space: invite link on Canvas
Instructor Contact Information
Instructor:
Katrin Erk
Office hours: Tuesdays 2-4, Fridays 10-11, via zoom. Zoom link is on Canvas.
Office: RLP 4.734
Contact: email katrin dot erk at utexas dot edu, or through Canvas, or through Slack
TA:
Sydney Willett
Office hours: Mondays 1-3 and Wednesdays 11-12, via zoom. Zoom link is on Canvas.
Contact: email listed on the Canvas page, or through Canvas, or through Slack
Prerequisites
Upper-division standing.
Syllabus and text
This page serves as the syllabus for this course.
Textbook:
P. R. Hinton: Statistics Explained: A Guide for Social Science Students. Psychology Press; 3rd edition.
Additional readings will be made available for download from the course website, in the course schedule.
Content overview
There are immense amounts of text data available in electronic form. And there are many questions that these texts could help us answer. Some of these questions are about people, how they feel, what opinions they express, what topics they talk about. Other questions are about language itself: how language is changing over time, how people use particular grammatical constructions, or what kinds of connotations words carry with them. The aim of this course is to give you the tools to automatically analyze texts to explore these kinds of questions.
By the end of this course, you will
know how to use simple word counts to answer many questions about people and about language, and know how to choose the right words for counting
know how to write programs in the Python programming language to access and analyze texts
know how to visualize and graph descriptive statistics about texts
be familiar with a toolkit of linguistic text preprocessing tools, and know how to use it
know what hypothesis testing is, and how to use it to distinguish actual findings from random variations in the data
be able to use machine learning tools to group data and chart the main topics being talked about in a collection of texts
A detailed schedule for the course, with topics for each lecture, is available in the Schedule section.
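To give a flavor of the first goal listed above, here is a minimal sketch of counting words in Python. The sample sentence, the function name count_words, and the very simple way of splitting text into words are assumptions made for illustration; they are not taken from the course notebooks.

```python
from collections import Counter
import re

def count_words(text):
    """Lowercase the text, pull out runs of letters/apostrophes, and count them."""
    # Very rough tokenization: good enough for a first look, not for real analysis.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical sample text, made up for this sketch.
sample = "The cat saw the dog. The dog saw the cat too."
counts = count_words(sample)

# Show the three most frequent words with their counts.
print(counts.most_common(3))
```

A Counter behaves like a dictionary from words to counts, so counts["dog"] looks up how often a single word occurs.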
Flags
This course carries the Quantitative Reasoning flag. Quantitative Reasoning courses are designed to equip you with skills that are necessary for understanding the types of quantitative arguments you will regularly encounter in your adult and professional life. You should therefore expect a substantial portion of your grade to come from your use of quantitative skills to analyze real-world problems.
This course also carries the Independent Inquiry flag. Independent Inquiry courses are designed to engage you in the process of inquiry over the course of a semester, providing you with the opportunity for independent investigation of a question, problem, or project related to your major. You should therefore expect a substantial portion of your grade to come from the independent investigation and presentation of your own work.
Course requirements and grading policy
Assignments: 48% (4 assignments, 12% each)
"Food for thought": 12% (4 mini-assignments, 3% each)
Course project:
Initial project description: 5%
Intermediate project report: 10%
Project presentation: 5%
Final report: 20%
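As a rough illustration of how the weights above combine into one course percentage, here is a small sketch. The component names and the function course_percentage are made up for this example, and it assumes each component has already been averaged into a score between 0 and 100.

```python
# Weights in percentage points, copied from the list above; they add up to 100.
WEIGHTS = {
    "assignments": 48,
    "food_for_thought": 12,
    "initial_project_description": 5,
    "intermediate_project_report": 10,
    "project_presentation": 5,
    "final_report": 20,
}

def course_percentage(scores):
    """Combine per-component scores (each 0-100) into a weighted course percentage."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return total / 100

# Hypothetical scores, invented for this example.
example_scores = {
    "assignments": 90,
    "food_for_thought": 100,
    "initial_project_description": 80,
    "intermediate_project_report": 85,
    "project_presentation": 95,
    "final_report": 88,
}
print(course_percentage(example_scores))
```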
Course projects should be done by teams of 2 students. Projects done by 1 or 3 students are only possible with prior approval of the instructor.
Project presentations will be in the final week of classes, in the order given on the schedule page (which will be generated via Python's random.shuffle()). If possible, all members of a project team should get some time to speak.
Assignments will be updated on Canvas. A tentative schedule for the entire semester is posted in the Schedule section. Readings may change up to one week in advance of their due dates.
This course does not have a midterm or final exam.
Options for course projects, and more details on the project requirements are listed in the Project section.
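The random presentation order mentioned above could be produced along the following lines. This is only a sketch: the team names are placeholders, and the function presentation_order and its seed parameter are assumptions added here so that the same order can be reproduced.

```python
import random

def presentation_order(teams, seed=None):
    """Return a shuffled copy of the team list, leaving the original untouched."""
    rng = random.Random(seed)   # a private generator, so a seed gives a repeatable order
    order = list(teams)         # copy, because shuffle() rearranges a list in place
    rng.shuffle(order)
    return order

# Placeholder team names, not the actual class roster.
teams = ["Team A", "Team B", "Team C", "Team D"]
order = presentation_order(teams, seed=2022)
print(order)
```

Note that random.shuffle() changes the list it is given and returns None, which is why the sketch shuffles a copy.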
The course will use plus-minus grading, using the following scale (showing Grade and Percentage):
A >= 93%
A- >= 90%
B+ >= 87%
B >= 83%
B- >= 80%
C+ >= 77%
C >= 73%
C- >= 70%
D+ >= 67%
D >= 63%
D- >= 60%
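The scale above can be read as a lookup from percentage to letter grade. Here is a minimal sketch of that lookup. The function name letter_grade is made up, and returning "F" below 60% is an assumption, since the scale above does not list a grade for that range. The sketch also does no rounding.

```python
# Cutoffs copied from the scale above, highest first.
GRADE_SCALE = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]

def letter_grade(percentage):
    """Return the first letter whose cutoff the percentage reaches."""
    for cutoff, letter in GRADE_SCALE:
        if percentage >= cutoff:
            return letter
    return "F"  # assumption: the scale above does not say what happens below 60%

print(letter_grade(85))
```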
Attendance is not required. However, given that we will do a lot of hands-on exercises in class, and the homework assignments and the project address the material covered in class, good attendance is essential for doing well in this class.
Extension Policy
If you turn in your assignment late and we have not agreed on an extension beforehand, expect points to be deducted. Extensions will be considered on a case-by-case basis. I urge you to let me know if you are in need of an extension, such that we can make sure that you get the time necessary to complete the assignments.
If an extension has not been agreed on beforehand, then for assignments, by default, 5 points (out of 100) will be deducted for lateness, plus an additional 1 point for every 24-hour period beyond 2 that the assignment is late.
Note that there are always some points to be had, even if you turn in your assignment late. The last day in the semester on which the class meets is the last day to turn in late assignments for grading. Homework assignments submitted after that date will not be graded.
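To make the default late penalty described above concrete, here is a small sketch. It follows one possible reading of the rule: a flat 5 points for any lateness, plus 1 point for each full 24-hour period after the first two. The function name late_penalty and this reading are assumptions; if in doubt, the wording of the policy above is what counts.

```python
def late_penalty(hours_late):
    """Points (out of 100) deducted under one reading of the default late policy."""
    if hours_late <= 0:
        return 0                          # on time: no deduction
    full_periods = int(hours_late // 24)  # completed 24-hour periods of lateness
    extra = max(0, full_periods - 2)      # only periods beyond the first two count
    return 5 + extra

# A few illustrative cases: 1 hour, 2 days, 3 days, and 5 days late.
for hours in (1, 48, 72, 120):
    print(hours, "hours late:", late_penalty(hours), "points deducted")
```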
Classroom safety and Covid-19
To help preserve our in-person learning environment, the university recommends the following.
Adhere to university mask guidance.
Vaccinations are widely available, free and not billed to health insurance. The vaccine will help protect against the transmission of the virus to others and reduce serious symptoms in those who are vaccinated.
Proactive Community Testing remains an important part of the university’s efforts to protect our community. Tests are fast and free.
Visit protect.utexas.edu for more information
Academic Dishonesty Policy
You are encouraged to discuss assignments with classmates. But all written work must be your own. Students caught cheating will automatically fail the course. If in doubt, ask the instructor.
Notice about students with disabilities
The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 471-6259.
Notice about missed work due to religious holy days
A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.
Emergency Evacuation Policy
Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside. Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building. Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class. In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office. Information regarding emergency evacuation routes and emergency procedures can be found at http://www.utexas.edu/emergency.
Senate Bill 212 and Title IX Reporting Requirements
Under Senate Bill 212 (SB 212), the professor and TAs for this course are required to report for further investigation any information concerning incidents of sexual harassment, sexual assault, dating violence, and stalking committed by or against a UT student or employee. Federal law and university policy also require reporting incidents of sex- and gender-based discrimination and sexual misconduct (collectively known as Title IX incidents). This means we cannot keep confidential information about any such incidents that you share with us. If you need to talk with someone who can maintain confidentiality, please contact University Health Services (512-471-4955 or 512-475-6877) or the UT Counseling and Mental Health Center (512-471-3515 or 512-471-2255). We strongly urge you to make use of these services for any needed support and to report any Title IX incidents to the Title IX Office.
Adapting the class format to deal with the ongoing pandemic
The first two weeks of class will be fully online.
After that, we move to the originally planned hybrid format, where some class sessions will be fully online, and others will be offered in person (though see next bullet point).
The schedule will list clearly which days are offered in person and which are fully online. All classes will be streamed on zoom, and students who prefer to take the class fully online will be able to do so.
Schedule
This schedule is subject to change.
Assignments are due at the end of the day on their due date. Please submit assignments online on Canvas unless the assignment tells you otherwise.
Readings can be done either before or after class (unless noted otherwise); they are chosen to support the material covered in class.
Week 1: Jan 18, 20: Introduction. This week completely online.
Tuesday: Introduction
Counting words to find out about people:
Google trends: word counting to get a sense of what people are interested in
Word watchers: word counting in politicians' speeches, and what that tells us about them
More political text analysis: variety of politicians' vocabulary
Counting words to find out about linguistic questions:
Thursday: Foundations of programming
Code for download: First steps in Python
Code for download: a notebook to check your installation
Code for download: how to use Jupyter notebooks
Week 2: Jan 25, 27: Exploring and visualizing data. This week completely online.
Tuesday: Exploring and visualizing data
We finish up the worksheet on First steps in Python
We then start on the next worksheet. Code for download: Exploring and visualizing data
We use this dataset.
Thursday: Exploring and visualizing data, continued
We continue using the worksheet on Exploring and visualizing data with this dataset.
Week 3: Feb 1, 3: Python basics. Tuesday session this week in person.
Tuesday: Python programming basics: conditions, lists, and loops
Code for download: Conditions, lists, and loops
Thursday: No class, inclement weather
Food for Thought 1 due
Week 4: Feb 8, 10: Python basics. Both sessions this week in person.
Tuesday: Conditions, lists, and loops, continued
We continue with the same notebook: Conditions, lists, and loops
Thursday: Programming basics: Dictionaries
Code for download: Dictionaries
Homework 1 due
Week 5: Feb 15, 17: Word counting. Both sessions this week in person.
Tuesday: We discuss your project ideas in class
Thursday: Counting words
We finish up the notebook on Dictionaries
We then go on to the notebook on making Pandas dataframes, which shows you how to transfer word counts into a Pandas dataframe
Week 6: Feb 22, 24: Tools for text processing
Tuesday: Finishing up the notebook on making Pandas dataframes
Thursday: Accessing text data, and tools for text processing
Code for download: Accessing text data, including a discussion of different writing systems, different sources
Code for download: text preprocessing
Initial project description due
Week 7: Mar 1, 3: More text processing, and Advanced analysis methods
Tuesday: Text processing, and taking text apart
Finishing up the previous notebook: text preprocessing
Then: Identifying important words
Code for download: Regular expressions
Thursday: Regular expressions
Code for download: Regular expressions
Homework 2 due
Week 8: Mar 8, 10: Descriptive statistics
Tuesday: Identifying important words, and starting on clustering for exploratory data analysis
Code for download: identifying important words
Code for download: Clustering for exploratory data analysis
Thursday: Clustering, continued
Food for Thought 2 due
Week 9: Spring Break
Week 10: Mar 22, 24: Hypothesis testing
Tuesday: Topic modeling
Code for download: Topic modeling
Thursday: Central tendency and spread; Probability distributions, and starting on Hypothesis Testing
Code for download: Central tendency and spread
Code for download: probability distributions in Python
Week 11: Mar 29, 31: Hypothesis testing
Tuesday: Hypothesis testing: The t-test
Code for download: Hypothesis testing
Code for download: the t-test
Thursday: Hypothesis testing: The t-test, continued
Code for download: the chi-squared test
Food for Thought 3 due
Week 12: Apr 5, 7: More programming
Tuesday: Python list comprehensions, and how to use them with Pandas
Code for download: Python list comprehensions
Code for download: error handling in Python
Thursday: defining your own functions, and structuring your programs
Project progress report due
Code for download: functions in Python
Week 13: Apr 12, 14: Correlation and regression
Tuesday: Correlation
Thursday: Linear regression. This session available as Panopto recording.
Code for download: linear regression
Code for download: more linear regression
Panopto recordings on Canvas: linear regression parts 1 and 2
Homework 3 due
Week 14: Apr 19, 21: Correlation and regression
Tuesday: Logistic regression. This session available as Panopto recording.
Code for download: logistic regression
Code for download: model comparison
Panopto recordings on Canvas: linear regression part 4 (there is no part 3, I miscounted); logistic regression parts 1 and 2
Thursday: Practicing regression
Food for Thought 4 due
Week 15: Apr 26, 28: Practicing regression
Tuesday: Practicing regression some more
Notebooks we'll be using: linear regression, more linear regression, logistic regression, model comparison
Code for download: Accessing multiple files in a directory (useful for your projects, I hope)
Thursday: Finishing up regression practice. Also: Python pandas: grouping, merging, appending frames
Code for download: more on Pandas
Homework 4 due
Week 16: May 3, 5: Project presentations
Tuesday: project presentations
11:00 Madison Verhalen
11:10 Dylan Moses and Nile Stewart and Will Sheffield
11:20 Andrea Cervantes and Mia Vargas
11:30 Alexandra Brown and Nick Umbrewicz
11:40 Miriam Montes and Chi Pham
11:50 Jaiden Nutt
Thursday: project presentations
11:00 Emily Luedke and Matthew Micyk
11:10 Nara Hafiz
11:20 Marquita Walker and Yating Wu
11:30 Paul Cho
11:40 Alexandra Fierro Morel and Sophie Brenner and Payton Wages
Final report due: May 11, end of day.
Course project information
Course project links
Here are links related to the course project ideas that you mentioned in class:
How words arise, change meaning, and get popularized on the internet: Here is a study on words in soccer fora on reddit: Del Tredici and Fernandez 2017.
The same authors then analyzed what made new coinages successful: Del Tredici and Fernandez 2018
Syntactic annotation across languages: the Universal Dependencies project, with tons of downloadable data
How do people with different political affiliations talk about the same topic, do they use different words? Here is a recent paper that looked at this by studying different terms for the same concept: Pavlick et al 2020. Warning: This paper goes into advanced computational methods, so for now focus just on the data they're looking at.
Author identification, characterizing authors' styles: For out-of-copyright books, the Project Gutenberg is an excellent source.
Studying dialects via geographic information on Twitter: Eisenstein 2015.
Here is a dataset with geotagged tweets: The UMass Global English on Twitter dataset
Language and ratings, for example: what words are used in reviews for expensive versus cheap wines? I'll dig up my old code and data, and will make those available if needed
Course project requirements
Course projects should be done by teams of 2 students. Project groups consisting of 1 or 3 students are possible only with prior approval of the instructor.
Initial project description
This is a 1-2 page document (single-spaced, single column) that describes what your project will be about. It needs to contain the following information:
Research questions: What are the main questions that you want to answer, the main language phenomena you want to address, or the main ideas you want to explore?
Method: What are the relevant words, multi-word expressions, or constructions you need to analyze? What descriptive data analyses do you plan to do? Do you plan to do statistical significance tests, and do you know already which ones will be the right ones? (Yes, I know you will not have worked out every detail at this point, but strive to work out as many as you can.)
Data: It is vital that you figure out as early as possible what data you can use. Is there enough data? Is it freely available? Do you have to contact someone to get it?
Intermediate report
This is a 1-2 page document (single-spaced, single column) that describes what the status of your project is at this point. This is a revised version of your initial project description. It needs to contain the following information:
Research questions: any changes?
Method: any changes?
Status:
Describe the data that was obtained: source, size, anything else that is relevant
Describe at least two (smaller, and preliminary) concrete results that you have at this point
You also need to take into account the feedback that you got on the Initial project description.
Short presentation
This is a short presentation to the class. You should discuss:
Research questions/linguistic phenomena/main questions you are addressing
Why is this relevant? (Spend a lot of time on the research questions and their relevance. Describing the big picture is important!)
Data: source, size (say how many words overall you have)
Results
You will need to prepare slides for this, which you submit to the instructor ahead of time.
It is okay if you don't have all results in place at this point. This does not lead to points being taken away for the presentation.
Final report
This is a 5-6 page document (single-spaced, single column) that describes the results of your project. This is a revised version of your intermediate project description. It needs to contain the following information:
Research questions/linguistic phenomena covered/main ideas pursued
Data: source, size, other relevant statistics
Method
Findings
If you build on previous work, you need to discuss it, and give references. Published papers (at conferences, in journals) go into the references list at the end of the paper. Links to blog posts and the like go in a footnote. Also, links to websites containing data go in a footnote, not in the references list.
You need to take into account the feedback that you got on the Initial project description and Intermediate report.
Course project ideas
Ideally, you pick a topic of your own that you are curious about. But to give you an idea of possible topics, here are a few pointers:
Some ideas from Language Log's breakfast experiments:
Which words are used to describe white and black NFL prospects? Links here, here (data for download in the 2nd link)
State of the Union: what are signature words of Obama, of earlier presidents? (And why?)
The statistics of real estate listings: linking real estate price to the language in the descriptions
Contrasting "almost" and "nearly": discussed here, here, here, and here
Author analysis: analyzing poems to detect who may have written them, and what characteristics they have
Online recipe collections: You could ask, for example, whether it is possible to predict the number of stars that a recipe will get from the recipe ingredients. See also Dan Jurafsky's language of food papers.
Noah Smith has a few nice datasets to analyze:
Movie corpus: predicting movie revenue from review texts
Congressional bill corpus: predicting whether a bill will survive from the text in the bill
Corporate reports corpus: predicting how well a company will do from the annual reports that it issues
Please discuss your topic with the instructor to make sure that it is both substantial and feasible.
For your course project, you will need to apply statistical analyses yourself. Google books n-gram charts, while pretty, do not count.
Links and additional readings
List of software we will use in the class
Python and Python packages:
We strongly recommend installing Anaconda, as that includes Python along with all Python packages we need.
If you install Anaconda, you will have to add gensim. Here is a tutorial on how to add a package to Anaconda: https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/#installing-a-package
Alternatively, you can individually install:
Python: https://www.python.org/downloads/ Any version >= 3.4 should be fine.
pandas: https://pandas.pydata.org/
numpy: https://numpy.org/
matplotlib: https://matplotlib.org/
Statsmodels: https://www.statsmodels.org/stable/index.html
NLTK: Installing NLTK itself: http://www.nltk.org/install.html You also need the NLTK data, see http://www.nltk.org/data.html
To test your Python installation, use this Jupyter notebook.
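If you would like a quick check outside of the notebook, the following sketch looks at the Python version and at whether the packages named above can be found. It is not the course's installation notebook. The function name check_installation is made up, and the sketch only checks that each package is present, not which version you have or whether the NLTK data has been downloaded.

```python
import sys
from importlib.util import find_spec

# Package names taken from the software list above.
REQUIRED_PACKAGES = ["pandas", "numpy", "matplotlib", "statsmodels", "nltk", "gensim"]

def check_installation(packages=REQUIRED_PACKAGES):
    """Report whether Python is recent enough and whether each package can be found."""
    report = {"python_ok": sys.version_info >= (3, 4)}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = find_spec(name) is not None
    return report

for item, ok in check_installation().items():
    print(item, "OK" if ok else "MISSING")
```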
Slack:
We are using Slack for in-class discussions. Please see Canvas for the link to the class Slack space.
Tips and tricks:
Learning Python:
How To Think Like A Computer Scientist is a very good and accessible online Python textbook.
(Caution: It uses Python 2 rather than Python 3. One main difference you will notice is that Python 2 writes print without parentheses, as in print "hello", where Python 3 requires print("hello").)
Jupyter notebooks:
Fun with statistics
Language Log: a language and linguistics blog written by Mark Liberman and others
Bad science: Ben Goldacre's blog with lots of illustrations of what not to do in statistics
xkcd: A webcomic of romance, sarcasm, math, and language.