Fall 2021 | Instructor: Katrin Erk | MWF 10-11 | RLP 1.108
How can we describe the meaning of words and sentences in such a way that we can process them automatically? That seems a huge task. There are so many words, all with their individual nuances of meaning -- do we have to define them all by hand? And there are so many things we want to do with sentences: Translate them. Answer questions. Extract important pieces of information. Figure out people's opinions. Can we even use one single meaning description to do all these tasks?
In this course, we discuss methods for automatically learning what words mean (at least to some extent) from huge amounts of text -- for example, from all the text that people have made available on the web. And we discuss ways of representing the meaning of words and sentences in such a way that we can use them in language technology tasks.
We will look at two different kinds of meaning representations. Distributional representations, also called embeddings, characterize the meaning of a word or passage as an object in a "meaning space" that is learned automatically from data, in such a way that words with similar meanings will be close together in space. Logical representations translate sentences into formal representations of the things, people and events that are mentioned and the connections between them, so that we can automatically reason with them and draw conclusions.
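To make the idea of words being "close together in space" concrete, here is a minimal sketch. The three-dimensional vectors are hand-made, hypothetical values (a real meaning space is learned from large amounts of text and has many more dimensions), and cosine similarity is used here as one common way to measure closeness.

```python
import math

# Hand-made toy vectors: hypothetical values for illustration only,
# not learned from any corpus.
toy_space = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: near 1 for vectors pointing the same way, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# In this toy space, "dog" is closer to "cat" than to "car".
print("dog ~ cat:", round(cosine(toy_space["dog"], toy_space["cat"]), 3))
print("dog ~ car:", round(cosine(toy_space["dog"], toy_space["car"]), 3))
```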
Prerequisites: Upper-division standing. LIN350 Analyzing Linguistic Data, or a different introduction to programming, or consent of instructor.
Readings will be made available for download from the course website.
Flags: Quantitative Reasoning, Independent Inquiry
Instructor and TA contact information
Syllabus and text
This page serves as the syllabus for this course.
There is no textbook for this course.
Readings will be made available for download from the course website, in the Schedule section.
Semantics is a very active area of computational linguistics -- but also a very diverse one. People work on word sense, semantic roles, selectional preferences, logic-based semantics, as well as on many semantics-related tasks and task-specific semantic representations. But there are problems that come up again and again in different tasks, and representation ideas that come up again and again in different variants. In this course, we focus on two influential classes of representations: structured (logic-based) semantics and distributional semantics, and on central phenomena that they address.
This course focuses on two frameworks in semantics, distributional models and logic-based semantics. Topics include:
A detailed schedule for the course, with topics for each lecture, is available in the Schedule section.
This course carries the Quantitative Reasoning flag. Quantitative Reasoning courses are designed to equip you with skills that are necessary for understanding the types of quantitative arguments you will regularly encounter in your adult and professional life. You should therefore expect a substantial portion of your grade to come from your use of quantitative skills to analyze real-world problems.
This course also carries the Independent Inquiry flag. Independent Inquiry courses are designed to engage you in the process of inquiry over the course of a semester, providing you with the opportunity for independent investigation of a question, problem, or project related to your major. You should therefore expect a substantial portion of your grade to come from the independent investigation and presentation of your own work. See the Course Project section for details.
Course requirements and grading policy
Course projects should be done by teams of 2 students. Projects done by 1 or 3 students are only possible with prior approval of the instructor.
Project presentations will be in the final week of classes, in the order given on the schedule page (which will be generated via Python's random.shuffle()). If possible, all members of a project team should get some time to speak.
Assignments will be updated on Canvas. There will be 4 assignments. A tentative schedule for the entire semester is posted in the Schedule section. Readings may change up to one week in advance of their due dates.
This course does not have a midterm or final exam.
Options for course projects and more details on the project requirements are listed in the Project section.
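For the curious, the random presentation order mentioned above could be produced along the following lines. This is only a sketch: the team names and the `presentation_order` helper are hypothetical, and the optional seed is an assumption added here so that the order can be reproduced.

```python
import random

def presentation_order(teams, seed=None):
    """Return a shuffled copy of the team list (the input list is left unchanged)."""
    rng = random.Random(seed)  # seed is optional; None gives a different order each run
    order = list(teams)
    rng.shuffle(order)         # in-place shuffle of the copy
    return order

# Hypothetical team names, for illustration only.
teams = ["Team A", "Team B", "Team C", "Team D"]
for slot, team in enumerate(presentation_order(teams), start=1):
    print(slot, team)
```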
The course will use plus-minus grading, using the following scale (showing Grade and Percentage):
A >= 93%
A- >= 90%
B+ >= 87%
B >= 83%
B- >= 80%
C+ >= 77%
C >= 73%
C- >= 70%
D+ >= 67%
D >= 63%
D- >= 60%
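As a quick illustration, the scale above can be read as a lookup from percentage to letter grade. This is a sketch, not the official grade computation: the assumption that anything below 60% maps to F is not stated in the scale, and no rounding is applied because the scale does not say how fractional percentages are treated.

```python
# Cutoffs copied from the scale above, highest first.
GRADE_SCALE = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]

def letter_grade(percentage):
    """Return the first letter whose cutoff the percentage meets."""
    for cutoff, letter in GRADE_SCALE:
        if percentage >= cutoff:
            return letter
    return "F"  # assumption: the scale above does not list a grade below 60%

print(letter_grade(88.5))
```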
Attendance is not required. However, given that we will do a lot of hands-on exercises in class, and the homework assignments and the project address the material covered in class, good attendance is essential for doing well in this class.
If you turn in your assignment late and we have not agreed on an extension beforehand, expect points to be deducted. Extensions will be considered on a case-by-case basis. I urge you to let me know if you are in need of an extension, so that we can make sure that you get the time necessary to complete the assignments.
If an extension has not been agreed on beforehand, then for assignments, by default, 5 points (out of 100) will be deducted for lateness, plus an additional 1 point for every 24-hour period beyond 2 that the assignment is late.
Note that there are always some points to be had, even if you turn in your assignment late. The last day in the semester on which the class meets is the last day to turn in late assignments for grading. Homework assignments submitted after that date will not be graded.
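The default late deduction described above can be sketched as follows. One detail is an assumption: this sketch counts every started 24-hour period (rounding up), so an assignment that is 49 hours late counts as 3 periods. Counting only completed periods is the other possible reading; the wording of the policy, not this sketch, is what applies. The function name `late_penalty` is hypothetical.

```python
import math

def late_penalty(hours_late):
    """Points (out of 100) deducted by default when no extension was agreed on."""
    if hours_late <= 0:
        return 0  # on time: no deduction
    # Assumption: every started 24-hour period counts as one period.
    periods = math.ceil(hours_late / 24)
    # Flat 5 points for lateness, plus 1 point per period beyond the first 2.
    return 5 + max(0, periods - 2)

for hours in (1, 48, 49, 96):
    print(hours, "hours late:", late_penalty(hours), "points deducted")
```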
Classroom safety and Covid-19
To help preserve our in-person learning environment, the university recommends the following.
Academic Dishonesty Policy
You are encouraged to discuss assignments with classmates. But all written work must be your own. Students caught cheating will automatically fail the course. If in doubt, ask the instructor.
Notice about students with disabilities
The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 471-6259.
Notice about missed work due to religious holy days
A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.
Emergency Evacuation Policy
Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside. Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building. Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class. In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office. Information regarding emergency evacuation routes and emergency procedures can be found at http://www.utexas.edu/emergency.
Senate Bill 212 and Title IX Reporting Requirements
Under Senate Bill 212 (SB 212), the professor and TAs for this course are required to report for further investigation any information concerning incidents of sexual harassment, sexual assault, dating violence, and stalking committed by or against a UT student or employee. Federal law and university policy also require reporting incidents of sex- and gender-based discrimination and sexual misconduct (collectively known as Title IX incidents). This means we cannot keep confidential information about any such incidents that you share with us. If you need to talk with someone who can maintain confidentiality, please contact University Health Services (512-471-4955 or 512-475-6877) or the UT Counseling and Mental Health Center (512-471-3515 or 512-471-2255). We strongly urge you to make use of these services for any needed support and to report any Title IX incidents to the Title IX Office.
Adapting the class format to deal with the ongoing pandemic
Here is the plan as of September 15, 2021:
The first three weeks of class will be fully online. We meet on zoom. The links are on the class Canvas page. Email me if you cannot access it.
The following two weeks, Wednesday classes are in person, on September 22 and September 29. Monday and Friday classes are on zoom.
Starting October 4, all classes will be offered in person. All in-person classes will be zoom-streamed and zoom-recorded. You can either come in person or participate via zoom; either way is fine. We'll continue to make class zoom recordings available.
This schedule is subject to change.
Assignments are due at the end of the day on their due date. Please submit assignments online on Canvas unless the assignment tells you otherwise.
Readings can be done either before or after class (unless noted otherwise); they are chosen to support the material covered in class.
Week 1: Aug 25, Aug 27
Week 2: Aug 30, Sep 1, Sep 3
Wednesday: Pre-computed meaning spaces, and how to use them
Friday: What can I do for my course project?
Week 3: Sep 6, 8, 10
Week 4: Sep 13, 15, 17
Monday: How to make a count-based space and use it
Wednesday: Continuing with count-based spaces: using matrix methods for efficiency, and doing dimensionality reduction
Friday: Towards prediction-based spaces. Step 1: Classification
Class on zoom.
Week 5: Sep 20, 22, 24
Monday: Towards prediction-based spaces. Step 2: Logistic regression
Class on zoom.
Wednesday: Towards prediction-based spaces. Step 3: Word2vec, and how to easily make a word2vec space using gensim
Class is in person, in RLP 1.108. It is also zoom-streamed and recorded. Come in person or stay on zoom -- you decide.
Friday: Under the hood: a neural net that will compute a prediction-based space.
Class on zoom.
Week 6: Sep 27, Sep 29, Oct 1
Monday: We talk in class about your project ideas.
Class on zoom.
Wednesday: Making your own neural net, continued
Class is in person, in RLP 1.108. It is also zoom-streamed and recorded. Come in person or stay on zoom -- you decide.
Friday: Embeddings as components in NLP systems
Class on zoom.
Week 7: Oct 4, 6, 8
Monday: Language models. Embeddings for individual occurrences of a word: BERT and friends
Today and going forward, class is in person, in RLP 1.108. It is also zoom-streamed and recorded. Come in person or stay on zoom -- you decide.
Wednesday: Topic modeling: automatically detecting underlying themes in documents
Friday: Topic modeling continued
Week 8: Oct 11, 13, 15
Monday: An introduction to logic
Wednesday: Propositional logic
Friday: First-order logic
Week 9: Oct 18, 20, 22
Monday: Logic and automatic inference
Wednesday: "Translating" natural language sentences to logic
Friday: Translating natural language to logic, continued
Week 10: Oct 25, 27, 29
Monday: Semantics construction: automatically constructing a logical representation for a sentence
Wednesday: Semantics construction and lambda calculus: lego for logic
Friday: Semantics construction in practice with the Natural Language Toolkit
Week 11: Nov 1, 3, 5
Monday: More semantics construction in practice
Wednesday: Structured meaning representations: Abstract Meaning Representations
Friday: Tasks in structured meaning representation: word sense disambiguation and ontologies
Week 12: Nov 8, 10, 12
Monday: Tasks in structured meaning representation: semantic role labeling
Wednesday: Tasks in structured meaning representation: event representation
Friday: Tasks in structured meaning representation: coreference
Week 13: Nov 15, 17, 19
Monday: Structured meaning representations: the Groningen MeaningBank
Wednesday: Information Extraction, and Open Information Extraction
Friday: Knowledge graphs
Week 14: Nov 22
Week 15: Nov 29, Dec 1, Dec 3
Monday: Knowledge graphs and embeddings
Wednesday: Knowledge graphs and embeddings, continued
Friday: Project presentations
Week 16: Dec 6
Final report due: tba
Course projects should be done by teams of 2 students. Project groups consisting of 1 or 3 students are possible only with prior approval of the instructor.
Course project requirements
Initial project description
This is a 1-2 page document (single-spaced, single column) that describes what your project will be about. It needs to contain the following information:
Research questions: What are the main questions that you want to answer, the main language phenomena you want to address, or the main ideas you want to explore?
Method: What distributional model will you use, or what kinds of rules are you planning to state? Be as detailed as you can. (Yes, I know you will not have worked out every detail at this point, but strive to work out as many as you can.)
Data: If you do a distributional project, it is vital that you figure out as early as possible what data you can use to learn your model. Is there enough data? Is it freely available? Do you have to contact someone to get it?
This is a 1-2 page document (single-spaced, single column) that describes what the status of your project is at this point. This is a revised version of your initial project description. It needs to contain the following information:
You also need to take into account the feedback that you got on the Initial project description.
This is a short presentation to the class. You should discuss:
Research questions/linguistic phenomena/main ideas you wanted to model
Why is this relevant? (Spend a lot of time on the research questions and their relevance. Describing the big picture is important!)
Data, if you are using a data-driven approach: source, size
You will need to prepare slides for this, which you submit to the instructor ahead of time.
This is a 5-6 page document (single-spaced, single column) that describes the results of your project. This is a revised version of your intermediate project description. It needs to contain the following information:
Research questions/linguistic phenomena covered/main ideas pursued
Data: source, size, other relevant statistics
If you build on previous work, you need to discuss it, and give references. Published papers (at conferences, in journals) go into the references list at the end of the paper. Links to blog posts and the like go in a footnote. Also, links to websites containing data go in a footnote, not in the references list.
You need to take into account the feedback that you got on the Initial project description and Intermediate report.
Context-based vectors/embeddings for words
Comparing general and specific terms (hyponyms and hypernyms) in vector spaces
Exploring prejudice in vector spaces, and possibly removing it
Exploring analogy reasoning in vector spaces
Make vectors for occurrences of words, and group (cluster) them into senses
What clusters of words (clustered by vector representations) are used a lot in a politician's speech, or in top-10 songs?
Comparing general and specific words (like "animal" versus "dog") in vector spaces: can you detect which specific words go with which general words? How well does this work in different spaces?
Compute your own:
How do people use emojis? That is, what are the context vectors of emojis?
Compute vector representations from two different time periods: How have word meanings changed? Or, how has the discourse/use around the words changed?
Compute vector representations from two different corpus collections, and do the same kind of analysis
Topic modeling for documents
Structured meaning representations
Build a system for automatic word sense disambiguation or semantic role labeling using machine learning
Build a system that automatically identifies events in text, using a tool that gives you the syntactic structure of a sentence and using rules that identify events in that syntactic structure
Build a system that automatically identifies medication names, or illness names, in medical texts
Links and additional readings
Tutorials and texts about distributional models
Pre-computed spaces ready for use
Links that may give you ideas for course projects
SemEval, a series of workshops on semantics-related tasks: They come up with 10-ish new tasks every year, and they offer data for them, so this may be an easy way to get data
Automatically getting a structured meaning representation: See the Semantic Role Labeling demo and the Open Information Extraction demo from AllenNLP.
How can we use computational tools to answer linguistic questions? For ideas from what others have done (in projects much too large for this course!), see the Society for Computation in Linguistics conferences
Logic-based semantics in the Natural Language Toolkit
Some freely available corpora
Data that can be used to build distributional models:
The WaCKy corpora, including UKWaC (English web text, 2B words), Wackypedia (an English Wikipedia dump, 2B words), web corpora for French, German, and Italian. Ask me about a parsed version of UKWaC and Wackypedia if you need syntactic analysis.
Structured semantic annotation:
Systems and online demos for logic-based semantic analysis
Additional readings about logic-based computational semantics
An in-depth overview of everything:
Practical guides to building logic-based semantics:
Focusing on the theory: