LIN350 Computational semantics
Fall 2021 | Instructor: Katrin Erk | MWF 10-11 | RLP 1.108
How can we describe the meaning of words and sentences in such a way that we can process them automatically? That seems a huge task. There are so many words, all with their individual nuances of meaning -- do we have to define them all by hand? And there are so many things we want to do with sentences: Translate them. Answer questions. Extract important pieces of information. Figure out people's opinions. Can we even use one single meaning description to do all these tasks?
In this course, we discuss methods for automatically learning what words mean (at least to some extent) from huge amounts of text -- for example, from all the text that people have made available on the web. And we discuss ways of representing the meaning of words and sentences in such a way that we can use them in language technology tasks.
We will look at two different kinds of meaning representations. Distributional representations, also called embeddings, characterize the meaning of a word or passage as an object in a "meaning space" that is learned automatically from data, in such a way that words with similar meanings will be close together in space. Logical representations translate sentences into a formal representation of the things, people, and events that are mentioned and the connections between them, so that we can automatically reason with them and draw conclusions.
Prerequisites: Upper-division standing. LIN350 Analyzing Linguistic Data, or a different introduction to programming, or consent of instructor.
Readings will be made available for download from the course website.
Flags: Quantitative Reasoning, Independent Inquiry
Course: Computational semantics, LIN350 - 40615
Semester: Fall 2021
Course page: http://www.katrinerk.com/courses/computational-semantics-undergraduate
Course times: MWF 10-11
Course location: RLP 2.210
Course on Canvas: https://utexas.instructure.com/courses/1311336
Instructor and TA contact information
Instructor: Katrin Erk
Teaching Assistant: Hongli Zhan
Office hours: 2:30pm-4pm, Mondays and Fridays. Office hours will be via Zoom, the link is on Canvas.
email: honglizhan at utexas dot edu
Syllabus and text
This page serves as the syllabus for this course.
There is no textbook for this course.
Readings will be made available for download from the course website, in the Schedule section.
Semantics is a very active area of computational linguistics -- but also a very diverse one. People work on word sense, semantic roles, selectional preferences, logic-based semantics, as well as on many semantics-related tasks and task-specific semantic representations. But there are problems that come up again and again in different tasks, and representation ideas that come up again and again in different variants. In this course, we focus on two influential classes of representations: structured (logic-based) semantics and distributional semantics, and on central phenomena that they address.
This course focuses on two frameworks in semantics, distributional models and logic-based semantics. Topics include:
Embeddings / Distributional representations:
Using distributional representations to analyze similarity in meaning
Making embeddings by counting words or using neural networks
Logic-based semantics:
Translating natural language sentences to logic
Automatically constructing the logical representation of a sentence: semantics construction
Structured semantic representations more generally:
Tasks: word sense disambiguation, semantic role labeling, coreference, and so on
Variants of structured representation
Knowledge graphs, and their integration with embeddings
A detailed schedule for the course, with topics for each lecture, is available in the Schedule section.
This course carries the Quantitative Reasoning flag. Quantitative Reasoning courses are designed to equip you with skills that are necessary for understanding the types of quantitative arguments you will regularly encounter in your adult and professional life. You should therefore expect a substantial portion of your grade to come from your use of quantitative skills to analyze real-world problems.
This course also carries the Independent Inquiry flag. Independent Inquiry courses are designed to engage you in the process of inquiry over the course of a semester, providing you with the opportunity for independent investigation of a question, problem, or project related to your major. You should therefore expect a substantial portion of your grade to come from the independent investigation and presentation of your own work. See the Course Project section for details.
Course requirements and grading policy
Assignments: 60% (4 assignments, 15% each)
Initial project description: 5%
Intermediate project report: 10%
Course presentation: 5%
Final report: 20%
Course projects should be done by teams of 2 students. Projects done by 1 or 3 students are only possible with prior approval of the instructor.
Project presentations will be in the final week of classes, in the order given on the schedule page (which will be generated via Python's random.shuffle()). If possible, all members of a project team should get some time to speak.
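As a preview of how that draw works, here is a minimal sketch (the team names are placeholders, not the actual project teams):

```python
import random

# Hypothetical team list; the actual teams come from the project sign-ups.
teams = ["Team A", "Team B", "Team C", "Team D"]

random.seed(2021)      # fixing a seed makes the draw reproducible
random.shuffle(teams)  # shuffles the list in place, uniformly at random
print(teams)
```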
Assignments will be posted on Canvas. There will be 4 assignments. A tentative schedule for the entire semester is posted in the Schedule section. Readings may change up to one week in advance of their due dates.
This course does not have a midterm or final exam.
Options for course projects, and more details on the project requirements are listed in the Project section.
The course will use plus-minus grading, using the following scale (showing Grade and Percentage):
A >= 93%
A- >= 90%
B+ >= 87%
B >= 83%
B- >= 80%
C+ >= 77%
C >= 73%
C- >= 70%
D+ >= 67%
D >= 63%
D- >= 60%
Attendance is not required. However, given that we will do a lot of hands-on exercises in class, and the homework assignments and the project address the material covered in class, good attendance is essential for doing well in this class.
If you turn in your assignment late and we have not agreed on an extension beforehand, expect points to be deducted. Extensions will be considered on a case-by-case basis. I urge you to let me know if you are in need of an extension, such that we can make sure that you get the time necessary to complete the assignments.
If an extension has not been agreed on beforehand, then for assignments, by default, 5 points (out of 100) will be deducted for lateness, plus an additional point for each 24-hour period beyond the second that the assignment is late.
Note that there are always some points to be had, even if you turn in your assignment late. The last day in the semester on which the class meets is the last day to turn in late assignments for grading. Homework assignments submitted after that date will not be graded.
Classroom safety and Covid-19
To help preserve our in person learning environment, the university recommends the following.
Adhere to university mask guidance.
Vaccinations are widely available, free and not billed to health insurance. The vaccine will help protect against the transmission of the virus to others and reduce serious symptoms in those who are vaccinated.
Proactive Community Testing remains an important part of the university’s efforts to protect our community. Tests are fast and free.
Visit protect.utexas.edu for more information
Academic Dishonesty Policy
You are encouraged to discuss assignments with classmates. But all written work must be your own. Students caught cheating will automatically fail the course. If in doubt, ask the instructor.
Notice about students with disabilities
The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 471-6259.
Notice about missed work due to religious holy days
A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.
Emergency Evacuation Policy
Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside. Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building. Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class. In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office. Information regarding emergency evacuation routes and emergency procedures can be found at http://www.utexas.edu/emergency.
Senate Bill 212 and Title IX Reporting Requirements
Under Senate Bill 212 (SB 212), the professor and TAs for this course are required to report for further investigation any information concerning incidents of sexual harassment, sexual assault, dating violence, and stalking committed by or against a UT student or employee. Federal law and university policy also require reporting incidents of sex- and gender-based discrimination and sexual misconduct (collectively known as Title IX incidents). This means we cannot keep confidential information about any such incidents that you share with us. If you need to talk with someone who can maintain confidentiality, please contact University Health Services (512-471-4955 or 512-475-6877) or the UT Counseling and Mental Health Center (512-471-3515 or 512-471-2255). We strongly urge you make use of these services for any needed support and that you report any Title IX incidents to the Title IX Office.
Adapting the class format to deal with the ongoing pandemic
Here is the plan as of September 15, 2021:
The first three weeks of class will be fully online. We meet on zoom. The links are on the class Canvas page. Email me if you cannot access it.
The following two weeks, Wednesday classes are in person, on September 22 and September 29. Monday and Friday classes are on zoom.
Starting October 4, all classes will be offered in person. All in-person classes will be zoom-streamed and zoom-recorded. You can either come in person or participate via zoom, either way is fine. We'll continue to make class zoom recordings available.
This schedule is subject to change.
Assignments are due at the end of the day on their due date. Please submit assignments online on Canvas unless the assignment tells you otherwise.
Readings can be done either before or after class (unless noted otherwise); they are chosen to support the material covered in class.
Week 1: Aug 25, Aug 27
Wednesday: Computational semantics: an overview
Friday: Meaning as a space in which you can walk from word to word: an introduction
Week 2: Aug 30, Sep1, Sep 3
Monday: Continuing with the introduction to meaning spaces
Slides (same link as Friday)
Software we will use for class:
We strongly recommend installing Anaconda, as that includes Python along with almost all Python packages we need.
If you install Anaconda, you will have to add gensim. Here is a tutorial on how to add a package to Anaconda: https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/#installing-a-package
Please do this, and choose to add gensim.
Alternatively, you can individually install:
Python: https://www.python.org/downloads/ Any version >= 3.4 should be fine.
NLTK: Installing NLTK itself: http://www.nltk.org/install.html You also need the NLTK data, see http://www.nltk.org/data.html
Wednesday: Pre-computed meaning spaces, and how to use them
For pre-computed meaning spaces, see links below under "Links"
Friday: What can I do for my course project?
Week 3: Sep 6, 8, 10
Monday: Labor day
Wednesday: Using distributional spaces in cognition and lexical semantics
Slides: also use these to get ideas for your course projects
Friday: How to make a count-based space
We'll also continue using these slides
An accessible introduction to count-based models is in Section 2 of this paper (but only that section)
Supporting material: Jurafsky and Martin ed. 3 ch. 6 up to 6.7
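As a preview, the core counting step behind a count-based space can be sketched in a few lines of Python (a toy illustration, not the class notebook):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each target word co-occurs with each context word
    within +/- `window` positions: the raw material of a count-based space."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, target in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][sent[j]] += 1
    return counts

sentences = [["the", "dog", "barks"], ["the", "cat", "meows"]]
counts = cooccurrence_counts(sentences)
print(counts["dog"]["barks"])  # → 1
```

Each row of `counts` is the (sparse) context vector of one word; on real data these counts would then be reweighted, for example with PPMI.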
Week 4: Sep 13, 15, 17
Monday: How to make a count-based space and use it
We'll continue using this Jupyter notebook
Wednesday: Continuing with count-based spaces: using matrix methods for efficiency, and doing dimensionality reduction
Homework 1 due.
We continue to use this Jupyter notebook and these slides
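The dimensionality-reduction step can be sketched with a truncated SVD in numpy (a toy matrix stands in for real co-occurrence counts):

```python
import numpy as np

# Toy word-by-context count matrix (rows: words, columns: contexts).
# In practice this comes from co-occurrence counts over a large corpus.
M = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])

# Truncated SVD: keep the top k singular dimensions as dense embeddings.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
embeddings = U[:, :k] * s[:k]  # each row is now a k-dimensional word vector

print(embeddings.shape)  # → (3, 2)
```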
Friday: Towards prediction-based spaces. Step 1: Classification
Class on zoom.
Week 5: Sep 20, 22, 24
Monday: Towards prediction-based spaces. Step 2: Logistic regression
Class on zoom.
Readings: Jurafsky and Martin ed. 3, chapter 5. This goes into much more detail than we did in class. You don't need to read up on details that we did not cover, but you can if you are curious.
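The gradient-descent training loop for logistic regression can be sketched in numpy (a toy dataset, here learning a logical AND of two features; not code from class):

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])  # label is 1 only when both features are 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.5

for _ in range(5000):
    p = sigmoid(X @ w + b)           # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of average cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print((sigmoid(X @ w + b) > 0.5).astype(int))  # → [0 0 0 1]
```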
Wednesday: Towards prediction-based spaces. Step 3: Word2vec, and how to easily make a word2vec space using gensim
Class is in person, in RLP 1.108. It is also zoom-streamed and recorded. Come in person or stay on zoom -- you decide.
Readings: Jurafsky and Martin ed. 3, chapter 6, section 6.8. This goes into more detail than we did in class. You don't need to read up on details that we did not cover, but you can if you are curious.
Friday: Under the hood: a neural net that will compute a prediction-based space.
Class on zoom.
Week 6: Sep 27, Sep 29, Oct 1
Monday: We talk in class about your project ideas.
Class on zoom.
Wednesday: Making your own neural net, continued: deeper networks.
Class is in person, in RLP 1.108. It is also zoom-streamed and recorded. Come in person or stay on zoom -- you decide.
Friday: Making your own neural net, continued: a prediction-based space.
Class on zoom.
Week 7: Oct 4, 6, 8
Monday: Language models. Embeddings for individual occurrences of a word: BERT and friends
Today and going forward, class is in person, in RLP 1.108. It is also zoom-streamed and recorded. Come in person or stay on zoom -- you decide.
Wednesday: Vector arithmetic: How to compute directions in space that indicate interesting "semantic directions", for example gender
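The idea of a semantic direction can be illustrated in a few lines of numpy. The 2-d vectors below are hand-made for illustration; real experiments use learned embeddings:

```python
import numpy as np

# Made-up toy vectors, not learned embeddings.
vecs = {
    "man":   np.array([ 1.0, 0.2]),
    "woman": np.array([-1.0, 0.2]),
    "king":  np.array([ 0.9, 0.9]),
    "queen": np.array([-0.9, 0.9]),
}

# The "gender direction": the difference between a paired word and its counterpart.
gender = vecs["man"] - vecs["woman"]
gender /= np.linalg.norm(gender)

# Projecting other words onto the direction: the sign shows which end they lean to.
for word in ["king", "queen"]:
    print(word, float(vecs[word] @ gender))
```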
Friday: Topic modeling: automatically detecting underlying themes in documents
Week 8: Oct 11, 13, 15
Monday: Structured meaning representations: word senses and ontologies
Additional reading: Jurafsky and Martin 3rd edition, Word Senses and WordNet
Wednesday: Structured meaning representations: semantic role labeling
Additional reading: Jurafsky and Martin 3rd edition, Semantic Role Labeling and Argument Structure
Friday: Structured meaning representations: Semantic roles continued. Then, coreference
Additional reading: Jurafsky and Martin 3rd edition, Coreference resolution
Week 9: Oct 18, 20, 22
Monday: Structured meaning representations: Events: arguments, subevents, coreference
Additional reading: slideset on Canvas
Wednesday: An introduction to logic
Initial project description due.
Material to use in class: a logic puzzle. Also, logical formalizations -- don't look at them before class: the logic puzzle with logical formalization, the same as a text file, and a Python script that tries to derive the solution of the logic puzzle.
Friday: Propositional logic
(Note that these tutorials use a slightly different notation.)
or, if you want an in-depth textbook, check out L.T.F. Gamut volume 1. The introduction to propositional logic starts on page 28.
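The truth-table method for propositional logic is easy to sketch in Python, writing formulas as functions over truth values (an illustration, not course software):

```python
from itertools import product

def is_tautology(formula, variables):
    """Check whether `formula` is true under every truth-value assignment."""
    return all(formula(*assignment)
               for assignment in product([False, True], repeat=len(variables)))

# The conditional p -> q is written as (not p) or q.
implies = lambda p, q: (not p) or q

# Modus ponens as a formula: ((p -> q) and p) -> q.
modus_ponens = lambda p, q: implies(implies(p, q) and p, q)

print(is_tautology(modus_ponens, ["p", "q"]))  # → True
print(is_tautology(implies, ["p", "q"]))       # → False: p -> q can be false
```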
Week 10: Oct 25, 27, 29
Monday: Logic and automatic inference
Here is an online demo of Robinson Resolution: https://logictools.org/prop.html
To use it, choose "Propositional logic" from the tabs at the top. Then in the first row of choices below the example window, choose using: "resolution:naive" showing "html trace". Then hit the "Solve" button.
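The single resolution step behind the demo can be sketched in Python, with clauses as frozensets of literals and "~" marking negation (a sketch of the idea, not a full prover):

```python
def negate(lit):
    """Flip a literal: p <-> ~p."""
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """Return all resolvents of two clauses: for each literal in c1 whose
    negation is in c2, merge the clauses minus the complementary pair."""
    resolvents = []
    for lit in c1:
        if negate(lit) in c2:
            resolvents.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return resolvents

# From {p, q} and {~p, r}, resolution derives the clause {q, r}.
c1 = frozenset({"p", "q"})
c2 = frozenset({"~p", "r"})
print(resolve(c1, c2))
```

A full resolution prover repeats this step until it derives the empty clause (a contradiction) or runs out of new clauses.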
Wednesday: First-order logic
For an extended introduction, again look at L.T.F. Gamut volume 1. The discussion of first-order logic starts on page 65.
Friday: "Translating" natural language sentences to logic
Please only look at this after class: solutions to the first and second set of sentences
Week 11: Nov 1, 3, 5
Monday: Translating natural language to logic, continued
Wednesday: Translating natural language to logic, continued
Homework 3 due.
Friday: Translating natural language to logic, continued
Week 12: Nov 8, 10, 12
Monday: Semantics construction: automatically constructing a logical representation for a sentence. Lambda calculus: lego for semantics construction
Wednesday: More lambda calculus. We go through some examples together.
Intermediate project report due.
Friday: Final piece of lambda calculus: transitive verbs. Then: Semantics construction in practice with the Natural Language Toolkit
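Beta reduction, the engine of semantics construction, is directly available in NLTK's logic module. A minimal example (assuming NLTK is installed, as described in the software section):

```python
from nltk.sem.logic import Expression

read = Expression.fromstring

# The meaning of the verb phrase "sees Mary" is a lambda term waiting
# for a subject; applying it to "John" and simplifying performs
# beta reduction.
vp = read(r'\x.see(x,mary)')
sentence = vp.applyto(read('john')).simplify()
print(sentence)  # → see(john,mary)
```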
Week 13: Nov 15, 17, 19
Monday: Semantics construction in practice with the Natural Language Toolkit, continued
Wednesday: Back to structured meaning representations: Abstract Meaning Representations, and the Groningen Meaning Bank
We'll look at these examples:
Friday: More on structured meaning representations.
Week 14: Nov 22
Monday: Knowledge graphs
Homework 4 due.
Wednesday, Friday: Thanksgiving break
Week 15: Nov 29, Dec 1, Dec 3
Monday: Knowledge graphs
Wednesday: Knowledge graphs: construction through Information Extraction, extension through link prediction (using embeddings of graph nodes and edges!)
J&M 3rd edition, chapter on information extraction: examples on pages 2-4, algorithms for relation extraction after that (this goes far beyond what we discuss in class)
TransE: the original paper is here (sorry, this is rough reading. I couldn't find any decent tutorial)
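The core idea of TransE fits in a few lines: a relation is modeled as a translation vector, so for a true triple (head, relation, tail) we want head + relation ≈ tail. The 2-d vectors below are hand-made for illustration; TransE learns them from a knowledge graph:

```python
import numpy as np

# Made-up toy embeddings, not learned ones.
emb = {
    "austin":     np.array([0.0, 1.0]),
    "texas":      np.array([1.0, 1.0]),
    "paris":      np.array([0.0, 5.0]),
    "capital_of": np.array([1.0, 0.0]),
}

def score(head, rel, tail):
    """TransE score: distance between head + rel and tail. Lower is better."""
    return float(np.linalg.norm(emb[head] + emb[rel] - emb[tail]))

print(score("austin", "capital_of", "texas"))  # small: a plausible triple
print(score("paris", "capital_of", "texas"))   # large: an implausible triple
```

Link prediction then amounts to asking: which tail entity gets the lowest score for a given head and relation?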
Friday: Project presentations
10:00 John Steinman
10:07 Aubrey Hinchman and Grey Sandstrum
10:14 Sebastian Mancha and Grace Huang
10:21 Sydney Willett and Kabin Moon
10:28 Riddhi Bhave and Pooja Chivukula
Week 16: Dec 6
Monday: Project presentations, and some final words
10:00 Alyssa Cantu
10:07 Matthew Pabst
10:14 Fei Guo
10:21 Katrina Gavura and Nicolette Warren
10:28 Misty Peng and Manaasa Darisi
10:35 Vittoria Byland
Final report due: Friday December 10, end of day
Course projects should be done by teams of 2 students. Project groups consisting of 1 or 3 students are possible only with prior approval of the instructor.
Course project requirements
Initial project description
This is a 1-2 page document (single-spaced, single column) that describes what your project will be about. It needs to contain the following information:
Research questions: What are the main questions that you want to answer, the main language phenomena you want to address, or the main ideas you want to explore?
Method: What distributional model will you use, or what kinds of rules are you planning to state? Be as detailed as you can. (Yes, I know you will not have worked out every detail at this point, but strive to work out as many as you can.)
Data: If you do a distributional project, it is vital that you figure out as early as possible what data you can use to learn your model. Is there enough data? Is it freely available? Do you have to contact someone to get it?
Intermediate project report
This is a 1-2 page document (single-spaced, single column) that describes the status of your project at this point. This is a revised version of your initial project description. It needs to contain the following information:
Research questions: any changes?
Method: any changes?
Describe the data that was obtained: source, size, anything else that is relevant
Describe at least two (smaller, and preliminary) concrete results that you have at this point
You also need to take into account the feedback that you got on the Initial project description.
Course presentation
This is a short presentation to the class. You should discuss:
Research questions/linguistic phenomena/main ideas you wanted to model
Why is this relevant? (Spend a lot of time on the research questions and their relevance. Describing the big picture is important!)
Data, if you are using a data-driven approach: source, size
You will need to prepare slides for this, which you submit to the instructor ahead of time.
Final report
This is a 5-6 page document (single-spaced, single column) that describes the results of your project. This is a revised version of your intermediate project report. It needs to contain the following information:
Research questions/linguistic phenomena covered/main ideas pursued
Data: source, size, other relevant statistics
If you build on previous work, you need to discuss it, and give references. Published papers (at conferences, in journals) go into the references list at the end of the paper. Links to blog posts and the like go in a footnote. Also, links to websites containing data go in a footnote, not in the references list.
You need to take into account the feedback that you got on the Initial project description and Intermediate report.
Course project ideas
Context-based vectors/embeddings for words
Comparing general and specific terms (hyponyms and hypernyms) in vector spaces
Exploring prejudice in vector spaces, and possibly removing it
In this context, check out the paper Man is to Computer Programmer as Woman is to Homemaker? Debiasing word embeddings.
Exploring analogy reasoning in vector spaces
Make vectors for occurrences of words, and group (cluster) them into senses
What clusters of words (clustered by vector representations) are used a lot in a politician's speech, or in top-10 songs?
Comparing general and specific words (like "animal" versus "dog") in vector spaces: can you detect which specific words go with which general words? How well does this work in different spaces?
Compute your own:
How do people use emojis? That is, what are the context vectors of emojis?
Compute vector representations from two different time periods: How have word meanings changed? Or, how has the discourse/use around the words changed?
Compute vector representations from two different corpus collections, and do the same kind of analysis
Topic modeling for documents
Automatically determine topics (word groups) that occur a lot in a collection of documents. Can you see patterns in which documents tend to have which topics?
Structured meaning representations
Build a system for automatic word sense disambiguation or semantic role labeling using machine learning
Build a system that automatically identifies events in text, using a tool that gives you the syntactic structure of a sentence and using rules that identify events in that syntactic structure
Build a system that automatically identifies medication names, or illness names, in medical texts
Links and additional readings
Tutorials and texts about distributional models
Jurafsky and Martin, chapter 6 of their upcoming 3rd edition of Speech and Language Processing
Count-based distributional representations:
Embeddings, based on neural networks/ deep learning:
Probabilistic approaches: Latent Dirichlet Allocation (LDA), topic models:
Mark Steyvers & Thomas Griffiths (2007), Probabilistic topic models
Kevin Knight (2009), Bayesian Inference with Tears
Alexander Koller, LDA with pirates and ninjas, and non-parametric Bayesian models with pirates and ninjas
"Semantic directions" in distributional spaces
Vector offsets and analogies: visualization at https://lamyiowce.github.io/word2viz/, a simple primer with python code at https://towardsdatascience.com/how-to-solve-analogies-with-word2vec-6ebaf2354009
Bolukbasi et al on biases in word embeddings and de-biasing: https://proceedings.neurips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html
Many thanks to Fei Guo for pointing out a paper on the "geometry of culture" by Kozlowski et al: https://journals.sagepub.com/doi/full/10.1177/0003122419877135
This uses the same idea as Bolukbasi et al: Collect some words that together characterize a "semantic direction", for example "poor", "poverty", "inexpensive", "impoverished" for a direction of poverty. Sum them up to get their average direction. Analyze other words for how strongly they have this "semantic direction" in them.
Grand et al use the idea of semantic directions to order words on a scale, for example animals on a scale from small to large. They argue that if this can be done with word embeddings, this shows that people can in principle learn such information from text alone because people, too, are sensitive to statistical regularities in text data: https://arxiv.org/abs/1802.01241
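The vector-offset recipe for analogies mentioned above can be sketched in numpy. The 2-d vectors are hand-made toys; real experiments use learned embeddings:

```python
import numpy as np

# Made-up toy vectors, not learned embeddings.
vecs = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

# "man is to king as woman is to ?": compute king - man + woman,
# then find the nearest word vector by cosine similarity.
target = vecs["king"] - vecs["man"] + vecs["woman"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Exclude the three query words, as the standard evaluation does.
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)  # → queen
```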
Pre-computed spaces ready for use
Start here: a collection of many pre-computed spaces
For many languages: FastText vectors for 157 languages
Links that may give you ideas for course projects
SemEval, a series of workshops on semantics-related tasks: They come up with 10-ish new tasks every year, and they offer data for it, so this may be an easy way to get data
Automatically getting a structured meaning representation: See the Semantic Role Labeling demo and the Open Information Extraction demo from AllenNLP.
How can we use computational tools to answer linguistic questions? For ideas about what others have done (often projects much too large for one semester!), see the Society for Computing in Linguistics conferences
Logic-based semantics in the Natural Language Toolkit
NLTK book chapter: Analyzing semantics in the Natural Language Toolkit
A paper by the authors of the semantics part of NLTK: Dan Garrette and Ewan Klein (2009), An Extensible Toolkit for Computational Semantics. Proceedings of IWCS.
Discourse representation (DRT) in NLTK: the package
Some freely available corpora
Data that can be used to build distributional models:
The WaCKy corpora, including UKWaC (English web text, 2B words), Wackypedia (an English Wikipedia dump, 2B words), web corpora for French, German, and Italian. Ask me about a parsed version of UKWaC and Wackypedia if you need syntactic analysis.
Structured semantic annotation:
The Groningen Parallel Meaning Bank
Systems and online demos for logic-based semantic analysis
LinGO English Resource Grammar: demo that generates Minimal Recursion Semantics representations
Theorem proving (automatic inference): Prover9 and Mace4
Systems and online demos for structured semantics and syntactic preprocessing
Semantic role labeling: https://demo.allennlp.org/semantic-role-labeling
Open information extraction: https://demo.allennlp.org/open-information-extraction
Coreference resolution: https://demo.allennlp.org/coreference-resolution
Stanford CoreNLP: https://stanfordnlp.github.io/CoreNLP/demo.html
Preprocessing with spacy (in Python): https://spacy.io/usage/facts-figures
Additional readings about logic-based computational semantics
An in-depth overview of everything:
Jurafsky and Martin, chapter 15 of their upcoming 3rd edition of Speech and Language Processing
Practical guides to building logic-based semantics:
Blackburn and Bos, using Prolog: Representation and Inference for Natural Language,
van Eijck and Unger, using Haskell: Computational Semantics with Functional Programming
Focusing on the theory:
L.T.F. Gamut: Logic, Language, and Meaning (2 volumes). Volume 1