LIN350 Computational semantics

Course overview

Syllabus

Adapting the class format to deal with the ongoing pandemic

Schedule

Course Project

Course project requirements

Course project ideas

Links and additional readings

Fall 2021 | Instructor: Katrin Erk| MWF 10-11 | RLP 1.108

Course overview

How can we describe the meaning of words and sentences in such a way that we can process them automatically? That seems a huge task. There are so many words, all with their individual nuances of meaning -- do we have to define them all by hand? And there are so many things we want to do with sentences: Translate them. Answer questions. Extract important pieces of information. Figure out people's opinions. Can we even use one single meaning description to do all these tasks?

In this course, we discuss methods for automatically learning what words mean (at least to some extent) from huge amounts of text -- for example, from all the text that people have made available on the web. And we discuss ways of representing the meaning of words and sentences in such a way that we can use them in language technology tasks.

We will look at two different kinds of meaning representations. Distributional representations, also called embeddings, characterize the meaning of a word or passage as an object in a "meaning space" that is learned automatically from data, in such a way that words with similar meanings will be close together in space. Logical representations translate sentences into a formal representations of the things, people and events that are mentioned and the connections between them, so that we can automatically reason with them and draw conclusions.

Prerequisites: Upper-division standing. LIN350 Analyzing Linguistic Data, or a different introduction to programming, or consent of instructor.

Readings will be made available for download from the course website.

Flags: Quantitatve, Independent Inquiry

Syllabus

Course overview

Course: Computational semantics, LIN350 - 40615
Semester: Fall 2021
Course page: http://www.katrinerk.com/courses/computational-semantics-undergraduate
Course times: MWF 10-11
Course location: RLP 2.210
Course on Canvas: https://utexas.instructure.com/courses/1311336

Instructor and TA contact information

- Instructor: Katrin Erk
  - Office hours: Tuesday 1-3pm, Friday 11-12pm. Office hours will be via Zoom, the link is on Canvas.
  - Office: RLP 4.734
  - email: katrin dot erk at utexas dot edu
- Teaching Assistant: Hongli Zhan
  - Office hours: 2:30pm-4pm, Mondays and Fridays. Office hours will be via Zoom, the link is on Canvas.
  - email: honglizhan at utexas dot edu

Prerequisites

Upper-division standing.
LIN350 Analyzing Linguistic Data, or a different introduction to programming, or consent of instructor.

Syllabus and text

This page serves as the syllabus for this course.

There is no textbook for this course.

Readings will be made available for download from the course website, in the Schedule section.

Content overview

Semantics is a very active area of computational linguistics -- but also a very diverse one. People work on word sense, semantic roles, selectional preferences, logic-based semantics, as well as on many semantics-related tasks and task-specific semantic representations. But there are problems that come up again and again in different tasks, and representation ideas that come up again and again in different variants. In this course, we focus on two influential classes of representations: structured (logic-based) semantics and distributional semantics, and on central phenomena that they address.

This course focuses on two frameworks in semantics, distributional models and logic-based semantics. Topics include:

Embeddings / Distributional representations:
- Using distributional representations to analyze similarity in meaning
- Making embeddings by counting words or using neural networks
Logic-based semantics:
- Translating natural language sentences to logic
- Automatic reasoning
- Automatically constructing the logical representation of a sentence: semantics construction
Structured semantic representations more generally:
- Tasks: word sense disambiguation, semantic role labeling, coreference, and so on
- Variants of structured representation
- Knowledge graphs, and their integration with embeddings

A detailed schedule for the course, with topics for each lecture, is available in the Schedule section.

Flags

This course carries the Quantitative Reasoning flag. Quantitative Reasoning courses are designed to equip you with skills that are necessary for understanding the types of quantitative arguments you will regularly encounter in your adult and professional life. You should therefore expect a substantial portion of your grade to come from your use of quantitative skills to analyze real-world problems.

This course also carries the Independent Inquiry flag. Independent Inquiry courses are designed to engage you in the process of inquiry over the course of a semester, providing you with the opportunity for independent investigation of a question, problem, or project related to your major. You should therefore expect a substantial portion of your grade to come from the independent investigation and presentation of your own work. See the Course Project section for details.

Course requirements and grading policy

Assignments: 60% (4 assigments, 15% each)
Course project:
- Initial project description: 5%
- Intermediate project report: 10%
- Course presentation: 5%
- Final report: 20%

Course projects should be done by teams of 2 students. Projects done by 1 or 3 students are only possible with prior approval of the instructor.

Project presentations will be in the final week of classes, in the order given on the schedule page (which will be generated via Python's random.shuffle()). If possible, all members of a project team should get some time to speak.

Assignments will be updated on Canvas. There will be 4 assignments. A tentative schedule for the entire semester is posted in the Schedule section. Readings may change up one week in advance of their due dates.

This course does not have a midterm or final exam.

Options for course projects, and more details on the project requirements are listed in the Project section.

The course will use plus-minus grading, using the following scale (showing Grade and Percentage):

A >= 93%
A- >= 90%
B+ >= 87%
B >= 83%
B- >= 80%
C+ >= 77%
C >= 73%
C- >= 70%
D+ >= 67%
D >= 63%
D- >= 60%

Attendance is not required. However, given that we will do a lot of hands-on exercises in class, and the homework assignments and the project address the material covered in class, good attendance is essential for doing well in this class.

Extension Policy

If you turn in your assignment late and we have not agreed on an extension beforehand, expect points to be deducted. Extensions will be considered on a case-by-case basis. I urge you to let me know if you are in need of an extension, such that we can make sure that you get the time necessary to complete the assignments.

If an extension has not been agreed on beforehand, then for assignments, by default, 5 points (out of 100) will be deducted for lateness, plus an additional 1 point for every 24-hour period beyond 2 that the assignment is late.

Note that there are always some points to be had, even if you turn in your assignment late. The last day in the semester on which the class meets is the last day to turn in late assignments for grading. Homework assignment submitted after that date will not be graded.

Classroom safety and Covid-19

To help preserve our in person learning environment, the university recommends the following.

Adhere to university mask guidance.
Vaccinations are widely available, free and not billed to health insurance. The vaccine will help protect against the transmission of the virus to others and reduce serious symptoms in those who are vaccinated.
Proactive Community Testing remains an important part of the university’s efforts to protect our community. Tests are fast and free.
Visit protect.utexas.edu for more information

Academic Dishonesty Policy

You are encouraged to discuss assignments with classmates. But all written work must be your own. Students caught cheating will automatically fail the course. If in doubt, ask the instructor.

Notice about students with disabilities

The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 471-6259.

Notice about missed work due to religious holy days

A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.

Emergency Evacuation Policy

Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside. Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building. Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class. In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office. Information regarding emergency evacuation routes and emergency procedures can be found at http://www.utexas.edu/emergency.

Senate Bill 212 and Title IX Reporting Requirements

Under Senate Bill 212 (SB 212), the professor and TAs for this course are required to report for further investigation any information concerning incidents of sexual harassment, sexual assault, dating violence, and stalking committed by or against a UT student or employee. Federal law and university policy also requires reporting incidents of sex- and gender-based discrimination and sexual misconduct (collectively known as Title IX incidents). This means we cannot keep confidential information about any such incidents that you share with us. If you need to talk with someone who can maintain confidentiality, please contact University Health Services (512-471-4955 or 512-475-6877) or the UT Counseling and Mental Health Center (512-471-3515 or 512-471-2255). We strongly urge you make use of these services for any needed support and that you report any Title IX incidents to the Title IX Office.

Adapting the class format to deal with the ongoing pandemic

Here is the plan as of September 15, 2021:

The first three weeks of class will be fully online. We meet on zoom. The links are on the class Canvas page. Email me if you cannot access it.
The following two weeks, Wednesday classes are in person, on September 22 and September 29. Monday and Friday classes are on zoom.
Starting October 4, all classes will be offered in person. All in-person classes will be zoom-streamed and zoom-recorded. You can either come in person or participate via zoom, either way is fine. We'll continue to make class zoom recordings available.

Schedule

This schedule is subject to change.

Assignments are due at the end of the day on their due date. Please submit assignments online on Canvas unless the assignment tells you otherwise.

Readings can be done either before or after class (unless noted otherwise); they are chosen to support the material covered in class.

Week 1: Aug 25, Aug 27

Wednesday: Computational semantics: an overview
- Slides
Friday: Meaning as a space in which you can walk from word to word: an introduction
- Slides

Week 2: Aug 30, Sep1, Sep 3

Monday: Continuing with the introduction to meaning spaces
- Slides (same link as Friday)
- Software we will use for class:
  - We strongly recommend installing Anaconda, as that includes Python along with almost all Python packages we need.
    - If you install anaconda, you will have to add gensim. Here is a tutorial on how to add a package to Anaconda: https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/#installing-a-package
      Please do this, and choose to add gensim.
  - Alternatively, you can individually install:

Python: https://www.python.org/downloads/ Any version >= 3.4 should be fine.
Jupyter notebooks: https://jupyter.org/
numpy: https://numpy.org/

matplotlib: https://matplotlib.org/
NLTK: Installing NLTK itself: http://www.nltk.org/install.html You also need the NLTK data, see http://www.nltk.org/data.html
gensim

Wednesday: Pre-computed meaning spaces, and how to use them
- Jupyter notebook we'll use in class
- For pre-computed meaning spaces, see links below under "Links"
Friday: What can I do for my course project?

Week 3: Sep 6, 8, 10

Monday: Labor day
Wednesday: Using distributional spaces in cognition and lexical semantics
- Slides: also use these to get ideas for your course projects
Friday: How to make a count-based space
- - Jupyter notebook that we'll use in class
  - We'll also continue using these slides
  - An accessible introduction to count-based models is in Section 2 of this paper (but only that section)
  - Supporting material: Jurafsky and Martin ed. 3 ch. 6 up to 6.7

Week 4: Sep 13, 15, 17

Monday: How to make a count-based space and use it
- We'll continue using this Jupyter notebook
Wednesday: Continuing with count-based spaces: using matrix methods for efficiency, and doing dimensionality reduction
- Homework 1 due.
- We continue to use this Jupyter notebook and these slides
Friday: Towards prediction-based spaces. Step 1: Classification
Class on zoom.

Week 5: Sep 20, 22, 24

Monday: Towards prediction-based spaces. Step 2: Logistic regression
Class on zoom.
- Readings: Jurafsky and Martin ed. 3, chapter 5. This goes into much more detail than we did in class. You don't need to read up on details that we did not cover, but you can if you are curious.
Wednesday: Towards prediction-based spaces. Step 3: Word2vec, and how to easily make a word2vec space using gensim
Class is in person, in RLP 1.108. It is also zoom-streamed and recorded. Come in person or stay on zoom -- you decide.
- Jupyter notebook: making a word2vec space with gensim
- Readings: Jurafsky and Martin ed. 3, chapter 6, section 6.8. This goes into more detail than we did in class. You don't need to read up on details that we did not cover, but you can if you are curious.Monday: Prediction-based embeddings in practice: How to compute them, and how to test whether they are okay
Friday: Under the hood: a neural net that will compute a prediction-based space.
Class on zoom.
- Jupyter notebook for class: logistic regression models in pytorch
- More resources on pytorch:

Week 6: Sep 27, Sep 29, Oct 1

Monday: We talk in class about your project ideas.
Class on zoom.
Wednesday: Making your own neural net, continued: deeper networks.
Class is in person, in RLP 1.108. It is also zoom-streamed and recorded. Come in person or stay on zoom -- you decide.
- Jupyter notebook: an example of a deeper model (with more than one layer) in pytorch
- Homework 2 due
Friday: Making your own neural net, continued: a prediction-based space.
Class on zoom.
- Jupyter notebook: Making a prediction-based space from scratch using pytorch

Week 7: Oct 4, 6, 8

Monday: Language models. Embeddings for individual occurrences of a word: BERT and friends
Today and going forward, class is in person, in RLP 1.108. It is also zoom-streamed and recorded. Come in person or stay on zoom -- you decide.
Wednesday: Vector arithmethic: How to compute directions in space that indicate interesting "semantic directions", for example gender
- Jupyter notebook: directions in space
Friday: Topic modeling: automatically detecting underlying themes in documents
- Jupyter notebook: mini demo of topic modeling

Week 8: Oct 11, 13, 15

Monday: Structured meaning representations: word senses and ontologies
- Additional reading: Jurafsky and Martin 3rd edition, Word Senses and WordNet

- Wednesday: Structured meaning representations: semantic role labeling
  - Additional reading: Jurafsky and Martin 3rd edition, Semantic Role Labeling and Argument Structure
  - Lexical resources:
- Friday: Structured meaning representations: Semantic roles continued. Then, coreference
  - Additional reading: Jurafsky and Martin 3rd edition, Coreference resolution

Week 9: Oct 18, 20, 22

- Monday: Structured meaning representations: Events: arguments, subevents, coreference
  - Additional reading: slideset on Canvas

- - Wednesday: An introduction to logic
    - Initial project description due.
    - Material to use in class: a logic puzzle. Also, logical formalizations -- don't look at them before class: logic puzzle with logical formalization, the same as text file, and a Python script that tries to derive the solution of the logic puzzle.
  - Friday: Propositional logic
    - Background reading:
      (Note that these tutorials use a sligthly different notation.)
      - an online tutorial on propositional logic
      - another tutorial on propositional logic
      - or, if you want an in-depth textbook, check out L.T.F. Gamut volume 1. The introduction to propositional logic starts on page 28.

Week 10: Oct 25, 27, 29

- - Monday: Logic and automatic inference
    - Here is an online demo of Robinson Resolution: https://logictools.org/prop.html
      To use it, choose "Propositional logic" from the riders on top. Then in the first row of choices below the example window, choose using: "resolution:naive" showing "html trace". Then hit the "Solve" button.

Wednesday: First-order logic
- Background reading:
  - syntax and semantics of first-order logic, formally
  - For an extended introduction, again look at L.T.F. Gamut volume 1. The discussion of first-order logic starts on page 65.
Friday: "Translating" natural language sentences to logic
- Some sentences to translate to logic
- Another set of sentences, and a world
- Please only look at this after class: solutions to the first and second set of sentences

Week 11: Nov 1, 3, 5

Monday: Translating natural language to logic, continued
Wednesday: Translating natural language to logic, continued
- Homework 3 due.
Friday: Translating natural language to logic, continued

Week 12: Nov 8, 10, 12

Monday: Semantics construction: automatically constructing a logical representation for a sentence. Lambda calculus: lego for semantics construction

Wednesday: More lambda calculus. We go through some examples together.
- Intermediate project report due.
Friday: Final piece of lambda calculus: transitive verbs. Then: Semantics construction in practice with the Natural Language Toolkit
- Notebook: semantics construction with NLTK

Week 13: Nov 15, 17, 19

- Monday: Semantics construction in practice with the Natural Language Toolkit, continued
- Wednesday: Back to structured meaning representations: Abstract Meaning Representations, and the Groningen MeaningBank
  - We'll look at these examples:
- Friday: More on structured meaning representations.

Week 14: Nov 22

Monday: Knowledge graphs
- Homework 4 due.
Wednesday, Friday: Thanksgiving break

Week 15: Nov 29, Dec 1, Dec 3

Monday: Knowledge graphs
Wednesday: Knowledge graphs: construction through Information Extraction, extension through link prediction (using embeddings of graph nodes and edges!)
- Background reading:
  - J&M 3rd edition, chapter on information extraction: examples on pages 2-4, algorithms for relation extraction after that (this goes far beyond what we discuss in class)
  - TransE: the original paper is here (sorry, this is rough reading. I couldn't find any decent tutorial)
Friday: Project presentations
- 10:00 John Steinman
- 10:07 Aubrey Hinchman and Grey Sandstrum
- 10:14 Sebastian Mancha and Grace Huang
- 10:21 Sydney Willett and Kabin Moon
- 10:28 Riddhi Bhave and Pooja Chivukula

Week 16: Dec 6

Monday: Project presentations, and some final words
- 10:00 Alyssa Cantu
- 10:07 Matthew Pabst
- 10:14 Fei Guo
- 10:21 Katrina Gavura and Nicolette Warren
- 10:28 Misty Peng and Manaasa Darisi
- 10:35 Vittoria Byland

Final report due: Friday December 10, end of day

Course Project

Course projects should be done by teams of 2 students. Project groups consisting of 1 or 3 students are possible only with prior approval of the instructor.

Course project requirements

Initial project description

This is a 1-2 page document (single-spaced, single column) that describes what your project will be about. It needs to contain the following information:

- Research questions: What are the main questions that you want to answer, the main language phenomena you want to address, or the main ideas you want to explore?
- Method: What distributional model will you use, or what kinds of rules are you planning to state? Be as detailed as you can. (Yes, I know you will not have worked out every detail at this point, but strive to work out as many as you can.)
- Data: If you do a distributional project, it is vital that you figure out as early as possible what data you can use to learn your model. Is there enough data? Is it freely available? Do you have to contact someone to get it?

Intermediate report

This is a 1-2 page document (single-spaced, single column) that describes what the status of your project is at this point. This is a revised version of your initial project description. It needs to contain the following information:

- Research questions: any changes?
- Method: any changes?
- Status:
  - Describe the data that was obtained: source, size, anything else that is relevant
  - Describe at least two (smaller, and preliminary) concrete results that you have at this point

You also need to take into account the feedback that you got on the Initial project description.

Short presentation

This is a short presentation to the class. You should discuss:

- Research questions/linguistic phenomena/main ideas you wanted to model
- Why is this relevant? (Spend a lot of time on the research questions and their relevance. Describing the big picture is important!)
- Data, if you are using a data-driven approach: source, size
- Results

You will need to prepare slides for this, which you submit to the instructor ahead of time.

Final report

This is a 5-6 page document (single-spaced, single column) that describes the results of your project. This is a revised version of your intermediate project description. It needs to contain the following information:

- Research questions/linguistic phenomena covered/main ideas pursued
- Data: source, size, other relevant statistics
- Method
- Findings

If you build on previous work, you need to discuss it, and give references. Published papers (at conferences, in journals) go into the references list at the end of the paper. Links to blog posts and the like go in a footnote. Also, links to websites containing data go in a footnote, not in the references list.

You need to take into account the feedback that you got on the Initial project description and Intermediate report.

Course project ideas

Context-based vectors/embeddings for words

Use pre-trained:

Comparing general and specific terms (hyponyms and hypernyms) in vector spaces
Exploring prejudice in vector spaces, and possibly removing it
- In this context, check out the paper Man is to Computer Programmer as Woman is to Homemaker? Debiasing word embeddings.
Exploring analogy reasoning in vector spaces
Make vectors for occurrences of words, and group (cluster) them into senses
What clusters of words (clustered by vector representations) are used a lot in a politician's speech, or in top-10 songs?
Comparing general and specific words (like "animal" versus "dog") in vector spaces: can you detect which specific words go with which general words? How well does this work in different spaces?

Compute your own:

How do people use emojis? That is, what are the context vectors of emojis?
Compute vector representations from two different time periods: How have word meanings changed? Or, how has the discourse/use around the words changed?
Compute vector representations from two different corpus collections, and do the same kind of analysis

Topic modeling for documents

Automatically determine topics (word groups) that occur a lot in a collection of documents. Can you see patterns in which documents tend to have which topics?

Structured meaning representations

Build a system for automatic word sense disambiguation or semantic role labeling using machine learning
Build a system that automatically identifies events in text, using a tool that gives you the syntactic structure of a sentence and using rules that identify events in that syntactic structure
Build a system that automatically identifies medication names, or illness names, in medical texts

Links and additional readings

Tutorials and texts about distributional models

Jurafsky and Martin, chapter 6 of their upcoming 3rd edition of Speech and Language Processing
Count-based distributional representations:
- Turney and Pantel overview article
- Erk overview article (complete thing available at the UT library)
Embeddings, based on neural networks/ deep learning:
Probabilistic approaches: Latent Dirichlet Allocation (LDA), topic models:
- Mark Steyvers & Thomas Griffiths (2007), Probabilistic topic models
- Kevin Knight (2009), Bayesian Inference with Tears
- Alexander Koller, LDA with pirates and ninjas, and non-parametric Bayesian models with pirates and ninjas

"Semantic directions" in distributional spaces

Readings:

Vector offsets and analogies: visualization at https://lamyiowce.github.io/word2viz/, a simple primer with python code at https://towardsdatascience.com/how-to-solve-analogies-with-word2vec-6ebaf2354009
Bolukbasi et al on biases in word embeddings and de-biasing: https://proceedings.neurips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html
Many thanks to Fei Guo for pointing out a paper on the "geometry of culture" by Kozlovski et al: https://journals.sagepub.com/doi/full/10.1177/0003122419877135
This uses the same idea as Bolukbasi et al: Collect some words that together characterize a "semantic direction", for example "poor", "poverty", "inexpensive", "impoverished" for a direction of poverty. Sum them up to get their average direction. Analyze other words for how strongly they have this "semantic direction" in them.
Grand et al use the idea of semantic directions to order words on a scale, for example animals on a scale from small to large. They argue that if this can be done with word embeddings, this shows that people can in principle learn such information from text alone because people, too, are sensitive to statistical regularities in text data: https://arxiv.org/abs/1802.01241

Pre-computed spaces ready for use

For English:
- Start here: a collection of many pre-computed spaces

Word2Vec
GloVE
TypeDM
For many languages: FastText vectors for 157 languages

Links that may give you ideas for course projects

SemEval, a series of workshops on semantics-related tasks: They come up with 10-ish new tasks every year, and they offer data for it, so this may be an easy way to get data
Automatically getting a structured meaning representation: See the Semantic Role Labeling demo and the Open Information Extraction demo from AllenNLP.
How can we use computational tool to answer linguistic questions? For ideas (with much too large projects!) for what others have done, see the Society for Computing in Linguistics conferences

Logic-based semantics in the Natural Language Toolkit

NLTK book chapter: Analyzing semantics in the Natural Language Toolkit
A paper by the authors of the semantics part of NLTK: Dan Garrette and Ewan Klein (2009), An Extensible Toolkit for Computational Semantics. Proceedings of IWCS.

NLTK howto: Semantics
Discourse representation (DRT) in NLTK: the package
NLTK API

Some freely available corpora

Data that can be used to build distributional models:

The WaCKy corpora, including UKWaC (English web text, 2B words), Wackypedia (an English Wikipedia dump, 2B words), web corpora for French, German, and Italian. Ask me about a parsed version of UKWaC and Wackypedia if you need syntactic analysis.

Catalan/Spanish/English wikicorpus

Structured semantic annotation:

The Groningen Parallel Meaning Bank
Abstract meaning representations

Systems and online demos for logic-based semantic analysis

LinGO English Resource Grammar: demo that generates Minimal Recursion Semantics representations
Theorem proving (automatic inference): Prover9 and Mace4

Systems and online demos for structured semantics and syntactic preprocessing