LSA Summer Institute 2023: Distributional Spaces and Word Meaning
Distributional spaces represent the meanings of words (or word occurrences) as points in some "semantic space" where similar meanings are close together and dissimilar meanings are farther apart.
This semantic space is computed from data, often just from written texts produced by many people. So we can view the semantic space as a kind of "compressed corpus", a record of utterances from many speakers. The structure of the space is determined by regularities in word co-occurrences across the utterances. We can probe this space to ask questions about lexical semantics.
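As a small illustration of this idea, here is a minimal sketch (using a made-up toy corpus, not data from the course) that builds count-based co-occurrence vectors and compares them with cosine similarity; words that occur in similar contexts end up close together in the space.

```python
# A minimal count-based distributional space over a toy corpus (hypothetical example).
import numpy as np
from collections import Counter, defaultdict

corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the mouse ate the cheese",
    "the dog ate the bone",
]

window = 2                   # context window size on each side of the target
cooc = defaultdict(Counter)  # target word -> Counter of context words

for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooc[target][tokens[j]] += 1

# Build a word-by-word co-occurrence count matrix.
vocab = sorted(cooc)
index = {w: k for k, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))
for w, contexts in cooc.items():
    for c, count in contexts.items():
        M[index[w], index[c]] = count

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(M[index["cat"]], M[index["dog"]]))     # words with similar contexts
print(cosine(M[index["cat"]], M[index["cheese"]]))  # words with less similar contexts
```

In this toy space, "cat" and "dog" come out as more similar than "cat" and "cheese", simply because they share more contexts in the corpus.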
In this course, we focus on distributional spaces as "compressed corpora" and on how we can use them for linguistic analyses. We discuss the ideas and the mathematics behind these models, recent use cases in linguistics, and best practices for using these models, insofar as they already exist.
An introduction to distributional models
This is a general introduction to the main underlying ideas of distributional models.
Particularly relevant readings (for after class, to reinforce what we discussed):
Alessandro Lenci (2008), “Distributional semantics in linguistic and cognitive research”, Italian Journal of Linguistics 20(1) https://linguistica.sns.it/RdL/20.1/ALenci.pdf
Gemma Boleda (2020), "Distributional Semantics and Linguistic Theory", Annual Review of Linguistics 6:213–234.
Katrin Erk (2012), "Vector space models of word meaning and phrase meaning: a survey", Language and Linguistics Compass 6(10), 635–653 (an older paper, but good for the count-based model basics)
Methods for working with distributional models
The main methods for working with distributional models, focusing on methods that work both for word type vectors and for word token vectors.
Recommended readings: no single paper stands out here; check the slides for many recommendations.
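One example of such a method, sketched below with hypothetical random vectors standing in for real ones, is clustering: grouping a set of vectors (word type vectors, or the token vectors of one word across many sentences) with k-means. Clustering comes up again in the multi-prototype readings listed later.

```python
# Clustering a set of vectors with k-means (hypothetical random vectors as stand-ins).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Pretend these are 20 occurrence (token) vectors of one word, 50 dimensions each.
vectors = rng.normal(size=(20, 50))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)                  # cluster assignment for each vector
print(kmeans.cluster_centers_.shape)   # (3, 50): one centroid ("prototype") per cluster
```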
Using neural networks to compute distributional models
How do prediction-based models work, both at the word type level and at the word token level?
Helpful readings (recommended after class, to reinforce what we discussed):
Dan Jurafsky and James H. Martin, "Speech and Language Processing", 3rd edition, chapter 5: Logistic Regression
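To connect the logistic regression reading to prediction-based word vectors, here is a minimal sketch (with made-up words and random initialization, not an actual word2vec implementation) of the core idea behind skip-gram with negative sampling: treat "did context word c occur near target word w?" as a binary logistic regression problem, and update both vectors by gradient descent.

```python
# Skip-gram with negative sampling viewed as binary logistic regression (toy sketch).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "chased", "cheese", "the"]
idx = {w: i for i, w in enumerate(vocab)}
dim, lr = 10, 0.1

W = rng.normal(scale=0.1, size=(len(vocab), dim))  # target (word) vectors
C = rng.normal(scale=0.1, size=(len(vocab), dim))  # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(target, context, label):
    """One gradient step; label 1 = observed pair, label 0 = negative sample."""
    t, c = idx[target], idx[context]
    pred = sigmoid(W[t] @ C[c])   # predicted probability that the pair co-occurs
    error = pred - label          # derivative of the logistic loss wrt the score
    grad_w = error * C[c]
    grad_c = error * W[t]
    W[t] -= lr * grad_w
    C[c] -= lr * grad_c

# An observed pair (positive example) and a randomly drawn negative example.
train_pair("cat", "chased", 1)
train_pair("cat", "cheese", 0)
```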
Using word token embeddings
Recent neural models give us access to embeddings (vectors) not just for a word, but for a word in a particular sentence context (a word token). What can we do with that? What new kinds of studies are now being done?
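As a concrete starting point, here is a minimal sketch of how one might pull such a token vector out of a pretrained model with the Hugging Face transformers library; the model name (bert-base-uncased) and the example sentence are just for illustration.

```python
# Extracting a contextualized (token) embedding from a pretrained model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The bank approved the loan."
encoding = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    output = model(**encoding)

# One vector per wordpiece token, from the top layer: shape (1, num_tokens, 768).
hidden = output.last_hidden_state

# Find the wordpiece position of "bank" and pull out its vector.
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
bank_position = tokens.index("bank")
bank_vector = hidden[0, bank_position]   # a 768-dimensional token embedding
print(bank_vector.shape)
```

From here, the same methods used for word type vectors apply: cosine similarities between token vectors, averaging tokens into type vectors, or clustering occurrences of one word.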
Recommended readings (for after class, to review what we discussed):
An overview of where we currently stand with word token models and semantics:
Ellie Pavlick (2022), "Semantic Structure in Deep Learning"
Readings on technical problems with the semantic space in Transformer language models:
Kawin Ethayarajh (2019), "How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings"
William Timkey and Marten van Schijndel (2021), "All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality"
Readings on embedding clusters and interpretable features:
Gabriella Chronis and Katrin Erk (2020), "When is a bishop not like a rook? When it's like a rabbi! Multi-prototype BERT embeddings for estimating semantic relationships"
Gabriella Chronis, Kyle Mahowald, and Katrin Erk (2023), "A Method for Studying Semantic Construal in Grammatical Constructions with Interpretable Contextual Embedding Spaces"
Readings on the nature of meaning, and on whether to really "understand" you need a denotation, a world:
Extra notebooks
Here are some additional notebooks that could be useful, though we won't have time to go through them in class. They are:
a detailed walk-through on how to compute a count-based distributional space from scratch, including computing cosine similarity, transforming counts to Pointwise Mutual Information weights, and visualization: notebook.
two notebooks on how to build a simple neural network in PyTorch. The first demonstrates how to build a logistic regression model, and the second demonstrates how to build a very simple word2vec model (a rough sketch of this second kind of model follows below).
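Not a substitute for the notebooks, but as a rough sketch of the kind of model the second notebook builds, here is a very small skip-gram (word2vec-style) model in PyTorch, with made-up indices standing in for real training data.

```python
# A very small skip-gram (word2vec-style) model in PyTorch (illustrative sketch only).
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.target = nn.Embedding(vocab_size, dim)   # word (target) vectors
        self.context = nn.Embedding(vocab_size, dim)  # context vectors

    def forward(self, target_ids, context_ids):
        # Score each (target, context) pair by the dot product of their vectors.
        t = self.target(target_ids)
        c = self.context(context_ids)
        return (t * c).sum(dim=-1)

model = SkipGram(vocab_size=1000, dim=50)
loss_fn = nn.BCEWithLogitsLoss()  # binary question: did the pair occur together or not?
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One toy training step on made-up indices: first pair positive, second negative.
targets = torch.tensor([3, 3])
contexts = torch.tensor([17, 245])
labels = torch.tensor([1.0, 0.0])

optimizer.zero_grad()
loss = loss_fn(model(targets, contexts), labels)
loss.backward()
optimizer.step()
```

The two embedding tables mirror word2vec's separation of target and context vectors, and the dot product scored with a binary logistic loss is exactly the logistic regression view of prediction-based models from the earlier session.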