Demo: The building blocks of a distributional model

The whole script, along with a very small and a small collection of data (demo_corpora_mini, demo_corpora_small), is here.

Step 1: Choose a corpus

  • More data is usually better, because you get more stable estimates. But more data also creates processing headaches: longer processing times and larger memory requirements.
  • Genre matters! Dekang Lin, when computing a model only from newspaper data, got "captive" as the word that was overall most similar to "westerner", and vice versa. He also got "Republican" and "terrorist" as very close neighbors of "adversary" (though it is not clear whether they were found as synonyms or antonyms, as distributional models are not good at distinguishing the two).

For demonstration purposes, we choose a couple of gothic novels from Project Gutenberg, available on Canvas in two versions: short (for in-class demonstration) and full set (about 1 million words).

Here is code that reads in one file of this collection at a time:

# Replace this directory with one on your own machine

demo_dir = "/Users/katrinerk/Desktop/demo_corpora_short/"

import os

# We iterate over the corpus files.

# os.listdir lists the names of all files in a directory
for filename in os.listdir(demo_dir):
    if filename.endswith("txt"):
        print("reading file", filename)
        text = open(os.path.join(demo_dir, filename)).read()
           


Step 2: Choose preprocessing steps

  • You probably want to lowercase all words.
  • Do you want to keep wordforms, or lemmatize them, or use a stemmer? Wordforms will have counts dispersed over different forms of the same word. Lemmatization is more accurate than stemming, but stemming is faster.
  • Do you want to apply part-of-speech tagging, for example to distinguish object-N and object-V? 
  • Do you want to eliminate some words up front? Stopwords? Or maybe all words except a few classes of content words, typically NN (nouns), JJ (adjectives), VB (verbs), RB (adverbs)?

Here is code that splits a string into words and lowercases them. You have done further preprocessing in homework 1; a sketch of some of the other preprocessing options follows the basic version below.


##
# NLTK processing objects

import nltk

import string

def preprocess(s):
    # split up into words, lowercase, remove punctuation at beginning and end of word
    return [ w.lower().strip(string.punctuation) for w in s.split() ]

# or like this:
# def preprocess(s):
#     words =  [ ]
#     for w in s.split():
#         word = w.lower()
#         word = word.strip(string.punctuation)
#         words.append(word)
#     return words



# use the function like this:

preprocess("This is a test.")



Step 3: Define a frequency cutoff

Infrequent words are not useful for distributional models. An infrequent target word will have too low counts on all dimensions to get good similarity estimates. An infrequent context item will not contribute much to the representation of any targets. Also, the more different context items you count, the larger your table of counts. The table of counts can quickly become unwieldy, and using a frequency cutoff is a good way of keeping it manageable.

We go through the whole corpus, and count the frequency of all words in it. Then we retain only the N most frequent words, for example N=2000 for our small demo corpus, or N=10,000 for a larger corpus. We use these words as both the list of target words and the list of context items.

The following code does the counting and retains only the N most frequent words.


#####################
# Counting words:
# We want to make a list of the N most frequent words in our corpus

import os

def do_word_count(demo_dir, numdims):
    # we store the counts in word_count
    # using NLTK's FreqDist
    word_count = nltk.FreqDist()
   
    # We iterate over the corpus files
    for filename in os.listdir(demo_dir):
        if filename.endswith("txt"):
            print("reading file", filename)
            text = open(os.path.join(demo_dir, filename)).read()
            word_count.update(preprocess(text))
           
    # keep_wordfreq is a list of (word, frequency) pairs
    keep_wordfreq = word_count.most_common(numdims)
    keep_these_words = [ w for w, freq in keep_wordfreq ]
    # print("Target words:\n", keep_these_words, "\n")
   
    return keep_these_words

# or like this, without FreqDist:
# def do_word_count(demo_dir, numdims):
#     word_count = { }
#
#     for filename in os.listdir(demo_dir):
#         if filename.endswith("txt"):
#             print("reading file", filename)
#             text = open(os.path.join(demo_dir, filename)).read()
#             for word in preprocess(text):
#                 if word not in word_count:
#                     word_count[ word ] = 0
#                 word_count[ word ] += 1
#
#     # sort by frequency, most frequent first, and keep the top numdims words
#     def map_word_to_count(word): return word_count[ word ]
#     keep_these_words = sorted(word_count.keys(), key = map_word_to_count, reverse = True)[:numdims]
#
#     # print("Target words (and also dimensions):\n", keep_these_words, "\n")
#
#     return keep_these_words



##
# run this:
def test_wordcount():
    print("Doing a frequency-based cutoff: keeping only the N most frequent context words.")
   
    # with 10 dimensions
    keepwords = do_word_count(demo_dir, 10)
    print("Keeping only 10 dimensions, then I get:", keepwords, "\n")

    # with 100 dimensions
    keepwords = do_word_count(demo_dir, 100)
    print("Keeping 100 dimensions, then I get:", keepwords, "\n")


Step 4: Choose a context window

We want to make a table of counts for each target word. This table of counts will show how often each context item co-occurred with the target word. But what does it mean to co-occur?
  • A "narrow context window" is one that counts 2 or 3 words on either side of the target, or even all words in the sentence where the target occurs. If you count 2 or 3 words on either side of the target: Do you want to cross sentence boundaries?
  • A "wide context window" is one that counts 20 or 50 words on either side of the target, ignoring sentence boundaries, or maybe all words that occur in the same document as the target. (That only makes sense if you have lots of documents.)
  • If you have a corpus that has been analyzed for sentence structure by a syntactic parser, you can also say that your context window is all the parse tree snippets that link directly to your target word in the parse of the sentence. (But we don't do that here.)

Here is a function that counts co-occurrences in a window of 2 words on either side of the target. It takes as input a sequence of words (assumed to be a single sentence), and returns a list of pairs (word1, word2) where word1 is a target and word2 is a context item that co-occurs with the target in the relevant window.


###
# identifying context words for a narrow context window of 2 words on either side
# of the target:
# takes as input a sequence of words for counting.
# For each word in the sequence, make 4 pairs:
# (word, left neighbor of word), (word, left neighbor of left neighbor of word),
# (word, right neighbor of word), (word, right neighbor of right neighbor of word),
# so pair each word with all its context items in the context window.
# Return a list of these pairs.
def co_occurrences(wordsequence):
    target_context_pairs = [ ]

    # for a sequence of length N, index runs from 0 to N-2,
    # so that index+1 always exists
    for index in range(len(wordsequence) - 1):
        # count that word[index] as a target co-occurred with the next word as a context item,
        # and vice versa
        target_context_pairs.append( (wordsequence[index], wordsequence[index+1]) )
        target_context_pairs.append( (wordsequence[index+1], wordsequence[index]) )

        if index + 2 < len(wordsequence):
            # there is a word 2 words away
            # count that word[index] as a target co-occurred with the but-next word as a context item,
            # and vice versa
            target_context_pairs.append( (wordsequence[index], wordsequence[index+2]) )
            target_context_pairs.append( (wordsequence[index+2], wordsequence[index]) )

    return target_context_pairs

###
# run this to test co-occurrences
def test_cooccurrences():
    text = """You will not find Dr. Jekyll; he is from home," replied Mr. Hyde"""
    print("Testing the function that pairs up each target word with its context words.")
    print("Original text:", text, "\n")

    words = preprocess(text)
    cooc = co_occurrences(words)
    print("These are the target/context pairs:", cooc, "\n")



Step 5: Do the actual counting

We store the counts in a table with one row for each target word, and one column for each context item. If our list of target words (which is also our list of context items) were "apple", "grass", "truck", this table could look like this:

         apple  grass  truck
  apple      0     12      9
  grass     12      0      1
  truck      9      1      0

It is symmetric, as "apple" as a target occurs with "grass" as a context item just as often as "grass" as a target occurs with "apple" as a context item. (In principle, we would only have to store half the matrix because it is symmetric, but we don't make use of that here to keep the code simple.)

The numpy package has a data type "array" that fits our purposes nicely. See the tutorial for more info on numpy arrays.


##
# We will need the function make_word_index below.
# It maps each word that we want to keep around as a context item
# to an index, which will be its place in the table of counts,
# that is, its dimension in the space
def make_word_index(keep_these_words):
    # make an index that maps words from 'keep_these_words' to their index
    word_index = { }
    for index, word in enumerate(keep_these_words):
        word_index[ word ] = index

    return word_index

import numpy

# read all files in demo_dir, and compute a counts vector
# of length numdims for each relevant word.
# The function takes as input also a mapping word_index from relevant words
# to their dimension, from which we derive a set relevant_words.
# This function reads the texts one sentence at a time.
# In each sentence, it identifies context words in the window
# defined by co_occurrences(), and stores them if both the target
# and its context words are relevant_words
def make_space(demo_dir, word_index, numdims):

    # relevant words: those that have an entry in word_index
    relevant_words = set(word_index.keys())

    # space: a mapping from relevant_words to an array of integers (raw counts)
    space = { }
    # fill the space with all zeros.
    for word in relevant_words:
        space[ word ] = numpy.zeros(numdims, dtype = int)

    ##
    # Design decision: We want to take sentence boundaries into account
    # when computing distributional representations.
    # So we need to detect sentence boundaries first.
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

    # We iterate over the corpus files
    # and count word co-occurrences in a window of 2
    for filename in os.listdir(demo_dir):
        if filename.endswith("txt"):
            print("reading file", filename)
            # read the text
            text = open(os.path.join(demo_dir, filename)).read()
            # split the text into sentences
            sentences = sent_detector.tokenize(text)
            # process one sentence at a time
            for sentence in sentences:
                words = preprocess(sentence)

                # determine pairs of co-occurrences to count,
                # and store them in the matrix
                for target, cxitem in co_occurrences(words):
                    # are these two words relevant?
                    if target in relevant_words and cxitem in relevant_words:
                        # what is the column for this context item?
                        cxitem_index = word_index[ cxitem ]
                        # now count
                        space[ target ][cxitem_index] += 1


    return space

###
# run this
def test_space():
    numdims = 50
    # which words to use as targets and context words?
    ktw = do_word_count(demo_dir, numdims)
    # mapping words to an index, which will be their column
    # in the table of counts
    wi = make_word_index(ktw)
    words_in_order = sorted(wi.keys(), key=lambda w:wi[w])
   
    print("word index:")
    for word in words_in_order:
        print(word, wi[word], end= " ")
    print("\n")

    space = make_space(demo_dir, wi, numdims)
   
    print("some words from the space")
    for w in words_in_order[:10]:
        print(w,  space[w], "\n")


Do you need a sparse matrix?

A sparse matrix is a data type that only stores the non-zero entries in a table. If we have a matrix with lots of 0's (which we do, for the counts), this will save a lot of memory.
It is somewhat slower in processing than a normal (dense) matrix though.

The scipy package has not one, but several data types for sparse matrices. While we count co-occurrences, the matrix is filled incrementally. For this, the scipy.sparse.lil_matrix type can be used, or scipy.sparse.coo_matrix. Updating individual counts in a lil_matrix is slower than building a coo_matrix in one go from complete lists of entries, but the lil_matrix is more straightforward to use. If you wanted to use a lil_matrix, the only thing you would need to change is the initialization of the matrix:

import scipy.sparse


# make a completely empty sparse matrix
# with one row per target word and one column per context item,
# in which we want to store integers (that's the dtype)
space = scipy.sparse.lil_matrix((rows, cols), dtype = int)

Then the counts in the matrix can be updated in the same way as in the numpy array above. For the coo_matrix, adding counts is less straightforward. You find a description of that matrix at https://scipy-lectures.github.io/advanced/scipy_sparse/coo_matrix.html .

After all the counts have been recorded in the sparse matrix, you should convert it to a different scipy sparse matrix data type for doing math, as csr_matrix is much faster for that:

space = space.tocsr()
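
To make this concrete, here is a minimal sketch of the sparse variant of the counting step. It assumes the same setup as make_space() above: every word in word_index gets one row (as a target) and one column (as a context item). The function name make_sparse_space is my own, not part of the original demo.


##
# sketch: counting co-occurrences in a sparse matrix instead of a dictionary
# of numpy arrays. 'pairs' is a list of (target, contextitem) pairs
# as produced by co_occurrences().
import scipy.sparse

def make_sparse_space(pairs, word_index):
    numwords = len(word_index)
    # lil_matrix allows incremental updates of individual cells
    space = scipy.sparse.lil_matrix((numwords, numwords), dtype = int)

    for target, cxitem in pairs:
        if target in word_index and cxitem in word_index:
            space[ word_index[target], word_index[cxitem] ] += 1

    # convert to csr for doing math afterwards
    return space.tocsr()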

See the documentation, the scipy tutorial, and these lecture notes for more information on scipy sparse matrices.

Step 6: Transform counts to association weights

All words will have high co-occurrence counts with the most frequent context items. In our demo dataset, these are i-PR, the-DT, man-NN, on-CD, could-MD. This will falsely inflate all our similarity estimates. What we want to know instead is how strongly a target word is associated with a context item: Does it appear with the context item more often than we could expect at random? Less often? About as often as we would expect?

There are multiple options for computing degree of association:

  • tf-idf (term frequency / inverse document frequency)
  • pointwise mutual information (PMI)
  • positive pointwise mutual information (PPMI): just change negative PMI values to zero
  • local mutual information (LMI)

We do PPMI here. The PMI of a target word t and context item c is defined as:

 

                    P(t, c)
PMI(t, c) = log ------------------
                    P(t) P(c)


All the probabilities are computed from the table of counts. We need:

  • #(t, c): the co-occurrence count of t with c
  • #(_, _): the sum of counts in the whole table, across all targets
  • #(t, _): the sum of counts in the row of target t
  • #(_, c): the sum of counts in the column of context item c


Then we have:

  • P(t, c) = #(t, c) / #(_, _)
  • P(t) = #(t, _) / #(_, _)
  • P(c) = #(_, c) / #(_, _)
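
As a quick worked example (my own illustration, using the toy apple/grass/truck table from Step 5 and the natural logarithm, which is what numpy.log computes): for the target "apple" and the context item "grass" we have #(t, c) = 12, #(t, _) = 21, #(_, c) = 13, and #(_, _) = 44.


# worked example: PMI(apple, grass) from the toy table in Step 5
import math
p_tc = 12 / 44         # P(t, c)
p_t = 21 / 44          # P(t)
p_c = 13 / 44          # P(c)
pmi = math.log( p_tc / (p_t * p_c) )
print(round(pmi, 2))   # about 0.66; it is positive, so PPMI is the same value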

Here is the code for computing PPMI:



#########
# transform the space using positive pointwise mutual information

# target t, dimension value c, then
# PMI(t, c) = log ( P(t, c) / (P(t) P(c)) )
# where
# P(t, c) = #(t, c) / #(_, _)
# P(t) = #(t, _) / #(_, _)
# P(c) = #(_, c) / #(_, _)
#
# PPMI(t, c) =   PMI(t, c) if PMI(t, c) > 0
#                0 else
def ppmi_transform(space, word_index):
    # #(t, _): for each target word, sum up all its counts.
    # row_sums is a dictionary mapping from target words to row sums
    row_sums = { }
    for word in space.keys():
        row_sums[word] = space[word].sum()

    # #(_, c): for each context word, sum up all its counts
    # This should be the same as #(t, _) because the set of targets
    # is the same as the set of contexts.
    # col_sums is a dictionary mapping from context word indices to column sums
    col_sums = { }
    for index in word_index.values():
        col_sums[ index ] = sum( [ vector[ index ] for vector in space.values() ])

    # sanity check: row sums same as column sums?
    for word in space.keys():
        if row_sums[word] != col_sums[ word_index[word]]:
            print("whoops, failed sanity check for", word, row_sums[word], col_sums[word_index[word]])
   
    # #(_, _): overall count of occurrences. sum of all row_sums
    all_sums = sum(row_sums.values())

    # if all_sums is zero, there's nothing we can do
    # because we then cannot divide by #(_, _)
    if all_sums == 0:
        print("completely empty space, returning it unchanged")
        return space

    # P(t) = #(t, _) / #(_, _)
    p_t = { }
    for word in space.keys():
        p_t[ word ] = row_sums[ word ] / all_sums

    # P(c) = #(_, c) / #(_, _)
    p_c = { }
    for index in col_sums.keys():
        p_c[ index ] = col_sums[ index ] / all_sums

    # ppmi_space: a mapping from words to vectors of values
    ppmi_space = { }
    # first we map from words to values P(t, c)
    for word in space.keys():
        ppmi_space[ word ] = space[ word ] / all_sums
    # divide each entry by P(t)
    for word in space.keys():
        if p_t[ word ] == 0:
            # I haven't seen this word ever, so I cannot
            # divide by P(t). But the whole entry for this word
            # should be 0's, so leave as is.
            pass
        else:
            ppmi_space[ word ] = ppmi_space[ word ] / p_t[ word ]
    # divide each entry by P(c)
    for index in p_c.keys():
        if p_c[ index ] == 0:
            # I haven't seen this context item ever,
            # so I cannot divide by P(c).
            # But every target word will have an entry of 0.0
            # on this column, so nothing more to do.
            pass
        else:
            for word in space.keys():
                ppmi_space[ word ][index] = ppmi_space[ word][index] / p_c[ index ]
               
    # take the logarithm, ignore entries that are zero
    for word in space.keys():
        with numpy.errstate(divide="ignore",invalid="ignore"):
            ppmi_space[ word ] = numpy.log(ppmi_space[ word ])
           

    # turn negative numbers to zero
    for word in space.keys():
        ppmi_space[word] = numpy.maximum(ppmi_space[word], 0.0)

    return ppmi_space

###
# run this:
def test_ppmispace():
    numdims = 50
    # which words to use as targets and context words?
    ktw = do_word_count(demo_dir, numdims)
    # mapping words to an index, which will be their column
    # in the table of counts
    wi = make_word_index(ktw)
    words_in_order = sorted(wi.keys(), key=lambda w:wi[w])
   
    print("word index:")
    for word in words_in_order:
        print(word, wi[word], end=" ")
    print("\n")

    space = make_space(demo_dir, wi, numdims)
    ppmispace = ppmi_transform(space, wi)
   
    print("some raw counts vectors and some ppmi vectors")
    for w in words_in_order[:10]:
        print("---------", "\n", w)
        print("raw", space[w])
        # for the PPMI space, we're rounding to 2 digits after the floating point
        print("ppmi", numpy.round(ppmispace[w], 2), "\n")
       



Step 7: Dimensionality reduction

Dimensionality reduction takes a space where each word has a vector of, say, 10,000 dimensions, and reduces it to a space where each word has a vector of something like 300 or 500 dimensions, making the space more manageable.

The new dimensions can be seen as groupings (soft clusterings) of the old dimensions, or as latent semantic classes underlying the old dimensions. A popular choice of dimensionality reduction method is singular value decomposition (SVD). SVD involves representing a set of points in a different space (that is, through a new set of dimensions) in such a way that it brings out the underlying structure of the data.

Here is how we can do this in Python.


#################
# transforming the space using singular value decomposition.
#
def svd_transform(space, originalnumdimensions,keepnumdimensions):
    # space is a dictionary mapping words to vectors.
    # combine those into a big matrix.
    spacematrix = numpy.empty((len(space.keys()), originalnumdimensions))

    rowlabels = sorted(space.keys())

    for index, word in enumerate(rowlabels):
        spacematrix[index] = space[word]

    # now do SVD
    umatrix, sigmavector, vmatrix = numpy.linalg.svd(spacematrix)

    # remove the last few dimensions of u and sigma
    utrunc = umatrix[:, :keepnumdimensions]
    sigmatrunc = sigmavector[ :keepnumdimensions]

    # new space: U %matrixproduct% Sigma_as_diagonal_matrix  
    newspacematrix = numpy.dot(utrunc, numpy.diag(sigmatrunc))

    # transform back to a dictionary mapping words to vectors
    newspace = { }
    for index, word in enumerate(rowlabels):
        newspace[ word ] = newspacematrix[index]
       
    return newspace

####
### run this:
def test_svdspace():
    numdims = 50
    # which words to use as targets and context words?
    ktw = do_word_count(demo_dir, numdims)
    # mapping words to an index, which will be their column
    # in the table of counts
    wi = make_word_index(ktw)
    words_in_order = sorted(wi.keys(), key=lambda w:wi[w])
   
    print("word index:")
    for word in words_in_order:
        print(word, wi[word], end=" ")
    print("\n")

    space = make_space(demo_dir, wi, numdims)
    ppmispace = ppmi_transform(space, wi)
    svdspace = svd_transform(ppmispace, numdims, 5)
   
    print("some vectors")
    for w in words_in_order[:10]:
        print("--------------", "\n", w)
        print("raw", space[w])
        # for the PPMI and SVD spaces, we're rounding to 2 digits after the floating point
        print("ppmi", numpy.round(ppmispace[w], 2), "\n")
        print("svd", numpy.round(svdspace[w], 2), "\n")

       


Step 8: Computing similarity

The main way in which distributional vectors are used is for estimating similarity between words. The central idea is that words that appear in similar contexts tend to be similar in meaning. So what we need to do to estimate word similarity from distributional vectors is to compute a similarity measure that determines how similar the vectors of two words are. There are many similarity measures, but the one that is used the most is cosine similarity. If we consider the vector of a word as an arrow from the origin, then two words should be similar if their vectors go in roughly the same direction. Cosine similarity measures this as the cosine of the angle between the two vectors.


##
# similarity measure: cosine
#                           sum_i vec1_i * vec2_i
# cosine(vec1, vec2) = ------------------------------
#                        veclen(vec1) * veclen(vec2)
# where
#
# veclen(vec) = squareroot( sum_i vec_i*vec_i )
#

import math

def veclen(vector):
    return math.sqrt(numpy.sum(numpy.square(vector)))

def cosine(word1, word2, space):
    vec1 = space[ word1 ]
    vec2 = space[word2]

    veclen1 = veclen(vec1)
    veclen2 = veclen(vec2)

    if veclen1 == 0.0 or veclen2 == 0.0:
        # one of the vectors is empty. make the cosine zero.
        return 0.0

    else:
        # we could also simply do:
        # dotproduct = numpy.dot(vec1, vec2)
        dotproduct = numpy.sum(vec1 * vec2)

        return dotproduct / (veclen1 * veclen2)
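
Before running this on the corpus, you can sanity-check the function on a couple of made-up vectors (the toy space below is my own example, not corpus data): vectors pointing in the same direction should get a cosine close to 1, orthogonal vectors a cosine of 0.


##
# quick sanity check of cosine() on made-up vectors
def test_cosine_toy():
    toyspace = { "a": numpy.array([1.0, 2.0, 0.0]),
                 "b": numpy.array([2.0, 4.0, 0.0]),
                 "c": numpy.array([0.0, 0.0, 5.0]) }
    # same direction: cosine close to 1.0
    print(cosine("a", "b", toyspace))
    # orthogonal: cosine 0.0
    print(cosine("a", "c", toyspace))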


#######
# run this:
def test_cosine():
    # this time we're not removing any words
    numdims = 100
    # which words to use as targets and context words?
    ktw = do_word_count(demo_dir, numdims)
    # mapping words to an index, which will be their column
    # in the table of counts
    wi = make_word_index(ktw)
    words_in_order = sorted(wi.keys(), key=lambda w:wi[w])
   
    space = make_space(demo_dir, wi, numdims)
    ppmispace = ppmi_transform(space, wi)
    svdspace = svd_transform(ppmispace, numdims, 5)
   
    print("some cosines")
    print("'lawyer' and 'lean':")
    print("raw", cosine("lawyer", "lean", space))
    print("ppmi", cosine("lawyer", "lean", ppmispace))
    print("svd", cosine("lawyer", "lean", svdspace))
   
    print("'a' and 'and':")
    print("raw", cosine("a", "and", space))
    print("ppmi", cosine("a", "and", ppmispace))
    print("svd", cosine("a", "and", svdspace))

    print("'friendly' and 'cold':")
    print("raw", cosine("friendly", "cold", space))
    print("ppmi", cosine("friendly", "cold", ppmispace))
    print("svd", cosine("friendly", "cold", svdspace))
   


And here is a function that uses cosine to find the words most similar to a given target word. It simply computes the target's cosine similarity to all other words in the collection and sorts them by similarity:


####################
# finding the word most similar to a given target
def most_similar_to(word1, space):

    sims = [ (word2, cosine(word1, word2, space)) for word2 in space.keys() if word2 != word1 ]

    return sorted(sims, key = lambda p:p[1], reverse=True)
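
This computes one cosine at a time. If that gets slow for a larger vocabulary, one option (a sketch of my own, not part of the original demo) is to stack all vectors into a single matrix, normalize the rows, and compute all cosines with one matrix-vector product:


####################
# sketch: the same ranking, vectorized with numpy
def most_similar_to_vectorized(word1, space):
    words = sorted(space.keys())
    matrix = numpy.array([ space[w] for w in words ], dtype = float)

    # normalize each row to length 1; leave all-zero rows as they are
    lengths = numpy.sqrt(numpy.sum(numpy.square(matrix), axis=1))
    lengths[ lengths == 0.0 ] = 1.0
    normalized = matrix / lengths[:, numpy.newaxis]

    # cosines of word1 with every word, as one matrix-vector product
    sims = normalized.dot(normalized[ words.index(word1) ])

    ranked = sorted(zip(words, sims), key = lambda p: p[1], reverse = True)
    return [ (w, s) for w, s in ranked if w != word1 ]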


#############################
# run this:
def test_mostsimilar():
    # this time we're not removing any words
    numdims = 100
    # which words to use as targets and context words?
    ktw = do_word_count(demo_dir, numdims)
    # mapping words to an index, which will be their column
    # in the table of counts
    wi = make_word_index(ktw)
    words_in_order = sorted(wi.keys(), key=lambda w:wi[w])
   
    space = make_space(demo_dir, wi, numdims)
    ppmispace = ppmi_transform(space, wi)
    svdspace = svd_transform(ppmispace, numdims, 5)
   
    print("ten most similar to 'friendly':")
    print("raw", most_similar_to("friendly", space)[:10])
    print("ppmi", most_similar_to("friendly", ppmispace)[:10])
    print("svd", most_similar_to("friendly", svdspace)[:10])         



The whole script in one go

Here are all the pieces from above, put together into a single script. There is an additional function testall(), which runs all the functions on a mini-corpus of maybe 5-6 sentences.


###############
# Demo: The building blocks of a distributional model
# Katrin Erk, September 2016
# This code is meant to demonstrate a simple distributional model.
# It is *not* optimized to deal with large amounts of data!
#
# Suggested use:
# Load file into the Idle shell, choose "Run module",
# then run the test...() function you are interested in.
# But make sure to change the name of the demo directory first.

# the corpus:
# a collection of more-or-less gothic tales from Project Gutenberg
demo_dir = "/Users/katrinerk/Desktop/demo_corpora_mini/"


##########
# Preprocessing:
# Design decision:
# We lowercase all words and strip punctuation
# from the beginning and end of each word.
# We do not remove stopwords, tag, or stem.

import nltk
import string

# input: a string that may contain multiple words.
# output: a list of strings, preprocessed
def preprocess(s):
    # split up into words, lowercase, remove punctuation at beginning and end of word
    return [ w.lower().strip(string.punctuation) for w in s.split() ]

# or like this:
# def preprocess(s):
#     words =  [ ]
#     for w in s.split():
#         word = w.lower()
#         word = word.strip(string.punctuation)
#         words.append(word)
#     return words


##
# Run this:
def test_preprocess():
    print("Preprocessing demo:\n", preprocess("This is a test sentence, which contains some punctuation."))

#####################
# Counting words:
# We want to make a list of the N most frequent words in our corpus

import os

def do_word_count(demo_dir, numdims):
    # we store the counts in word_count
    # using NLTK's FreqDist
    word_count = nltk.FreqDist()
   
    # We iterate over the corpus files
    for filename in os.listdir(demo_dir):
        if filename.endswith("txt"):
            print("reading file", filename)
            text = open(os.path.join(demo_dir, filename)).read()
            word_count.update(preprocess(text))
           
    # keep_wordfreq is a list of (word, frequency) pairs
    keep_wordfreq = word_count.most_common(numdims)
    keep_these_words = [ w for w, freq in keep_wordfreq ]
    # print("Target words:\n", keep_these_words, "\n")
   
    return keep_these_words

# or like this, without FreqDist:
# def do_word_count(demo_dir, numdims):
#     word_count = { }
#
#     for filename in os.listdir(demo_dir):
#         if filename.endswith("txt"):
#             print("reading file", filename)
#             text = open(os.path.join(demo_dir, filename)).read()
#             for word in preprocess(text):
#                 if word not in word_count:
#                     word_count[ word ] = 0
#                 word_count[ word ] += 1
#
#     # sort by frequency, most frequent first, and keep the top numdims words
#     def map_word_to_count(word): return word_count[ word ]
#     keep_these_words = sorted(word_count.keys(), key = map_word_to_count, reverse = True)[:numdims]
#
#     # print("Target words (and also dimensions):\n", keep_these_words, "\n")
#
#     return keep_these_words


##
# run this:
def test_wordcount():
    print("Doing a frequency-based cutoff: keeping only the N most frequent context words.")
   
    # with 10 dimensions
    keepwords = do_word_count(demo_dir, 10)
    print("Keeping only 10 dimensions, then I get:", keepwords, "\n")

    # with 100 dimensions
    keepwords = do_word_count(demo_dir, 100)
    print("Keeping 100 dimensions, then I get:", keepwords, "\n")

##
# We will need the function make_word_index below.
# It maps each word that we want to keep around as a context item
# to an index, which will be its place in the table of counts,
# that is, its dimension in the space
def make_word_index(keep_these_words):
    # make an index that maps words from 'keep_these_words' to their index
    word_index = { }
    for index, word in enumerate(keep_these_words):
        word_index[ word ] = index

    return word_index
  
####################
# Identifying context items for a word

###
# identifying context words for a narrow context window of 2 words on either side
# of the target:
# takes as input a sequence of words for counting.
# For each word in the sequence, make 4 pairs:
# (word, left neighbor of word), (word, left neighbor of left neighbor of word),
# (word, right neighbor of word), (word, right neighbor of right neighbor of word),
# so pair each word with all its context items in the context window.
# Return a list of these pairs.
def co_occurrences(wordsequence):
    target_context_pairs = [ ]

    # for a sequence of length N, index runs from 0 to N-2,
    # so that index+1 always exists
    for index in range(len(wordsequence) - 1):
        # count that word[index] as a target co-occurred with the next word as a context item,
        # and vice versa
        target_context_pairs.append( (wordsequence[index], wordsequence[index+1]) )
        target_context_pairs.append( (wordsequence[index+1], wordsequence[index]) )

        if index + 2 < len(wordsequence):
            # there is a word 2 words away
            # count that word[index] as a target co-occurred with the but-next word as a context item,
            # and vice versa
            target_context_pairs.append( (wordsequence[index], wordsequence[index+2]) )
            target_context_pairs.append( (wordsequence[index+2], wordsequence[index]) )

    return target_context_pairs

###
# run this to test co-occurrences
def test_cooccurrences():
    text = """You will not find Dr. Jekyll; he is from home," replied Mr. Hyde"""
    print("Testing the function that pairs up each target word with its context words.")
    print("Original text:", text, "\n")

    words = preprocess(text)
    cooc = co_occurrences(words)
    print("These are the target/context pairs:", cooc, "\n")

###################
# doing the actual counting:
# We keep the counts for each word in a numpy array.
# The mapping from word to counts is done through a Python dictionary

import numpy

# read all files in demo_dir, and compute a counts vector
# of length numdims for each relevant word.
# The function takes as input also a mapping word_index from relevant words
# to their dimension, from which we derive a set relevant_words.
# This function reads the texts one sentence at a time.
# In each sentence, it identifies context words in the window
# defined by co_occurrences(), and stores them if both the target
# and its context words are relevant_words
def make_space(demo_dir, word_index, numdims):

    # relevant words: those that have an entry in word_index
    relevant_words = set(word_index.keys())

    # space: a mapping from relevant_words to an array of integers (raw counts)
    space = { }
    # fill the space with all zeros.
    for word in relevant_words:
        space[ word ] = numpy.zeros(numdims, dtype = int)

    ##
    # Design decision: We want to take sentence boundaries into account
    # when computing distributional representations.
    # So we need to detect sentence boundaries first.
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

    # We iterate over the corpus files
    # and count word co-occurrences in a window of 2
    for filename in os.listdir(demo_dir):
        if filename.endswith("txt"):
            print("reading file", filename)
            # read the text
            text = open(os.path.join(demo_dir, filename)).read()
            # split the text into sentences
            sentences = sent_detector.tokenize(text)
            # process one sentence at a time
            for sentence in sentences:
                words = preprocess(sentence)

                # determine pairs of co-occurrences to count,
                # and store them in the matrix
                for target, cxitem in co_occurrences(words):
                    # are these two words relevant?
                    if target in relevant_words and cxitem in relevant_words:
                        # what is the column for this context item?
                        cxitem_index = word_index[ cxitem ]
                        # now count
                        space[ target ][cxitem_index] += 1


    return space

###
# run this
def test_space():
    numdims = 50
    # which words to use as targets and context words?
    ktw = do_word_count(demo_dir, numdims)
    # mapping words to an index, which will be their column
    # in the table of counts
    wi = make_word_index(ktw)
    words_in_order = sorted(wi.keys(), key=lambda w:wi[w])
   
    print("word index:")
    for word in words_in_order:
        print(word, wi[word], end= " ")
    print("\n")

    space = make_space(demo_dir, wi, numdims)
   
    print("some words from the space")
    for w in words_in_order[:10]:
        print(w,  space[w], "\n")
   

#########
# transform the space using positive pointwise mutual information

# target t, dimension value c, then
# PMI(t, c) = log ( P(t, c) / (P(t) P(c)) )
# where
# P(t, c) = #(t, c) / #(_, _)
# P(t) = #(t, _) / #(_, _)
# P(c) = #(_, c) / #(_, _)
#
# PPMI(t, c) =   PMI(t, c) if PMI(t, c) > 0
#                0 else
def ppmi_transform(space, word_index):
    # #(t, _): for each target word, sum up all its counts.
    # row_sums is a dictionary mapping from target words to row sums
    row_sums = { }
    for word in space.keys():
        row_sums[word] = space[word].sum()

    # #(_, c): for each context word, sum up all its counts
    # This should be the same as #(t, _) because the set of targets
    # is the same as the set of contexts.
    # col_sums is a dictionary mapping from context word indices to column sums
    col_sums = { }
    for index in word_index.values():
        col_sums[ index ] = sum( [ vector[ index ] for vector in space.values() ])

    # sanity check: row sums same as column sums?
    for word in space.keys():
        if row_sums[word] != col_sums[ word_index[word]]:
            print("whoops, failed sanity check for", word, row_sums[word], col_sums[word_index[word]])
   
    # #(_, _): overall count of occurrences. sum of all row_sums
    all_sums = sum(row_sums.values())

    # if all_sums is zero, there's nothing we can do
    # because we then cannot divide by #(_, _)
    if all_sums == 0:
        print("completely empty space, returning it unchanged")
        return space

    # P(t) = #(t, _) / #(_, _)
    p_t = { }
    for word in space.keys():
        p_t[ word ] = row_sums[ word ] / all_sums

    # P(c) = #(_, c) / #(_, _)
    p_c = { }
    for index in col_sums.keys():
        p_c[ index ] = col_sums[ index ] / all_sums

    # ppmi_space: a mapping from words to vectors of values
    ppmi_space = { }
    # first we map from words to values P(t, c)
    for word in space.keys():
        ppmi_space[ word ] = space[ word ] / all_sums
    # divide each entry by P(t)
    for word in space.keys():
        if p_t[ word ] == 0:
            # I haven't seen this word ever, so I cannot
            # divide by P(t). But the whole entry for this word
            # should be 0's, so leave as is.
            pass
        else:
            ppmi_space[ word ] = ppmi_space[ word ] / p_t[ word ]
    # divide each entry by P(c)
    for index in p_c.keys():
        if p_c[ index ] == 0:
            # I haven't seen this context item ever,
            # so I cannot divide by P(c).
            # But every target word will have an entry of 0.0
            # on this column, so nothing more to do.
            pass
        else:
            for word in space.keys():
                ppmi_space[ word ][index] = ppmi_space[ word][index] / p_c[ index ]
               
    # take the logarithm, ignore entries that are zero
    for word in space.keys():
        with numpy.errstate(divide="ignore",invalid="ignore"):
            ppmi_space[ word ] = numpy.log(ppmi_space[ word ])
           

    # turn negative numbers to zero
    for word in space.keys():
        ppmi_space[word] = numpy.maximum(ppmi_space[word], 0.0)

    return ppmi_space

###
# run this:
def test_ppmispace():
    numdims = 50
    # which words to use as targets and context words?
    ktw = do_word_count(demo_dir, numdims)
    # mapping words to an index, which will be their column
    # in the table of counts
    wi = make_word_index(ktw)
    words_in_order = sorted(wi.keys(), key=lambda w:wi[w])
   
    print("word index:")
    for word in words_in_order:
        print(word, wi[word], end=" ")
    print("\n")

    space = make_space(demo_dir, wi, numdims)
    ppmispace = ppmi_transform(space, wi)
   
    print("some raw counts vectors and some ppmi vectors")
    for w in words_in_order[:10]:
        print("---------", "\n", w)
        print("raw", space[w])
        # for the PPMI space, we're rounding to 2 digits after the floating point
        print("ppmi", numpy.round(ppmispace[w], 2), "\n")
       

#################
# transforming the space using singular value decomposition.
#
def svd_transform(space, originalnumdimensions,keepnumdimensions):
    # space is a dictionary mapping words to vectors.
    # combine those into a big matrix.
    spacematrix = numpy.empty((len(space.keys()), originalnumdimensions))

    rowlabels = sorted(space.keys())

    for index, word in enumerate(rowlabels):
        spacematrix[index] = space[word]

    # now do SVD
    umatrix, sigmavector, vmatrix = numpy.linalg.svd(spacematrix)

    # remove the last few dimensions of u and sigma
    utrunc = umatrix[:, :keepnumdimensions]
    sigmatrunc = sigmavector[ :keepnumdimensions]

    # new space: U %matrixproduct% Sigma_as_diagonal_matrix  
    newspacematrix = numpy.dot(utrunc, numpy.diag(sigmatrunc))

    # transform back to a dictionary mapping words to vectors
    newspace = { }
    for index, word in enumerate(rowlabels):
        newspace[ word ] = newspacematrix[index]
       
    return newspace


####
# run this:
def test_svdspace():
    numdims = 50
    # which words to use as targets and context words?
    ktw = do_word_count(demo_dir, numdims)
    # mapping words to an index, which will be their column
    # in the table of counts
    wi = make_word_index(ktw)
    words_in_order = sorted(wi.keys(), key=lambda w:wi[w])
   
    print("word index:")
    for word in words_in_order:
        print(word, wi[word], end=" ")
    print("\n")

    space = make_space(demo_dir, wi, numdims)
    ppmispace = ppmi_transform(space, wi)
    svdspace = svd_transform(ppmispace, numdims, 5)
   
    print("some vectors")
    for w in words_in_order[:10]:
        print("--------------", "\n", w)
        print("raw", space[w])
        # for the PPMI and SVD spaces, we're rounding to 2 digits after the floating point
        print("ppmi", numpy.round(ppmispace[w], 2), "\n")
        print("svd", numpy.round(svdspace[w], 2), "\n")
       

###
# similarity measure: cosine
#                           sum_i vec1_i * vec2_i
# cosine(vec1, vec2) = ------------------------------
#                        veclen(vec1) * veclen(vec2)
# where
#
# veclen(vec) = squareroot( sum_i vec_i*vec_i )
#

import math

def veclen(vector):
    return math.sqrt(numpy.sum(numpy.square(vector)))

def cosine(word1, word2, space):
    vec1 = space[ word1 ]
    vec2 = space[word2]

    veclen1 = veclen(vec1)
    veclen2 = veclen(vec2)

    if veclen1 == 0.0 or veclen2 == 0.0:
        # one of the vectors is empty. make the cosine zero.
        return 0.0

    else:
        # we could also simply do:
        # dotproduct = numpy.dot(vec1, vec2)
        dotproduct = numpy.sum(vec1 * vec2)

        return dotproduct / (veclen1 * veclen2)


#######
# run this:
def test_cosine():
    # this time we're not removing any words
    numdims = 100
    # which words to use as targets and context words?
    ktw = do_word_count(demo_dir, numdims)
    # mapping words to an index, which will be their column
    # in the table of counts
    wi = make_word_index(ktw)
    words_in_order = sorted(wi.keys(), key=lambda w:wi[w])
   
    space = make_space(demo_dir, wi, numdims)
    ppmispace = ppmi_transform(space, wi)
    svdspace = svd_transform(ppmispace, numdims, 5)
   
    print("some cosines")
    print("'lawyer' and 'lean':")
    print("raw", cosine("lawyer", "lean", space))
    print("ppmi", cosine("lawyer", "lean", ppmispace))
    print("svd", cosine("lawyer", "lean", svdspace))
   
    print("'a' and 'and':")
    print("raw", cosine("a", "and", space))
    print("ppmi", cosine("a", "and", ppmispace))
    print("svd", cosine("a", "and", svdspace))

    print("'friendly' and 'cold':")
    print("raw", cosine("friendly", "cold", space))
    print("ppmi", cosine("friendly", "cold", ppmispace))
    print("svd", cosine("friendly", "cold", svdspace))
   

####################
# finding the word most similar to a given target
def most_similar_to(word1, space):

    sims = [ (word2, cosine(word1, word2, space)) for word2 in space.keys() if word2 != word1 ]

    return sorted(sims, key = lambda p:p[1], reverse=True)


#############################
# run this:
def test_mostsimilar():
    # this time we're not removing any words
    numdims = 100
    # which words to use as targets and context words?
    ktw = do_word_count(demo_dir, numdims)
    # mapping words to an index, which will be their column
    # in the table of counts
    wi = make_word_index(ktw)
    words_in_order = sorted(wi.keys(), key=lambda w:wi[w])
   
    space = make_space(demo_dir, wi, numdims)
    ppmispace = ppmi_transform(space, wi)
    svdspace = svd_transform(ppmispace, numdims, 5)
   
    print("ten most similar to 'friendly':")
    print("raw", most_similar_to("friendly", space)[:10])
    print("ppmi", most_similar_to("friendly", ppmispace)[:10])
    print("svd", most_similar_to("friendly", svdspace)[:10])         


     
 
