Language and Computers n-gram demo

Here is a small Python program that demonstrates word n-grams and conditional probabilities.

To run it, you need Python 3 and the Natural Language Toolkit (NLTK). Also make sure to download the NLTK data, since the program reads the Brown corpus from it.
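The NLTK data can also be fetched from within Python. The sketch below is one possible way to do that. The package list and the helper name download_demo_data are assumptions made for illustration, not part of the original demo: "brown" is needed for brown.words(), and "stopwords" is included because, as far as I know, Text.collocations() filters out stop words.

```python
# Data packages this demo appears to need (an assumption based on the calls it makes):
# "brown" for brown.words(), "stopwords" for Text.collocations().
REQUIRED_CORPORA = ("brown", "stopwords")


def download_demo_data(names=REQUIRED_CORPORA):
    """Fetch each named NLTK data package; return {name: success flag}."""
    import nltk  # imported here so this file loads even before NLTK is installed

    # nltk.download returns True on success and False otherwise.
    return {name: nltk.download(name, quiet=True) for name in names}


# Run once, with network access:
# print(download_demo_data())
```

Packages that are already installed are not downloaded again, so it should be safe to call this more than once.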

import nltk
from nltk.corpus import brown


# quick look at typical word pairs, using NLTK's built-in collocations() method
brown_nltk = nltk.Text(brown.words())
brown_nltk.collocations()

# bigrams in the Brown corpus:
# fd is a data structure that tabulates frequencies of strings.
# in this case, frequencies of word bigrams from Brown
brown_bigrams = [a + " " + b for a, b in nltk.bigrams(brown.words())]
fd = nltk.FreqDist(brown_bigrams)

# frequent stuff
fd.tabulate(10)

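# infrequent stuff: hapaxes are bigrams that occur exactly once
# (there are a lot of them, so this prints a long list)
for h in fd.hapaxes():
    print(h)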

# P(word2 | word1) = frequency of "word1 word2" / frequency of "word1 <any word>"
# That is: out of all the times we have seen a bigram starting with word1,
# in what fraction of them was the second word word2?

# cfd is a data structure that tabulates the frequencies of pairs:
# In our case, it maps words word1 to words word2 that appeared after them,
# and records how often each word2 was seen to follow word1
cfd = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

# this is a data structure that tabulates frequencies of words that followed "The".
# Note that the words word2 are ordered by frequency:
cfd["The"]
# overall, we have seen "The" 7258 times
cfd["The"].N()
# ... and we have seen "The first" 96 times.
cfd["The"]["first"]
# The probability P(first | The) is 96 / 7258
cfd["The"]["first"] / cfd["The"].N()

# Let's type a text by starting at "The" and then
# always using the most frequent word that could follow.
# You may have done this on your phone.
# But your phone is certainly not trained on the Brown corpus.
cfd["The"].max()
#...

# or, for short, like this:
word = "The"
for i in range(20):
    print(word)
    word = cfd[word].max()

# whoops, this got us a never-ending sentence
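
The loop above never ends on its own because always picking the most frequent next word is deterministic: once a word comes up a second time, the same sequence repeats forever. Below is a minimal sketch of one way around this, namely stopping as soon as a word repeats. It uses plain Python counters on a toy sentence instead of NLTK and the Brown corpus, so it runs without any downloads. The names build_cfd, conditional_prob and greedy_text are made up for this illustration, and stopping on the first repeated word is a swapped-in idea, not something the demo above does.

```python
from collections import Counter, defaultdict


def build_cfd(words):
    """Map each word to a Counter of the words that directly follow it."""
    cfd = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        cfd[first][second] += 1
    return cfd


def conditional_prob(cfd, word1, word2):
    """P(word2 | word1) = count("word1 word2") / count("word1 <any word>")."""
    total = sum(cfd[word1].values())
    return cfd[word1][word2] / total if total else 0.0


def greedy_text(cfd, start, max_words=20):
    """Always follow the most frequent next word; stop when a word repeats."""
    output, seen = [], set()
    word = start
    while word not in seen and len(output) < max_words:
        output.append(word)
        seen.add(word)
        if not cfd[word]:  # nothing was ever seen after this word
            break
        word = cfd[word].most_common(1)[0][0]
    return output


# Toy corpus (an assumption for illustration, not the Brown corpus).
toy_words = "the cat sat on the mat and the cat sat on the rug".split()
toy_cfd = build_cfd(toy_words)

print(conditional_prob(toy_cfd, "the", "cat"))  # 2 of the 4 bigrams starting with "the"
print(" ".join(greedy_text(toy_cfd, "the")))
```

In the toy sentence, "the" is followed by "cat" twice, "mat" once and "rug" once, so P(cat | the) is 2 / 4. The generated text stops after "the cat sat on", because the most frequent word after "on" is "the" again. Another option, not shown here, is to pick the next word at random in proportion to its frequency instead of always taking the most frequent one.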


