Language and Computers n-gram demo

Here is a small Python program that demonstrates word n-grams and conditional probabilities.

To run it, you need Python 3 and the Natural Language Toolkit (NLTK). Make sure to download the NLTK data as well.
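One possible way to fetch the data from within Python is sketched below. The helper name `download_demo_data` and the constant `REQUIRED_PACKAGES` are made up for this illustration. The package list is an assumption about what this demo needs: `brown` for the corpus itself and `stopwords`, which NLTK's collocations function appears to rely on for filtering.

```python
# Data packages this demo is assumed to need:
# "brown" is the corpus itself; "stopwords" is used by Text.collocations().
REQUIRED_PACKAGES = ("brown", "stopwords")


def download_demo_data():
    """Fetch the NLTK data packages listed above (needs network access)."""
    import nltk  # imported here so that only calling the function requires NLTK

    for package in REQUIRED_PACKAGES:
        nltk.download(package)
```

Calling `download_demo_data()` once should be enough; NLTK keeps the downloaded data on disk for later runs.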

import nltk

from nltk.corpus import brown

# quick look at typical word pairs, using NLTK's built-in collocations function

brown_nltk = nltk.Text(brown.words())

brown_nltk.collocations()

# bigrams in the Brown corpus:

# fd is a data structure that tabulates frequencies of strings.

# in this case, frequencies of word bigrams from Brown

brown_bigrams = [a + " " + b for a, b in nltk.bigrams(brown.words())]

fd = nltk.FreqDist(brown_bigrams)

# the 10 most frequent bigrams

fd.tabulate(10)

# hapaxes: bigrams that occur only once

for h in fd.hapaxes(): print(h)

# P(word2 | word1) = frequency of "word1 word2" / frequency of "word1 <any word>"

# That is: out of all the times we have seen a bigram starting with word1,

# in what proportion of cases was the second word word2?

# cfd is a data structure that tabulates the frequencies of pairs:

# In our case, it maps each word word1 to the words word2 that appeared after it,

# and records how often each word2 was seen to follow word1

cfd = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

# this is a data structure that tabulates frequencies of words that followed "The".

# Note that the words word2 are ordered by frequency:

cfd["The"]

# overall, we have seen "The" 7258 times

cfd["The"].N()

# ... and we have seen "The first" 96 times.

cfd["The"]["first"]

# The probability P(first | The) is 96 / 7258

cfd["The"]["first"] / cfd["The"].N()

# Let's type a text by starting at "The" and then

# always using the most frequent word that could follow.

# You may have done this on your phone.

# But your phone is certainly not trained on the Brown corpus.

cfd["The"].max()

#...

# or, for short, like this:

word = "The"

for i in range(20):

    print(word)

    word = cfd[word].max()

# whoops, this got us a never-ending sentence
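The never-ending sentence arises because always choosing the most frequent successor can lead back to a word we have already visited, after which the same choices repeat. The sketch below reproduces the effect on a tiny made-up corpus using only the standard library, and contrasts it with sampling the next word in proportion to its bigram count. The toy sentence and the names `successors`, `greedy`, and `sampled` are assumptions for illustration; they are not part of the script above or of NLTK.

```python
import random
from collections import Counter, defaultdict

# Toy corpus (made up for this sketch; the script above uses Brown).
tokens = "the cat sat on the mat and the cat ate the fish".split()

# successors[w1][w2] = how often w2 followed w1, analogous to cfd[w1][w2].
successors = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    successors[w1][w2] += 1


def greedy(start, steps):
    """Always pick the most frequent successor, as in the loop above."""
    out = [start]
    for _ in range(steps):
        followers = successors.get(out[-1])
        if not followers:  # the word was never followed by anything
            break
        out.append(followers.most_common(1)[0][0])
    return out


def sampled(start, steps, rng):
    """Pick each successor at random, weighted by its bigram count."""
    out = [start]
    for _ in range(steps):
        followers = successors.get(out[-1])
        if not followers:
            break
        words = list(followers)
        counts = list(followers.values())
        out.append(rng.choices(words, weights=counts)[0])
    return out


# Greedy generation falls into the cycle "the cat sat on":
print(" ".join(greedy("the", 8)))  # the cat sat on the cat sat on the

# Weighted sampling can leave the cycle; the result depends on the seed.
print(" ".join(sampled("the", 8, random.Random(0))))
```

Sampling still follows the conditional probabilities P(word2 | word1), but it no longer commits to the single most likely word at every step, so it is not forced to repeat itself.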