Here is a small Python program that demonstrates word n-grams and conditional probabilities.
To run it, you need Python 3 and the Natural Language Toolkit (NLTK). Make sure you have also downloaded the NLTK data (for example via nltk.download('brown')).
# NOTE(review): `import nltk` was missing — the script uses nltk.Text,
# nltk.bigrams, nltk.FreqDist, and nltk.ConditionalFreqDist below, but only
# the Brown corpus itself was imported, causing a NameError at runtime.
import nltk
from nltk.corpus import brown

# Quick look at typical word pairs: wrapping the corpus words in an
# nltk.Text gives access to the pre-built collocations() utility.
brown_nltk = nltk.Text(brown.words())
# Bigrams in the Brown corpus.
# fd is a frequency distribution over strings — here, over the
# space-joined word bigrams of Brown ("word1 word2").
brown_bigrams = [" ".join(pair) for pair in nltk.bigrams(brown.words())]
fd = nltk.FreqDist(brown_bigrams)
# Frequent bigrams could be inspected with fd.most_common().
# Infrequent stuff: hapaxes are the bigrams that occur exactly once.
for hapax in fd.hapaxes():
    print(hapax)
# P(word2 | word1) = frequency of "word1 word2" / frequency of all bigrams
# that start with word1 — that is:
# out of all times we have seen bigrams starting in word1,
# what percentage was word2?
# cfd is a data structure that tabulates the frequencies of pairs:
# in our case, it maps each word1 to the words word2 that appeared after it,
# and records how often each word2 was seen to follow word1.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))
# cfd["The"] tabulates frequencies of the words that followed "The";
# the words word2 are ordered by frequency.
# Overall, we have seen "The" 7258 times, and "The first" 96 times,
# so the probability P(first | The) is 96 / 7258.
# Print the value so it is visible when run as a script (a bare
# expression only displays its result in an interactive session):
print(cfd["The"]["first"] / cfd["The"].N())
# Let's "type" a text by starting at "The" and then
# always using the most frequent word that could follow.
# You may have done this on your phone —
# but your phone is certainly not trained on the Brown corpus.
word = "The"
for _ in range(20):
    word = cfd[word].max()
    # Print each generated word; the original loop discarded them,
    # so the demonstration produced no visible output.
    print(word, end=" ")
print()
# Whoops — this got us a never-ending "sentence": the greedy choice
# quickly falls into a cycle (a word's most frequent follower leads
# back to a word we have already visited).