Python code for February 28

Sorting in Python:

sorted() sorts lists.

>>> x = [4,2,5,1]
>>> sorted(x)
[1, 2, 4, 5]

Note that it produces a new, sorted list. The original stays as is.

>>> x
[4, 2, 5, 1]

We can also sort from large to small:
>>> sorted(x, reverse = True)
[5, 4, 2, 1]
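
For contrast, lists also have a method sort(), which behaves the other way around: it reorders the list itself and returns None. The following is only a small sketch; the variable names y, z and result are made up for illustration.

```python
x = [4, 2, 5, 1]

# sorted() returns a new list and leaves x untouched
y = sorted(x)

# list.sort() reorders the list itself and returns None
z = list(x)          # work on a copy so that x stays as it was
result = z.sort()

print(x)       # the original order
print(y)       # a new, sorted list
print(z)       # the copy, now sorted in place
print(result)  # None
```

A common slip is to write x = x.sort(), which leaves x set to None.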

When we have a dictionary of counts, we often want to output the words sorted by their counts. This is not so straightforward. The first problem is that sorted(mydict) by default works just on the keys. We can solve this problem by sorting mydict.items(), the list of (key, value) pairs. But the pairs are still sorted alphabetically by their keys.
>>> mydict = { "the" : 10000, "a": 11000, "albatross":2}
>>> sorted(mydict)
['a', 'albatross', 'the']
>>> sorted(mydict.items())
[('a', 11000), ('albatross', 2), ('the', 10000)]
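
The pairs come out in key order because Python compares tuples element by element, starting with the first. Here is a short sketch with made-up pairs (the second "a" entry is there only to show what happens on a tie):

```python
pairs = [("the", 10000), ("a", 11000), ("albatross", 2), ("a", 5)]

# tuples are compared on their first element;
# the second element only matters when the first elements are equal
ordered = sorted(pairs)
print(ordered)
```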

What we need to do is to tell sorted(): For each pair (key, value), map that to the value, and sort on that. To do so, we need to define a function that maps (key, value) pairs to the value:

>>> def pair_to_2nd(pair): return pair[1]

This function returns the 2nd item of any pair it is given. Now we hand this to sorted(). "key = pair_to_2nd" means that the key by which we sort is determined by pair_to_2nd().

>>> sorted(mydict.items(), key = pair_to_2nd)
[('albatross', 2), ('the', 10000), ('a', 11000)]
>>> sorted(mydict.items(), key = pair_to_2nd, reverse = True)
[('a', 11000), ('the', 10000), ('albatross', 2)]

Refinement, not strictly necessary: We can simplify this code some more by using a "nameless function". Instead of defining pair_to_2nd separately, we can define it right in the call to sorted().
"lambda pair: pair[1]" is Pythonese for a function that takes an argument "pair" and returns "pair[1]". Since we are defining the function right where we use it, we don't need to give it a name. "lambda" is a reserved word in Python.

>>> sorted(mydict.items(), key = lambda pair: pair[1], reverse = True)
[('a', 11000), ('the', 10000), ('albatross', 2)]
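
Two further variations on the key function, as a sketch rather than the one right way to do it. operator.itemgetter(1) from the standard library builds a function that does the same job as pair_to_2nd. A key function may also return a tuple, which lets us break ties. The extra entry "zebra" is made up so that two words share a count.

```python
from operator import itemgetter

mydict = {"the": 10000, "a": 11000, "albatross": 2, "zebra": 2}

# itemgetter(1) builds a function that returns item 1 of its argument,
# just like pair_to_2nd or the lambda
by_count = sorted(mydict.items(), key=itemgetter(1), reverse=True)

# a key function may also return a tuple: sort by descending count
# (hence the minus sign), and alphabetically among words with the same count
by_count_then_word = sorted(mydict.items(), key=lambda pair: (-pair[1], pair[0]))

print(by_count_then_word)
```

The tuple key is handy for word counts, where many words occur exactly once and would otherwise appear in no particular order.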

Input to Python

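One method of reading user input: using input() or raw_input(). In Python 2, raw_input() returns what the user typed as a string, while input() evaluates it as a Python expression.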
Try this:
mystring = raw_input()

Now type anything, followed by "return". Now inspect mystring.

We can use this as follows:
print "Please input a filename"
filename = raw_input()
f = open(filename)
text = f.read()
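
A variant of the same idea, sketched as a function: the with statement closes the file automatically once its block is done. The function name read_text and the temporary demo file are made up for illustration.

```python
import os
import tempfile

def read_text(filename):
    """Return the whole contents of the file as one string."""
    with open(filename) as f:
        return f.read()

# demo: write a small temporary file, then read it back
handle, demo_filename = tempfile.mkstemp(suffix=".txt")
os.close(handle)
with open(demo_filename, "w") as outf:
    outf.write("To be, or not to be")

text = read_text(demo_filename)
os.remove(demo_filename)
print(text)
```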

When you call Python from the command line, you can also add information that you want to pass on to Python. Please save the following text as a file:
import sys

print "1st member of sys.argv:", sys.argv[0]
print "2nd member of sys.argv:", sys.argv[1]

Then call it from the command line, giving the name of the file you just saved followed by an additional word:
python <filename> 123

Also try some other word after the "<filename>".

The list sys.argv contains the words that you type on the command line when you call Python:
sys.argv[0] is the name of the program.
sys.argv[1] and higher are additional words that the user has typed. It is common usage to put information for your program to use in these command line parameters.
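
Since sys.argv[1] only exists if the user actually typed an extra word, it is a good idea to check the length of the list first. Below is a minimal sketch; the function name get_filenames and the file names are made up, and the function takes the list as an argument so that we can try it out without the command line.

```python
def get_filenames(argv):
    """Return (infilename, outfilename) from a list shaped like sys.argv,
    or None if the user gave too few words."""
    if len(argv) < 3:
        return None
    return argv[1], argv[2]

# In a real script we would call get_filenames(sys.argv).
# Here we pass hand-made lists to show both cases.
print(get_filenames(["count.py"]))
print(get_filenames(["count.py", "macbeth.txt", "counts.txt"]))
```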

Counting words

Here is a word-counting script. Please save the following as a file ending in .py, and call it from the command line with 2 additional words:
  • the name of an input file, for example Macbeth from Project Gutenberg.
  • the name of a new, nonexisting file. Python will write the results there.
# count words in a given corpus, output to given output file
import sys
import string

infilename = sys.argv[1]
outfilename = sys.argv[2]

f = open(infilename)
wordcount = { }

for line in f:
    words = [ w.strip(string.punctuation).lower() for w in line.split()]
    for w in words:
        if w != "":
            if w in wordcount:
                wordcount[w] = wordcount[w] + 1
            else:
                wordcount[w] = 1


outf = open(outfilename, "w")
sorted_wordcount = sorted(wordcount.items(), key = lambda pair:pair[1], reverse=True)
for word, count in sorted_wordcount:
    print >> outf, word, count
# close the output file so that everything is written to disk
outf.close()

Now inspect the output file.
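
The same counting can be done with collections.Counter from the standard library, a dictionary that starts every count at 0. This is only a sketch of an alternative, not a replacement for the script above: the function name count_words is made up, and the two lines of text stand in for a real input file.

```python
import string
from collections import Counter

def count_words(lines):
    """Count lowercased words, with punctuation stripped from both ends."""
    wordcount = Counter()
    for line in lines:
        for w in line.split():
            w = w.strip(string.punctuation).lower()
            if w != "":
                wordcount[w] += 1
    return wordcount

lines = ["Tomorrow, and tomorrow, and tomorrow,", "creeps in this petty pace"]
counts = count_words(lines)

# most_common() returns (word, count) pairs, highest count first
for word, count in counts.most_common(3):
    print("%s %d" % (word, count))
```

Because a Counter supplies the missing 0 itself, the "if w in wordcount" test is no longer needed.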

Counting Bigrams: Version 1

Again, please copy the following code into a text file. Call it from the command line, using the name of a file with text as an argument.

# Katrin Erk Oct 07
# Updated Feb 11
# Word bigrams are just pairs of words.
# In the sentence "I went to the beach"
# the bigrams are:
#    I went
#    went to
#    to the
#    the beach
# Having counts of English bigrams from a very large text corpus
# can be useful for a number of purposes.
# for example for spelling correction:
# If I had mistyped the sentence as "I went to beach"
# then I might be able to find the error by seeing that
# the bigram "to beach" has a very low count, and
# "to the", "to a", and "the beach" have much larger counts.
# This program counts all word bigrams in a given text file
# usage:
# python <scriptname> <filename>
# <scriptname> is the name under which you saved this program,
# <filename> is a text file.

import string
import sys

# complain if we didn't get a filename
# as a command line argument
if len(sys.argv) < 2:
    print "Please enter the name of a corpus file as a command line argument."
    sys.exit()

# try opening file
# If the file doesn't exist, catch the error
try:
    f = open(sys.argv[1])
except IOError:
    print "Sorry, I could not find the file", sys.argv[1]
    print "Please try again."
    sys.exit()

# read the contents of the whole file into ''filecontents''
filecontents = f.read()
# count bigrams
bigrams = {}
words_punct = filecontents.split()
# strip all punctuation at the beginning and end of words, and
# convert all words to lowercase
words = [ w.strip(string.punctuation).lower() for w in words_punct ]

# add special START, END tokens
words = ["START"] + words + ["END"]

for index, word in enumerate(words):
    if index < len(words) - 1:
        # we only look at indices up to the
        # next-to-last word, as this is
        # the last one at which a bigram starts
        w1 = words[index]
        w2 = words[index + 1]
        # bigram is a tuple,
        # like a list, but fixed.
        # Tuples can be keys in a dictionary
        bigram = (w1, w2)

        if bigram in bigrams:
            bigrams[ bigram ] = bigrams[ bigram ] + 1
        else:
            bigrams[ bigram ] = 1

# sort bigrams by their counts
sorted_bigrams = sorted(bigrams.items(), key = lambda pair:pair[1], reverse = True)

for bigram, count in sorted_bigrams:
    print bigram, ":", count
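
The index bookkeeping in the loop above can also be avoided. Here is a sketch of one alternative, assuming the words are already in a list: zip() pairs the list with a copy of itself shifted by one position, which yields exactly the bigrams. The example sentence is the one from the comments above.

```python
words = ["START"] + "i went to the beach".split() + ["END"]

# zip pairs each word with its successor:
# words[1:] is the same list shifted by one position
bigrams = {}
for bigram in zip(words, words[1:]):
    # get() returns the current count, or 0 for an unseen bigram
    bigrams[bigram] = bigrams.get(bigram, 0) + 1

for bigram, count in sorted(bigrams.items()):
    print("%s : %d" % (bigram, count))
```

zip() stops at the end of the shorter list, so the last word never starts a bigram and no length check is needed.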

Counting Bigrams: Version 2

The Natural Language Toolkit has data types and functions that make life easier for us when we want to count bigrams and compute their probabilities.

import nltk

from nltk.corpus import brown

# an nltk.FreqDist() is like a dictionary,
# but it is ordered by frequency.
# Also, nltk automatically fills the dictionary
# with counts when given a list of words.
freq_brown = nltk.FreqDist(brown.words())


# an nltk.ConditionalFreqDist() counts frequencies of pairs.
# When given a list of bigrams, it maps each first word of a bigram
# to a FreqDist over the second words of the bigram.
cfreq_brown_2gram = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

# conditions() in a ConditionalFreqDist are like keys()
# in a dictionary

# the cfreq_brown_2gram entry for "my" is a FreqDist.

# here are the words that can follow after "my".
# We first access the FreqDist associated with "my",
# then the keys in that FreqDist

# here are the 20 most frequent words to come after "my", with their frequencies

# an nltk.ConditionalProbDist() maps pairs to probabilities.
# One way in which we can do this is by using Maximum Likelihood Estimation (MLE)
cprob_brown_2gram = nltk.ConditionalProbDist(cfreq_brown_2gram, nltk.MLEProbDist)

# This again has conditions() which are like dictionary keys

# Here is what we find for "my": a Maximum Likelihood Estimation-based probability distribution,
# as a MLEProbDist object.

# We can find the words that can come after "my" by using the function samples()

# Here is the probability of a particular pair:

# and we can draw a random word to follow "my"
# based on the probabilities of the bigrams

# We can use this to generate text at random
# based on a given text of bigrams.
# Let's do this for the Sam "corpus"
corpus = """<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>"""

words = corpus.split()
cfreq_sam = nltk.ConditionalFreqDist(nltk.bigrams(words))
cprob_sam = nltk.ConditionalProbDist(cfreq_sam, nltk.MLEProbDist)

word = "<s>"
for index in range(50):
    word = cprob_sam[ word].generate()
    print word,

# Not a lot of variety. We need a bigger corpus.
cfreq_scifi = nltk.ConditionalFreqDist(nltk.bigrams(brown.words(categories = "science_fiction")))
cprob_scifi = nltk.ConditionalProbDist(cfreq_scifi, nltk.MLEProbDist)

word = "in"
for index in range(50):
    word = cprob_scifi[ word ].generate()
    print word,

# try this with other Brown corpus categories.

# For the objects, there is a generate() function.
from import *

# Do you think they used bigrams like we did earlier, or some larger n-grams?
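
To see what the conditional frequency and probability distributions of version 2 contain, here is a plain-Python sketch of the same idea on the Sam "corpus". It does not use NLTK and is not how NLTK is implemented; the names cfreq, mle_prob and generate are made up for illustration, and generate() assumes that the word it is given has been seen with at least one successor.

```python
import random
from collections import Counter, defaultdict

corpus = """<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>"""
words = corpus.split()

# conditional counts: first word -> Counter over the words that follow it
cfreq = defaultdict(Counter)
for w1, w2 in zip(words, words[1:]):
    cfreq[w1][w2] += 1

def mle_prob(w1, w2):
    """Maximum likelihood estimate of P(w2 | w1):
    count of the bigram (w1, w2) divided by the count of all bigrams starting with w1."""
    followers = cfreq.get(w1)
    if not followers:
        return 0.0
    return followers[w2] / float(sum(followers.values()))

def generate(w1, rng):
    """Draw a successor of w1 at random, in proportion to the bigram counts."""
    r = rng.random() * sum(cfreq[w1].values())
    for w2, count in cfreq[w1].items():
        r -= count
        if r < 0:
            return w2
    return w2  # guards against rounding at the upper end

print(sorted(cfreq["I"].items()))
print(mle_prob("I", "am"))

rng = random.Random(0)  # fixed seed, so that runs are repeatable
word = "<s>"
generated = []
for index in range(10):
    word = generate(word, rng)
    generated.append(word)
print(" ".join(generated))
```

With Maximum Likelihood Estimation, a bigram that never occurs in the corpus gets probability 0, so the generator can only ever produce word pairs it has already seen.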