### Python code for February 28

Sorting in Python: `sorted()` sorts lists.

```python
>>> x = [4,2,5,1]
>>> sorted(x)
[1, 2, 4, 5]
```

Note that it produces a new, sorted list. The original stays as is.

```python
>>> x
[4, 2, 5, 1]
```

We can also sort from large to small:

```python
>>> sorted(x, reverse = True)
[5, 4, 2, 1]
```

When we have a dictionary of counts, we often want to output words sorted by their counts. This is not so straightforward. The first problem is that `sorted(mydict)` by default works just on the keys. We can solve this problem by sorting the `items()`. But they are still sorted alphabetically by their keys.

```python
>>> mydict = { "the" : 10000, "a": 11000, "albatross": 2}
>>> sorted(mydict)
['a', 'albatross', 'the']
>>> sorted(mydict.items())
[('a', 11000), ('albatross', 2), ('the', 10000)]
```

What we need to do is to tell `sorted()`: For each pair (key, value), map that to the value, and sort on that. To do so, we need to define a function that maps (key, value) pairs to the value:

```python
>>> def pair_to_2nd(pair): return pair[1]
...
```

This function returns the 2nd item of any pair it is given. Now we hand this to `sorted()`. `key = pair_to_2nd` means that the key by which we sort is determined by `pair_to_2nd()`.

```python
>>> sorted(mydict.items(), key = pair_to_2nd)
[('albatross', 2), ('the', 10000), ('a', 11000)]
>>> sorted(mydict.items(), key = pair_to_2nd, reverse = True)
[('a', 11000), ('the', 10000), ('albatross', 2)]
```

Refinement, not strictly necessary: We can simplify this code some more by using a "nameless function". Instead of defining `pair_to_2nd` separately, we can define it right in the call to `sorted()`. `lambda pair: pair[1]` is Pythonese for a function with argument `pair` that is mapped to `pair[1]`. Since we are defining the function right where we use it, we don't need to give it a name.
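As an aside, not needed for the exercises: the standard library also has ready-made tools for sorting by value. `operator.itemgetter(1)` does the same job as our `pair_to_2nd`, and `collections.Counter` has a `most_common()` method that returns (word, count) pairs already sorted from most to least frequent. A short sketch:

```python
from collections import Counter
from operator import itemgetter

mydict = {"the": 10000, "a": 11000, "albatross": 2}

# itemgetter(1) is a ready-made "give me the 2nd element" function,
# so this is the same sort as with pair_to_2nd
by_count = sorted(mydict.items(), key=itemgetter(1), reverse=True)

# Counter.most_common() does the whole sort-by-value-descending in one step
most_common = Counter(mydict).most_common()
```

Both give `[('a', 11000), ('the', 10000), ('albatross', 2)]`.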
`lambda` is a reserved word in Python.

```python
>>> sorted(mydict.items(), key = lambda pair: pair[1], reverse = True)
[('a', 11000), ('the', 10000), ('albatross', 2)]
```

#### Input to Python

One method of reading user input: using `input()` or `raw_input()`. Try this:

```python
mystring = raw_input()
```

Now type anything, followed by "return". Then inspect `mystring`. We can use this as follows:

```python
print "Please input a filename"
filename = raw_input()
f = open(filename)
text = f.read()
f.close()
```

When you call Python from the command line, you can also add information that you want to pass on to Python. Please save the following text as a file commandline.py:

```python
import sys
print "1st member of sys.argv:", sys.argv[0]
print "2nd member of sys.argv:", sys.argv[1]
```

Then call it from the command line using

```
python commandline.py 123
```

Also try some other word after the "commandline.py". The list `sys.argv` contains the words that you type on the command line when you call Python:

- `sys.argv[0]` is the name of the program.
- `sys.argv[1]` and higher are additional words that the user has typed.

It is common usage to put information for your program to use in these command line parameters.

#### Counting words

Here is a word-counting script. Please save the following as a file ending in .py, and call it from the command line with 2 additional words in the command line:

- the name of an input file, for example Macbeth from Project Gutenberg,
- the name of a new, nonexisting file. Python will write the results there.
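To see what a script will receive in `sys.argv` without running it from the command line, we can build the list by hand. Here `fake_argv` is just an illustration that stands in for the real `sys.argv`:

```python
# fake_argv stands in for what sys.argv would contain if we ran:
#    python commandline.py 123 extra
fake_argv = ["commandline.py", "123", "extra"]

program_name = fake_argv[0]   # always the program's own name
first_word = fake_argv[1]     # "123" -- note: a string, not a number
# command line words always arrive as strings, so convert if needed:
first_number = int(first_word)
```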
```python
# count words in a given corpus, output to given output file
import sys
import string

infilename = sys.argv[1]
outfilename = sys.argv[2]

f = open(infilename)
wordcount = { }
for line in f:
    words = [ w.strip(string.punctuation).lower() for w in line.split()]
    for w in words:
        if w != "":
            if w in wordcount:
                wordcount[w] = wordcount[w] + 1
            else:
                wordcount[w] = 1
f.close()

outf = open(outfilename, "w")
sorted_wordcount = sorted(wordcount.items(), key = lambda pair: pair[1], reverse = True)
for word, count in sorted_wordcount:
    print >> outf, word, count
outf.close()
```

Now inspect the output file.

#### Counting bigrams: Version 1

Again, please copy the following code into a text file. Call it from the command line, using the name of a file with text as an argument.

```python
# Katrin Erk Oct 07
# Updated Feb 11
#
# Word bigrams are just pairs of words.
# In the sentence "I went to the beach"
# the bigrams are:
#    I went
#    went to
#    to the
#    the beach
#
# Having counts of English bigrams from a very large text corpus
# can be useful for a number of purposes,
#
# for example for spelling correction:
# If I had mistyped the sentence as "I went to beach"
# then I might be able to find the error by seeing that
# the bigram "to beach" has a very low count, and
# "to the", "to a", and "the beach" have much larger counts.
#
# This program counts all word bigrams in a given text file.
#
# usage:
# python count_bigrams.py <corpus file>
#
# <corpus file> is a text file.

import string
import sys

# complain if we didn't get a filename
# as a command line argument
if len(sys.argv) < 2:
    print "Please enter the name of a corpus file as a command line argument."
    sys.exit()

# try opening the file.
# If the file doesn't exist, catch the error
try:
    f = open(sys.argv[1])
except IOError:
    print "Sorry, I could not find the file", sys.argv[1]
    print "Please try again."
    sys.exit()

# read the contents of the whole file into ''filecontents''
filecontents = f.read()

# count bigrams
bigrams = {}
words_punct = filecontents.split()
# strip all punctuation at the beginning and end of words, and
# convert all words to lowercase
words = [ w.strip(string.punctuation).lower() for w in words_punct ]
# add special START, END tokens
words = ["START"] + words + ["END"]
for index, word in enumerate(words):
    if index < len(words) - 1:
        # we only look at indices up to the
        # next-to-last word, as this is
        # the last one at which a bigram starts
        w1 = words[index]
        w2 = words[index + 1]
        # a bigram is a tuple,
        # like a list, but fixed.
        # Tuples can be keys in a dictionary
        bigram = (w1, w2)
        if bigram in bigrams:
            bigrams[ bigram ] = bigrams[ bigram ] + 1
        else:
            bigrams[ bigram ] = 1
# sort bigrams by their counts
sorted_bigrams = sorted(bigrams.items(), key = lambda pair: pair[1], reverse = True)
for bigram, count in sorted_bigrams:
    print bigram, ":", count
```

#### Counting bigrams, version 2

The Natural Language Toolkit has data types and functions that make life easier for us when we want to count bigrams and compute their probabilities.

```python
import nltk
from nltk.corpus import brown

# an nltk.FreqDist() is like a dictionary,
# but it is ordered by frequency.
# Also, nltk automatically fills the dictionary
# with counts when given a list of words.
freq_brown = nltk.FreqDist(brown.words())
freq_brown.keys()
freq_brown.items()[:20]

# an nltk.ConditionalFreqDist() counts frequencies of pairs.
# When given a list of bigrams, it maps each first word of a bigram
# to a FreqDist over the second words of the bigram.
cfreq_brown_2gram = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

# conditions() in a ConditionalFreqDist are like keys()
# in a dictionary
cfreq_brown_2gram.conditions()

# the cfreq_brown_2gram entry for "my" is a FreqDist.
cfreq_brown_2gram["my"]

# here are the words that can follow after "my".
# We first access the FreqDist associated with "my",
# then the keys in that FreqDist
cfreq_brown_2gram["my"].keys()

# here are the 20 most frequent words to come after "my", with their frequencies
cfreq_brown_2gram["my"].items()[:20]

# an nltk.ConditionalProbDist() maps pairs to probabilities.
# One way in which we can do this is by using Maximum Likelihood Estimation (MLE)
cprob_brown_2gram = nltk.ConditionalProbDist(cfreq_brown_2gram, nltk.MLEProbDist)

# This again has conditions() which are like dictionary keys
cprob_brown_2gram.conditions()

# Here is what we find for "my": a Maximum Likelihood Estimation-based
# probability distribution, as an MLEProbDist object.
cprob_brown_2gram["my"]

# We can find the words that can come after "my" by using the function samples()
cprob_brown_2gram["my"].samples()

# Here is the probability of a particular pair:
cprob_brown_2gram["my"].prob("own")

# and we can draw a random word to follow "my"
# based on the probabilities of the bigrams
cprob_brown_2gram["my"].generate()

# We can use this to generate text at random
# based on the bigrams of a given text.
# Let's do this for the Sam "corpus"
corpus = """ I am Sam
Sam I am
I do not like green eggs and ham """
words = corpus.split()
cfreq_sam = nltk.ConditionalFreqDist(nltk.bigrams(words))
cprob_sam = nltk.ConditionalProbDist(cfreq_sam, nltk.MLEProbDist)
word = "I"
for index in range(50):
    word = cprob_sam[ word ].generate()
    print word,
print

# Not a lot of variety. We need a bigger corpus.
brown.categories()
cfreq_scifi = nltk.ConditionalFreqDist(nltk.bigrams(brown.words(categories = "science_fiction")))
cprob_scifi = nltk.ConditionalProbDist(cfreq_scifi, nltk.MLEProbDist)
word = "in"
for index in range(50):
    word = cprob_scifi[ word ].generate()
    print word,
print

# try this with other Brown corpus categories.
# For the nltk.book objects, there is a generate() function.
from nltk.book import *
text6.generate()
text7.generate()
text2.generate()
# Do you think they used bigrams like we did earlier, or some larger n-grams?
```
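To see what `ConditionalFreqDist` and `MLEProbDist` are doing under the hood, here is a plain-Python sketch (no nltk needed) that counts bigrams from the Sam corpus and computes the Maximum Likelihood probabilities by hand. The names `cfreq` and `mle_prob` are just illustrative, not nltk names:

```python
from collections import defaultdict

corpus = """ I am Sam
Sam I am
I do not like green eggs and ham """
words = corpus.split()

# like ConditionalFreqDist: map the first word of each bigram
# to counts of the words that follow it
cfreq = defaultdict(lambda: defaultdict(int))
for w1, w2 in zip(words, words[1:]):
    cfreq[w1][w2] += 1

# like MLEProbDist: P(w2 | w1) = count(w1, w2) / count of w1 followed by anything
def mle_prob(w1, w2):
    total = sum(cfreq[w1].values())
    return cfreq[w1][w2] / float(total) if total else 0.0
```

For example, "I" starts three bigrams in this corpus (twice "I am", once "I do"), so `mle_prob("I", "am")` is 2/3, and `mle_prob("green", "eggs")` is 1.0 because "green" is always followed by "eggs".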