
Python code snippets

Converting an Excel document to XML format

This code assumes that you have an Excel sheet with three columns: the first column holds a sentence, the second the sentence length, and the third the main verb of the sentence. (Not that this makes much sense; it is just meant as an example.)

Here's an example of what this might look like:
 This is a sentence    4    be
 It is raining.        3    rain

We assume that we have written the excel sheet out to "example.csv" in CSV (comma-separated values) format.

import csv

# reading the CSV file
f = open("/Users/katrinerk/Desktop/example.csv")
reader = csv.reader(f)

# writing to a text file
outf = open("/Users/katrinerk/Desktop/example.xml", "w")

# write XML header
print >> outf, '<?xml version="1.0" encoding="ASCII" ?>'

# The structure of the XML document is going to be:
# <examplefile>
#   <row>
#      <sentence>...</sentence><sentlen>...</sentlen><verb>...</verb>
#  </row>
# .....
# </examplefile>
print >> outf, '<examplefile>'

for row in reader:
     print >> outf, '<row>'

     # row is a list with three elements, one per column. We assign its members
     # to three different variables.
     sentence, sentlen, verb = row

     print >> outf, '<sentence>' + sentence + '</sentence>'
     print >> outf, '<sentlen>' + sentlen + '</sentlen>'
     print >> outf, '<verb>' + verb + '</verb>'
     print >> outf, '</row>'

# processed all rows. End the file now
print >> outf, "</examplefile>"
outf.close()
f.close()
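One caveat: if a sentence contains characters that are special in XML, such as & or <, writing it out verbatim produces an invalid file. The standard library function xml.sax.saxutils.escape replaces them with entities; a minimal sketch (the sample sentence is invented):

```python
# Escaping XML special characters before writing them into an element.
# escape() replaces &, <, and > with character entities.
from xml.sax.saxutils import escape

sentence = "Tom & Jerry run < fast"
escaped = escape(sentence)
element = '<sentence>' + escaped + '</sentence>'
# element is '<sentence>Tom &amp; Jerry run &lt; fast</sentence>'
```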

Working with command line arguments

This is for when you write a longer Python script that you call from the command line. When you put further words on the command line after the name of your program, these words are recorded and passed on to Python in the list sys.argv. Here is an example use:

import sys

print "sys.argv has this many members:", len(sys.argv)

for index in range(len(sys.argv)):
    print index, sys.argv[index]

Put this code in a file, for example example.py, and call it from the command line. Here are 3 sample uses.

python example.py
python example.py abc blah blupp
python example.py /Users/katrinerk/myfile.txt
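For the second call above, sys.argv contains four strings: the script name followed by the three extra words. A sketch of the same indexing, using a hand-made list in place of the real sys.argv:

```python
# Simulating sys.argv for the call "python example.py abc blah blupp":
# index 0 holds the script name, the extra words follow.
argv = ["example.py", "abc", "blah", "blupp"]

member_count = len(argv)   # 4
first_word = argv[1]       # "abc", since index 0 is the script name
```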

Sorting lists

The method sort() sorts a list in place:

>>> mylist = ["armadillo", "zebra", "guppy", "cat"]
>>> mylist.sort()
>>> mylist
['armadillo', 'cat', 'guppy', 'zebra']

The function sorted(), on the other hand, leaves the original list unchanged, but returns a sorted list.

>>> mylist = ["armadillo", "zebra", "guppy", "cat"]
>>> sorted(mylist)
['armadillo', 'cat', 'guppy', 'zebra']
>>> mylist
['armadillo', 'zebra', 'guppy', 'cat']


The parameter reverse=True inverts the sorting order.

>>> sorted(mylist, reverse = True)
['zebra', 'guppy', 'cat', 'armadillo']
>>>
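Both sort() and sorted() also accept a key function that maps each element to the value to sort by. As a quick illustration, key=len sorts the words by their length:

```python
# key=len maps each word to its length; sorted() then sorts
# by those lengths (shortest first; ties keep their original order).
mylist = ["armadillo", "zebra", "guppy", "cat"]
by_length = sorted(mylist, key=len)
# by_length is ['cat', 'zebra', 'guppy', 'armadillo']
```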

Sorting Python dictionaries by their values

Suppose we have a dictionary of word counts, for example

counts = { "the" : 3, "cat":2, "and":1, "chased":1, "dog":1}


If we try to sort it using the same formulation as for lists above, we get the dictionary keys in alphabetical order:
>>> sorted(counts)
['and', 'cat', 'chased', 'dog', 'the']

If we sort key/value pairs, we still get them in alphabetical order of the keys:
>>> sorted(counts.items())
[('and', 1), ('cat', 2), ('chased', 1), ('dog', 1), ('the', 3)]

We need to change the way that sorted() looks at the items that it is sorting. In particular, we want to map each key/value pair to the value, as this is what sorted() should consider for sorting. To do this, we define a function that maps pairs to the second element, and hand it on to sorted() as the sorting key function:

def second_of_pair(pair):
    return pair[1]
 

>>> second_of_pair( ("a", 1) )
1
>>> second_of_pair( ("the", 2) )
2

>>> sorted(counts.items(), key = second_of_pair)
[('and', 1), ('chased', 1), ('dog', 1), ('cat', 2), ('the', 3)]


To get the highest counts first, we do

>>> sorted(counts.items(), key = second_of_pair, reverse = True)
[('the', 3), ('cat', 2), ('and', 1), ('chased', 1), ('dog', 1)]


We only need the function second_of_pair() in one place: as a sorting key for sorted(). For cases like these, Python has "nameless functions", written with the keyword lambda. We can write

sorted(counts.items(), key = lambda pair: pair[1], reverse = True)

to get the same result as the previous call to sorted(). Note that while the separate function definition of second_of_pair() needed a "return" statement, a lambda consists of a single expression whose value is returned automatically, so no "return" appears.
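As a side note, the standard library module operator offers itemgetter, which builds exactly this kind of key function: itemgetter(1) behaves like the lambda above.

```python
# itemgetter(1) returns a function that picks out element 1 of a
# sequence, just like the lambda above.
from operator import itemgetter

counts = { "the" : 3, "cat":2, "and":1, "chased":1, "dog":1}
by_count = sorted(counts.items(), key=itemgetter(1), reverse=True)
# the highest count comes first
```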

Inverting a dictionary / dictionaries with lists as values

Suppose you have a dictionary mapping German words to English words, as in

dedict = { "das":"the", "Mondschaf":"moonsheep", "steht":"stands"}


Then this dictionary can be converted to an English-to-German dictionary using

eddict = { }
for german, english in dedict.items():
    eddict[ english ] = german

>>> eddict
{'moonsheep': 'Mondschaf', 'the': 'das', 'stands': 'steht'}

But watch out: Dictionaries can map two different keys to the same value, but not one key to multiple values. So in the following case the inversion loses information:
>>> dedict = {"das":"the", "Mondschaf":"moonsheep", "steht":"stands", "der":"the"}
>>> eddict = { }
>>> for german, english in dedict.items():
...     eddict[ english ] = german
...
...

>>> eddict
{'moonsheep': 'Mondschaf', 'the': 'das', 'stands': 'steht'}

To solve the problem, we can create a dictionary in which the values are lists: Each English term is mapped to a list of German terms.

eddict = { }
for german, english in dedict.items():
    if english not in eddict:
            eddict[ english ] = [ german ]
    else:
            eddict[ english ].append(german)
 
>>> eddict
{'moonsheep': ['Mondschaf'], 'the': ['der', 'das'], 'stands': ['steht']}
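The if/else in the loop above can be written more compactly with the dictionary method setdefault, which returns the existing value for a key, or first stores (and returns) a supplied default if the key is absent:

```python
# setdefault(english, []) returns the list already stored for english,
# or first stores and returns a new empty list if english is absent.
dedict = {"das":"the", "Mondschaf":"moonsheep", "steht":"stands", "der":"the"}

eddict = { }
for german, english in dedict.items():
    eddict.setdefault(english, []).append(german)
```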

Downloading news stories from BBC News (or any other news site)


To do this, we first need to figure out what links to subpages with actual news stories look like on the news site we are interested in. For BBC News, here is the code:

# accessing material online
import urllib
# regular expression support
import re
# operating system things
import os

# BBC news main page
url = "http://www.bbc.com/news/"
# read the main page
bbcmain = urllib.urlopen(url).read()

write_files_to = "/Users/katrinerk/Desktop/bbcfiles/"

# find all addresses of subpages with stories
# assumption:
# they have the form
# <a class="story" href="...">
subpages = re.findall("<a class=\"story\"[^>]*href=\"(.*?)\"[^>]*>", bbcmain)

for subpage_url in subpages:
    # determine complete address of the subpage
    if subpage_url.startswith("http:"):
        # complete address, retrieve as is
        pass
    elif subpage_url.startswith("/news/"):
        # partial address, add BBC
        subpage_url = "http://www.bbc.com" + subpage_url
    else:
        # partial address, but we're not completely sure what it is
        subpage_url = "http://www.bbc.com" + subpage_url

    # retrieve data, store
    newname = os.path.basename(subpage_url)
    urllib.urlretrieve(subpage_url, filename = write_files_to + newname)

# extracting text and discarding HTML code
from bs4 import BeautifulSoup
import os

# read all files in directory write_files_to,
# store contents in text_contents (you may want to do something more meaningful with the text)
text_contents = [ ]
for filename in os.listdir(write_files_to):
    wholename = os.path.join(write_files_to, filename)
    htmldoc = open(wholename).read()
    soup = BeautifulSoup(htmldoc, "html.parser")
    text_contents.append(soup.get_text())
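Before pointing the link-extracting regular expression at the live page, it can be tried out on a small hand-made snippet (the addresses below are invented):

```python
# Trying the story-link pattern on a made-up piece of HTML.
import re

snippet = ('<a class="story" href="/news/world-123">One</a>'
           '<a class="other" href="/sports">Two</a>'
           '<a class="story" href="http://www.bbc.com/news/uk-456">Three</a>')

links = re.findall("<a class=\"story\"[^>]*href=\"(.*?)\"[^>]*>", snippet)
# only the two class="story" links match
```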



Exception handling

# some commands can fail, for example a user-provided filename
# may not be the name of an existing file
filename = raw_input("Please give me the name of a file to open:")

# with try... except we can "catch" an error if it occurs,
# and process it in a sensible fashion
# rather than breaking off the program
try:
    f = open(filename)
except IOError, err:
    print "Could not open file:", err

# We can for example keep asking the user for a filename until
# we can finally open it:
success = False
while not(success):
    filename = raw_input("Please give me a filename:")
    success= True
    try:
        f = open(filename)
    except IOError:
        print "Could not open file, please try again."
        success = False

# It is also possible to write code that
# throws an error.
# This is a very useful way of signaling that
# some unexpected situation has been encountered.
def throws():
    raise RuntimeError("Hello, this is an error.")

try:
    throws()
except Exception, err:
    print "caught a runtime error, error message is:", err


# For more on working with Python errors, see
# http://www.doughellmann.com/articles/how-tos/python-exception-handling/
# http://python.about.com/od/gettingstarted/ss/begpyexceptions_all.htm
# https://en.wikibooks.org/wiki/Python_Programming/Exceptions
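A try statement can also carry an else block, which runs only when no error occurred, and a finally block, which runs in every case and is useful for cleanup. A small sketch, using an invented event list to record what happens:

```python
# try/except/else/finally:
# "else" runs only when no error occurred,
# "finally" runs in every case -- useful for cleanup.
events = []

def attempt(x):
    try:
        result = 10 / x
    except ZeroDivisionError:
        events.append("failed")
        result = None
    else:
        events.append("succeeded")
    finally:
        events.append("done")
    return result

value = attempt(5)    # succeeds
nothing = attempt(0)  # the division by zero is caught
```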

Walking directory structures and accessing any file on it


# read all files in all subdirectories of given directory
# for example, files from British National Corpus
# The following code just prints the names of all files and directories

import os

indir = "/Users/katrinerk/Data/"

for dirpath, dirnames, filenames in os.walk(indir):
    print "current directory is", dirpath
    print "subdirectories are:"
    for d in dirnames: print "\t", d
    print "files are:"
    for f in filenames: print "\t", f

# The following code provides a scaffold for accessing all files somewhere below a given directory

indir = "/Users/katrinerk/Data/penn-treebank-rel3/parsed/mrg/brown"
for dirpath, dirnames, filenames in os.walk(indir):
    for filename in filenames:
        actual_filename = dirpath + "/" + filename
        print actual_filename
        f = open(actual_filename)
        # process....
        f.close()
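A side note: joining dirpath and filename with "/" works on Mac and Linux; os.path.join builds the path with the separator appropriate for the operating system (the filename below is invented):

```python
# os.path.join inserts the right separator for the platform.
import os

actual_filename = os.path.join("/Users/katrinerk/Data", "cj01.mrg")
```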


Counting words: NLTK FreqDist objects

import nltk

# open "Alice in Wonderland"
alice_text = open("/Users/katrinerk/Teaching/2012/fall/corpora/material/pg11.txt").read()
alice = nltk.word_tokenize(alice_text)

# NLTK FreqDist object: like a dictionary, but with keys sorted by frequency
fd = nltk.FreqDist(alice)

# the 10 most frequent words
fd.keys()[:10]
# the 20 most frequent words and their frequencies
for word, freq in fd.items()[:20]:
    print word, freq

# frequency of modals
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print m, ":", fd[m]
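Under the hood, a FreqDist is essentially a dictionary from words to counts. If NLTK is not at hand, the same counting can be done with a plain dictionary; a sketch on a tiny invented token list:

```python
# Counting word frequencies with a plain dictionary,
# mimicking what nltk.FreqDist does.
tokens = ["the", "cat", "and", "the", "dog", "and", "the", "mouse"]

fd = {}
for word in tokens:
    fd[word] = fd.get(word, 0) + 1

# most frequent first, like FreqDist's ordering
by_freq = sorted(fd.items(), key=lambda pair: pair[1], reverse=True)
```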

# NLTK CondFreqDist object: keeping separate frequency distributions for different things
# bigrams: pairs of words
alice_bigrams = nltk.bigrams(alice)

# for each word, what is the frequency of next words?
cfd = nltk.ConditionalFreqDist(alice_bigrams)

# all the words that appear as 1st word in bigrams
# same as keys() for dictionaries
cfd.conditions()

# what are the entries for "Alice"?
cfd["Alice"]

# what are the 10 most frequent words to follow "Alice"?
cfd["Alice"].items()[:10]

# ConditionalFreqDist method tabulate():
# show contingency table for conditions cross-tabulated with samples
cfd.tabulate(conditions = ["Alice", "Hatter", "Hare", "Queen"], samples = ["said", "shouted", "replied", "shrieked"])


# from the NLTK book: ConditionalFreqDist mapping from Brown genre to word
from nltk.corpus import brown
genres_and_words = [ (genre, word) for genre in brown.categories() for word in brown.words(categories=genre)]
cfd_brown = nltk.ConditionalFreqDist(genres_and_words)
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd_brown.tabulate(conditions = genres, samples = modals)

Reading XML files using ElementTree

The following code uses this file:
<anthology>
  <poem source="Alice in Wonderland" length="short"><title>How doth the little</title>
    <author>Lewis Carroll</author>
    <stanza num="1st">
      <line>How doth the little crocodile</line>
       <line>Improve his shining tail,</line>
       <line>And pour the waters of the Nile</line>
       <line>On every golden scale! </line>
    </stanza>
    <stanza num="2nd">
       <line>How cheerfully he seems to grin,</line>
       <line>How neatly spread his claws, </line>
       <line>And welcome little fishes in</line>
       <line>With gently smiling jaws! </line>
    </stanza>
  </poem>
</anthology>

XML data structures can be read, manipulated, created, and written using ElementTree. We illustrate here just the reading:
# small demo of the ElementTree package
# see http://docs.python.org/2/library/xml.etree.elementtree.html
# and http://effbot.org/zone/element.htm
import xml.etree.ElementTree as ET

# tree will be an ElementTree
tree = ET.parse("/Users/katrinerk/Teaching/2012/fall/corpora/material/crocodile.xml")

# getting the root: this is an Element data structure
root = tree.getroot()
root

# Elements have tags
print root.tag

# We can access children like list elements
root[0]
root[0][2]

# We can iterate over children of a node
poem = root[0]
for child in poem:
    print child.tag

# The attributes of an XML element are available in a dictionary
poem.attrib.keys()
poem.attrib.items()
poem.attrib["source"]
# or more shortly
poem.keys()
poem.items()
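If the XML data is already in a string rather than in a file, ET.fromstring() parses it directly and returns the root Element. A self-contained sketch using a shortened fragment of the poem file:

```python
# Parsing XML from a string: fromstring() returns the root Element.
import xml.etree.ElementTree as ET

xml_text = """<anthology>
  <poem source="Alice in Wonderland" length="short">
    <title>How doth the little</title>
    <author>Lewis Carroll</author>
  </poem>
</anthology>"""

root = ET.fromstring(xml_text)
poem = root[0]
source = poem.attrib["source"]   # the value of the source attribute
author = poem[1].text            # text of the second child, <author>
```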



Finding constituents in a Penn Treebank POS file


# Given a POS-tagged Penn Treebank file,
# use the [..] bracket information to
# identify constituents. Store each sentence
# as a list of constituents,
# where each constituent is a list of words
#
# Usage:
# python constituents.py <filename>

import sys

if len(sys.argv) < 2:
    print "Error: I need a Penn Treebank file as a command line argument."
    sys.exit()

f = open(sys.argv[1])

# iterate through the lines of the file,
# collecting the words of each sentence

# first just store a list of sentences
# each sentence a list of words and brackets
sentences_raw = [ ]
current_sentence = [ ]

for line in f:
    if line.startswith("*x*"):
        # preamble, skip
        pass
 
    elif line.strip() == "":
        # empty line, skip
        pass

    elif line.startswith("======"):
        # sentence boundary line:
        # store current sentence (unless it's empty)
        if len(current_sentence) > 0:
            sentences_raw.append(current_sentence)
        current_sentence = [ ]

    else:
        # This is an interesting line
        # split it into words
        # and store it in current_sentence
        current_sentence.extend(line.split())

# done iterating through lines in the file
# but current_sentence may still contain the last, unsaved sentence
if len(current_sentence) > 0:
    sentences_raw.append(current_sentence)

# now re-encode each sentence: separate into constituents
sentences = [ ]
for sentence in sentences_raw:
    current_constituent = [ ]
    inside_brackets = False
    current_sentence = [ ]

    for word in sentence:
        if word == "[":
            # start of constituent
            inside_brackets = True

        elif word == "]":
            # end of constituent:
            # save current constituent
            if len(current_constituent) > 0:
                current_sentence.append(current_constituent)
            current_constituent = [ ]
            inside_brackets = False
           
        else:
            # a word, not a bracket
            # save in the current constituent
            current_constituent.append(word)
            # if we are not inbetween brackets,
            # we have a one-word constituent: save it right away
            if not(inside_brackets):
                current_sentence.append(current_constituent)
                current_constituent = [ ]
        # end if
    # end: for word in sentence

    sentences.append(current_sentence)
# end: for sentence in raw_sentences

# now print the constituents of each sentence
sentno = 0
for sentence in sentences:
    print "Constituents of sentence", sentno, ":"
    for const in sentence:
        print "\t", " ".join(const)
    sentno += 1
    raw_input()
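To see the input/output behavior of the bracket-handling loop in isolation, its core can be repackaged as a function and tried on a single hand-made token list:

```python
# The constituent-grouping logic from above, as a function:
# tokens between [ and ] form one constituent, tokens outside
# brackets each form a one-word constituent.
def split_constituents(tokens):
    constituents = []
    current = []
    inside = False
    for word in tokens:
        if word == "[":
            inside = True
        elif word == "]":
            if len(current) > 0:
                constituents.append(current)
            current = []
            inside = False
        else:
            current.append(word)
            if not inside:
                constituents.append(current)
                current = []
    return constituents

constituents = split_constituents(["[", "The/DT", "dog/NN", "]", "barked/VBD", "./."])
# constituents is [['The/DT', 'dog/NN'], ['barked/VBD'], ['./.']]
```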

       

Analyzing Penn Treebank parse files

Here we extract all the NPs that are marked as subjects.

# Determining subject noun phrases in a Penn treebank file
import sys

############
# discover the subject NP that starts at
# the first character of the string given as parameter
def discover_npsubj(text):
    opening_count = 0
    closing_count = 0
    for index in range(len(text)):
        if text[index] == "(": opening_count += 1
        if text[index] == ")": closing_count += 1

        if opening_count == closing_count:
            # we've reached the end of the subject NP.
            # Return the text up to and including the current
            # index, i.e. the slice text[:index+1]
            return text[:index+1]

    # We have reached the end of the text without the
    # parentheses balancing. Something is wrong.
    raise RuntimeError("Parenthesis mismatch")

#####
# execution starts here
#####

if len(sys.argv) < 2:
    print "Please give the name of a"
    print "Penn Treebank file with syntactic annotation"
    print "as a command line parameter"
    sys.exit()
   
f = open(sys.argv[1])

filecontents = ""
for line in f:
    if line.startswith("*x*"):
        # preamble, skip
        pass
 
    else:
        # This is an interesting line
        # look for a subject NP
        # The call to strip() removes the linebreaks
        filecontents += line.strip()

for index in range(len(filecontents)):
    if filecontents[index:].startswith("(NP-SBJ"):
        try:
            result = discover_npsubj(filecontents[index:])
        except RuntimeError, err:
            print "ERROR at", index, ":", err
        else:
            print result



