
Python worksheet: accessing and processing text

Automatically accessing and downloading text from the web

We have used texts from Project Gutenberg before. But so far, we have always downloaded the texts manually. We can also do this automatically, which is useful when we want to access larger amounts of text. (For a single text, it may be faster to do the downloading by hand, since it means we don't have to figure out the naming scheme of the site we are interested in.) To download web content using Python, we can use the Python package urllib, documented at http://docs.python.org/library/urllib.html


The first step is to find out the URL of the file or files that we are interested in. Let's assume we are interested in John Donne's "Devotions Upon Emergent Occasions". Note that the Gutenberg main page does not allow automatic access, as stated at http://www.gutenberg.org/wiki/Gutenberg:Terms_of_Use. But there are mirrors for which automatic access is allowed.

"Devotions Upon Emergent Occasions" by John Donne is located at "ftp://sailor.gutenberg.lib.md.us/gutenberg/2/3/7/7/23772/23772.txt". We can now access it as follows:

import urllib
url = "ftp://sailor.gutenberg.lib.md.us/gutenberg/2/3/7/7/23772/23772.txt"
f = urllib.urlopen(url)
raw = f.read()
f.close()
raw[:1000]

As you can see, opening and reading a web page works in almost the same way as opening and reading a local file: We start with

f = urllib.urlopen(...)

and then we can access the data with f.read(), as if it were a local file.
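
The parallel with local files can be checked without network access. The following is a minimal sketch, written for Python 3, where urlopen() lives in urllib.request rather than directly in urllib as in the examples on this page. The temporary file and its one-line content are made up for illustration, and a file:// URL stands in for the FTP address.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen  # Python 3 location of urlopen()

# Create a small stand-in file (hypothetical content, not the Gutenberg text).
tmp_dir = tempfile.mkdtemp()
local_path = os.path.join(tmp_dir, "sample.txt")
with open(local_path, "w", encoding="utf-8") as out:
    out.write("No man is an island, entire of itself.")

# Read it the usual way, as a local file (binary mode gives raw bytes).
with open(local_path, "rb") as f:
    from_file = f.read()

# Read it through urlopen(), using a file:// URL for the same path.
with urlopen(Path(local_path).as_uri()) as f:
    from_url = f.read()

print(from_url == from_file)
```

Note that in Python 3, read() on the object returned by urlopen() yields bytes, so the result has to be decoded (for example with from_url.decode("utf-8")) before it can be treated as text.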

Sometimes, we may want to download web content automatically, but would like to store it in files rather than process it directly. The urllib package supports this too:

import urllib
url = "ftp://sailor.gutenberg.lib.md.us/gutenberg/2/3/7/7/23772/23772.txt"
# download the text, and store on my desktop in "donne.txt"
urllib.urlretrieve(url, "/Users/katrinerk/Desktop/donne.txt")
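
A hedged Python 3 variant of the same idea: there, the function is urllib.request.urlretrieve(). To keep the sketch runnable offline, it retrieves a made-up local file through a file:// URL and stores the copy in a temporary directory; the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve  # Python 3 location of urlretrieve()

# Hypothetical source file standing in for a remote text.
work_dir = tempfile.mkdtemp()
source_path = os.path.join(work_dir, "source.txt")
with open(source_path, "w", encoding="utf-8") as out:
    out.write("Devotions, sample line.\n")

# Fetch the content at the URL and store it in the target file.
target_path = os.path.join(work_dir, "donne_copy.txt")
saved_path, headers = urlretrieve(Path(source_path).as_uri(), target_path)

# The stored copy can now be opened like any other local file.
with open(saved_path, encoding="utf-8") as f:
    downloaded = f.read()
print(downloaded)
```

urlretrieve() returns the name of the file it wrote together with the response headers, so saved_path here is the same as target_path.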



We can now process this web content just like normal text. As a first step, we will break it up into words. We will use nltk.word_tokenize() rather than split(). Here is an example of how the two differ:

>>> import nltk
>>> "This is an example.".split()
['This', 'is', 'an', 'example.']
>>> nltk.word_tokenize("This is an example.")
['This', 'is', 'an', 'example', '.']

So nltk.word_tokenize() is a bit smarter in its handling of punctuation. Here is how we can tokenize the Gutenberg text that we just downloaded. Afterwards, we can load the result into the nltk.Text() format, and inspect it.

>>> raw = open("/Users/katrinerk/Desktop/donne.txt").read()
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
89031
>>> tokens[:20]
['Project', 'Gutenberg', "'s", 'Devotions', 'Upon', 'Emergent', 'Occasions',
',', 'by', 'John', 'Donne', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of',
'anyone', 'anywhere']


>>> text = nltk.Text(tokens)
>>> text.collocations()
Project Gutenberg-tm; thou hast; thy Son; Mr. Donne; Project
Gutenberg; Literary Archive; Dr. Donne; Gutenberg-tm electronic; thou
wilt; gracious God; Gutenberg Literary; St. Paul; Sir Robert; Sir
George; Holy Ghost; thy servant; Archive Foundation; thou didst;
electronic works; United States
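
To get a feeling for what the two steps above do (splitting off punctuation, then looking for words that occur next to each other), here is a rough standard-library sketch in Python 3. It is not NLTK's algorithm: simple_tokenize() is a hypothetical helper based on a regular expression, and the pair counting is only the raw material for collocation finding, without the filtering and scoring that NLTK applies. The sample sentence is made up.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "This is an example. Thou hast done; thou hast not done."
tokens = simple_tokenize(sample)

# Count adjacent lowercased word pairs, skipping pairs that contain punctuation.
words = [t.lower() for t in tokens]
pairs = Counter(
    (a, b) for a, b in zip(words, words[1:]) if a.isalpha() and b.isalpha()
)

print(tokens[:5])             # ['This', 'is', 'an', 'example', '.']
print(pairs.most_common(1))   # [(('thou', 'hast'), 2)]
```

The regular expression is cruder than nltk.word_tokenize(): for instance, it would split "Gutenberg's" into three tokens rather than two.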

Handling HTML input

The Gutenberg file we just downloaded was plain text. But a lot of data on the web is in HTML instead. HTML looks a lot like XML, but the tags it uses are pre-defined and are interpreted by browsers as formatting commands.

We can read HTML files using urlopen() again, and use BeautifulSoup to remove the HTML markup. BeautifulSoup is a package that you need to install on your machine before you can use it. You can find it at http://www.crummy.com/software/BeautifulSoup/.


>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urllib.urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

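from bs4 import BeautifulSoup
# name the parser explicitly; otherwise bs4 picks one itself and issues a warning
soup = BeautifulSoup(html, "html.parser")
# get_text() returns the text of the page with the tags removed
text = soup.get_text()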
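
If BeautifulSoup is not installed, the basic idea of keeping the text and dropping the tags can be tried out with the standard library. This is a minimal Python 3 sketch using html.parser, not a replacement for BeautifulSoup: TextExtractor is a hypothetical helper class, the HTML snippet is made up, and nothing special is done for scripts or stylesheets.

```python
from html.parser import HTMLParser  # Python 3 standard library

class TextExtractor(HTMLParser):
    """Collect the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        # Called by the parser for every stretch of text outside of tags.
        self.pieces.append(data)

# Hypothetical snippet, not the actual BBC page.
sample_html = "<p>This is <b>an example</b> page.</p>"

extractor = TextExtractor()
extractor.feed(sample_html)
extractor.close()
plain = "".join(extractor.pieces)
print(plain)  # This is an example page.
```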

By the way: If you wanted to automatically download current news stories from the BBC webpage for processing, how could you do that? They are linked from the BBC main page -- but what are the URLs of these subpages?

  • First step: use urllib.urlopen() to access the page "http://www.bbc.co.uk/news/"

  • The resulting file has a lot of formatting commands. You don't want to discard them at this point, because they will also contain the links to the actual news stories that you want to access. Instead, try to use the Python string method find() to locate the part in the main page that pertains to the first news story. Can you now figure out how to (1) locate passages in the BBC main text file that describe links to actual news stories, (2) determine the URL for those news stories?
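
As a starting point for this exercise, here is one possible way to walk through a page with find(), sketched in Python 3. Everything in it is an assumption for illustration: find_links() is a hypothetical helper, the markup is invented, and the real BBC page will be structured differently, so the marker strings and the "/news/" filter would need to be adapted after inspecting the actual file.

```python
def find_links(html):
    """Return the values of all href="..." attributes, in order of appearance."""
    links = []
    marker = 'href="'
    start = html.find(marker)
    while start != -1:
        url_start = start + len(marker)
        url_end = html.find('"', url_start)
        if url_end == -1:
            break  # unterminated attribute; stop rather than guess
        links.append(html[url_start:url_end])
        # Continue searching after the link we just extracted.
        start = html.find(marker, url_end)
    return links

# Invented markup standing in for a news front page.
sample_page = (
    '<div><a href="/news/story-one">First story</a></div>'
    '<div><a href="/news/story-two">Second story</a></div>'
    '<a href="/sport/">Sport</a>'
)

# Keep only links that look like news stories (an assumed naming scheme).
story_links = [link for link in find_links(sample_page) if link.startswith("/news/")]
print(story_links)  # ['/news/story-one', '/news/story-two']
```

Links found this way may be relative, as in the invented example, in which case the site address has to be put in front of them before they can be passed to urlopen().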


Text encoding

Many of you will be dealing with texts in different languages. Internally, characters are represented by numbers. Some characters (A-Z, a-z) are privileged historically in that they have received shorter encodings. Unicode provides encodings for a huge number of additional alphabets. See https://en.wikipedia.org/wiki/Unicode
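
The character numbers, and the difference in encoding length, can be inspected directly. The sketch below uses Python 3, where every string is a Unicode string; the word is taken from the Portuguese title used later on this page, and UTF-8 is chosen as the encoding for illustration.

```python
word = "Revolução"

# ord() gives the number of a character; encode() gives its bytes in UTF-8.
for ch in word:
    # ascii() shows non-ASCII characters as escape codes such as '\xe7'
    print(ascii(ch), ord(ch), len(ch.encode("utf-8")))

code_points = [ord(ch) for ch in word]
byte_counts = [len(ch.encode("utf-8")) for ch in word]

# 9 characters, but 11 bytes: ç and ã take two bytes each in UTF-8.
print(len(word), len(word.encode("utf-8")))
```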

Within Python, Unicode strings can be handled just like other strings. But for storing in files and display on screen, it is necessary to encode them. To do that, we need the Python codecs module. We will also need to know the encoding that a file uses.

As an example, we use a text in Portuguese, from Project Gutenberg: A Revolução Portugueza: O 31 de Janeiro (Porto 1891) by Francisco Jorge de Abreu, at http://www.gutenberg.org/ebooks/29484

Download the plain text version to a local file. The Project Gutenberg page informs us that the text is encoded in UTF8, a Unicode encoding. We need to specify this as we open the file, in order to decode it: 

import codecs
f = codecs.open("/Users/katrinerk/Downloads/pg29484.txt", 'r', "utf-8")
text = f.read()

>>> print text.encode('utf-8')[:100]
The Project Gutenberg EBook of A Revolução Portugueza: O 31 de Janeiro
(Porto 1891), by Jorge de

# If we inspect it without encoding, the non-ASCII characters are shown as escape codes

>>> text[:100]
u'The Project Gutenberg EBook of A Revolu\xe7\xe3o Portugueza: O 31 de Janeiro\r\n(Porto 1891), by Jorge de Ab'


It is important that we specify the encoding when opening the text. If we do not, read() returns a plain byte string. When we then call encode() on it, Python first tries to decode the bytes as ASCII, which fails at the first non-ASCII byte:

>>> f = open("/Users/katrinerk/Downloads/pg29484.txt")
>>> text = f.read()
>>> text.encode('utf-8')[:100]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 39: ordinal not in range(128)
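
The same failure can be reproduced in isolation. In this Python 3 sketch the decoding step is explicit rather than implicit as in the session above; the byte string is the UTF-8 form of the single word "Revolução", typed in by hand rather than read from the Gutenberg file.

```python
# The UTF-8 bytes of "Revolução": ç is \xc3\xa7 and ã is \xc3\xa3.
raw_bytes = b"Revolu\xc3\xa7\xc3\xa3o"

# Decoding as ASCII fails at the first byte outside the range 0-127.
try:
    raw_bytes.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as err:
    ascii_error = err

# Decoding with the correct encoding succeeds.
decoded = raw_bytes.decode("utf-8")

print(ascii_error)
print(len(raw_bytes), len(decoded))  # 11 9
```

The offending byte is 0xc3, the first of the two bytes that encode ç, just as in the traceback above.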


More information on working with Unicode in Python is at http://www.evanjones.ca/python-utf8.html

To write Unicode to a file, again use codecs.open():

outf = codecs.open("/Users/katrinerk/Desktop/test.txt", "w", "utf-8")
print >> outf, text[:200]
outf.close()
f = codecs.open("/Users/katrinerk/Desktop/test.txt", "r", "utf-8")
newtext = f.read()
f.close()
print newtext.encode('utf-8')
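
For comparison, a sketch of the same write-and-read-back round trip in Python 3, where the built-in open() accepts an encoding argument directly. The file is written to a temporary directory rather than the desktop, and the sample text is just the title of the Portuguese book.

```python
import os
import tempfile

sample_text = "A Revolução Portugueza: O 31 de Janeiro"

# Write the text, encoding it as UTF-8 on the way out.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as outf:
    outf.write(sample_text)

# Read it back, decoding from UTF-8 on the way in.
with open(path, "r", encoding="utf-8") as f:
    newtext = f.read()

print(newtext == sample_text)
```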