Counting words in a text
Suppose we wanted to count how often each word occurs in the English Wikipedia page for Monty Python. Then a natural language description of the algorithm (the method) could look approximately like this:
This algorithm uses:
Although this is a very simple algorithm, it uses almost all the main data types
and types of commands that we will encounter in this introduction to
Python. This worksheet introduces three of them:
Boolean expressions are expressions whose value is either True or False. They
play an important role in conditions. You have already encountered some
of them when we were discussing strings, for example "in" and "startswith".
Here are some Boolean expressions that compare numbers:
You can compare expressions that yield a number using <, >, <=,
and >=. "Equals" is expressed using ==, and "does not equal" is "!=".
Note that it's "==" if we want to do a comparison. "=" assigns a value.
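Here is a minimal sketch of such comparisons. The numbers and the variable name x are chosen only for illustration.

```python
# Each comparison is a Boolean expression: its value is True or False.
print(3 < 4)        # True
print(3 > 4)        # False
print(2 + 2 <= 4)   # True
print(10 >= 11)     # False
print(5 == 5)       # True
print(5 != 5)       # False

# A single "=" assigns a value; a double "==" compares.
x = 7               # assignment: x now holds 7
print(x == 7)       # comparison: True
```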
The same operators work for strings as well. What do you think they do?
The type of such expressions is 'bool':
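A quick way to check this is to ask Python for the type directly. The variable name is_small below is only an illustration.

```python
# The result of a comparison can be stored in a variable like any other value.
is_small = 3 < 4
print(is_small)          # True
print(type(is_small))    # <class 'bool'>
print(type(True))        # <class 'bool'>
```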
In natural language, you express conditions using "if": "If this word is not yet on our count sheet, put it there". In Python, it is very similar:
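As a small sketch of a condition, the following example labels a word as long if it has more than five letters. The variable names word and label, and the cutoff of five, are made up for this illustration.

```python
word = "python"
label = "short"

# The condition ends with a colon.
# The indented block below it runs only if the condition is True.
if len(word) > 5:
    label = "long"

print(word, "is a", label, "word")   # python is a long word
```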
Note the notation:
Or, to put it another way:
This is very similar to the notation for defining your own functions, and in fact you will see the same notation again in other contexts.
Often we have two different things that we want to do, depending on whether a
Boolean expression is True or False. For example, the math package
function log() gives you an error message if you try to compute the
logarithm of a number that is zero or negative. This can be a problem
for some formulas, and people often get around this problem by using 0
in their computation whenever the log is undefined.
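One way to write this workaround is sketched below. The function name safe_log is hypothetical; only math.log() comes from the math package.

```python
import math

def safe_log(x):
    # Hypothetical helper: the logarithm of x where it is defined, else 0.
    if x > 0:
        return math.log(x)
    else:
        return 0

print(safe_log(1))    # 0.0
print(safe_log(0))    # 0
print(safe_log(-3))   # 0
```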
So the general shape of if...else... commands is
Note the indentation levels!
If you want to use more than two cases, the command is if... elif...
elif... else... The following function guesses at a word's part of
speech based on its suffix. It guesses "noun" for words ending in "ion"
or "ism", "verb" for words ending in "ed", "adjective" if a word ends in
"able", and if it has no better idea, it will just guess "noun".
Try it for yourself:
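As a starting point for experimenting, the suffix-based guesser described above might look roughly like this. The function name guess_pos is an assumption, and each suffix gets its own branch so that only if, elif, and else are needed.

```python
def guess_pos(word):
    # Guess a part of speech from the word's suffix.
    if word.endswith("ion"):
        return "noun"
    elif word.endswith("ism"):
        return "noun"
    elif word.endswith("ed"):
        return "verb"
    elif word.endswith("able"):
        return "adjective"
    else:
        # No better idea: fall back to "noun".
        return "noun"

print(guess_pos("station"))    # noun
print(guess_pos("walked"))     # verb
print(guess_pos("readable"))   # adjective
print(guess_pos("dog"))        # noun
```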
Sometimes you need to
combine multiple Boolean expressions. For example, suppose you wanted to
define medium-length words as words with at least 4 and at most 10
letters. Here is a function that returns True for medium-length words,
and False for others:
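A sketch of such a function could look like this. The name is_medium_length is an assumption made for this example.

```python
def is_medium_length(word):
    # True for words with at least 4 and at most 10 letters.
    if len(word) >= 4 and len(word) <= 10:
        return True
    else:
        return False

print(is_medium_length("cat"))            # False
print(is_medium_length("word"))           # True
print(is_medium_length("dictionary"))     # True
print(is_medium_length("encyclopedia"))   # False
```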
The central point is: if len(word) >= 4 and len(word) <= 10
There is also "or", as in: if len(word) < 4 or len(word) > 10: return False
The reserved word "not" flips the value of a Boolean expression:
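For instance, with a few expressions chosen only for illustration:

```python
print(not True)                    # False
print(not 3 > 4)                   # True, because 3 > 4 is False

word = "python"
print(not word.startswith("p"))    # False, because the word does start with "p"
```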
Try it for yourself: Write another version of the function that filters out function words. Instead of using if...elif..elif..else, use a single "if" condition, but combine Boolean expressions using "and" or "or".
A list is a sequence of items, like a shopping list. In Python, you write them with straight brackets around them, and with commas between items:
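Two small examples, with made-up contents and variable names:

```python
# A list of strings and a list of numbers.
shopping_list = ["milk", "eggs", "bread"]
numbers = [3, 1, 4, 1, 5]

print(shopping_list)         # ['milk', 'eggs', 'bread']
print(numbers)               # [3, 1, 4, 1, 5]
print(type(shopping_list))   # <class 'list'>
```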
You can make lists with only one item, or even no items on them:
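The following sketch shows both cases. The variable names one_item and no_items are chosen for this example.

```python
one_item = [123]   # a list with a single item
no_items = []      # an empty list

print(type(one_item))   # <class 'list'>
print(type(123))        # <class 'int'>
print(len(one_item))    # 1
print(len(no_items))    # 0
```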
As you can see, the list [123] with the number 123 as its single item is still a list, and has a different type from the number 123. And a note: It does not matter whether you write [123] or [ 123 ], and whether you write [] or [ ]. They are the same to Python. I like to add spaces for better readability.
You can even have a nested list, a list whose items are lists again:
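Here is one possible nested list. Its contents are made up for illustration; the name nested matches the question that follows.

```python
# Each item of this list is itself a list.
nested = [["a", "b"], ["c"], ["d", "e", "f"]]

print(nested[0])      # ['a', 'b']
print(nested[2][1])   # e
```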
What do you think is len(nested) ?
Lists and strings
There are many operations in Python that work for lists like they do for strings. Here are some of the most important ones:
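The sketch below puts a list and a string side by side for a few common operations. The selection of operations and the sample values are my own choice for illustration.

```python
words = ["ngram", "isogram", "cladogram", "pangram"]
text = "pangram"

# Length
print(len(words))            # 4
print(len(text))             # 7

# Indexing
print(words[0])              # ngram
print(text[0])               # p

# Slicing
print(words[1:3])            # ['isogram', 'cladogram']
print(text[1:3])             # an

# Membership
print("isogram" in words)    # True
print("gram" in text)        # True

# Concatenation
print(words + ["anagram"])   # a new list with five items
print(text + "s")            # pangrams
```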
An important function that we have mentioned before is split(). It works on strings and splits them up into a list of substrings. If given no arguments, it splits on whitespace.
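For example, with two made-up strings:

```python
# Without arguments, split() splits on any run of whitespace.
sentence = "we  split on   whitespace"
print(sentence.split())    # ['we', 'split', 'on', 'whitespace']

# With an argument, split() splits on that string instead.
date = "2024-01-31"
print(date.split("-"))     # ['2024', '01', '31']
```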
Try it out for yourself:
Up to now, we have only accessed individual items on a list by using their indices. But one of the most natural things to do with a list is to repeat some action for each item on the list, for example: “For each word in the given list of words: print it”.
Here is how to say this in Python:
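A minimal sketch, assuming a list named words with four sample items:

```python
words = ["ngram", "isogram", "cladogram", "pangram"]

# The indented block is executed once for each item on the list.
for word in words:
    print(word)
```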
There are three things to note here. First, the reserved word that signals repetition is "for".
Second, the overall shape of the "for" loop is very similar to that of a condition and of a function definition: It uses a colon at the end of the line, and an indented block.
Third, “word” is a variable. You could have chosen a different variable name, of course:
The variable in the loop is like the variable in a function definition: You don't need to specify its contents beforehand. In fact, whatever was in abcd123 before the loop gets erased:
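The following sketch illustrates this. The list contents and the earlier value of abcd123 are assumptions made for the example.

```python
words = ["ngram", "isogram", "cladogram", "pangram"]

abcd123 = "some earlier value"

# The loop overwrites abcd123 with each list item in turn.
for abcd123 in words:
    print(abcd123)

# After the loop, abcd123 holds the last item, not the earlier value.
print(abcd123)   # pangram
```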
Typically you will choose as the loop variable one that you haven't used before. In the loop, Python fills it with each item on the list in turn. First, it puts "ngram" in abcd123. This, then, is printed within the block. Then it puts "isogram" into abcd123, and the block is executed with this value of abcd123. In the third execution of the loop, abcd123 is "cladogram", and the fourth time, it is "pangram". Then the list is exhausted, and the loop is done.
Here is another example of a for-loop.
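A sketch of such a loop, assuming a sample list my_list and a counter variable named counter:

```python
my_list = ["ngram", "isogram", "cladogram", "pangram"]

counter = 0                  # initialize the counter
for item in my_list:
    counter = counter + 1    # add one for each item on the list

print(counter)   # 4
```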
This code does the same thing as len(my_list) . But it illustrates a general pattern that you will see very often: You initialize a counter (here: to zero), then you iterate over the list, and change the counter. Here is another example, which adds up all the numbers on a list.
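A sketch of the summing loop might look like this. The list contents and the variable name total are chosen for illustration.

```python
numberlist = [3, 1, 4, 1, 5]

total = 0                    # initialize the running total
for number in numberlist:
    total = total + number   # add the current number

print(total)   # 14
```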
This code does the same as sum(numberlist) .
Try it for yourself:
Here is another useful and frequent code pattern. Often, we want to collect results in a list as we go through a for-loop. For example, we may want to collect all uppercase words from a given text. The following text is from the Wikipedia page on Monty Python.
This code first splits the text into words. It then initializes a list uppercase_words, in which we want to collect results. Initially, that list is empty. We then iterate through all words of our text and check whether each one is an uppercase word, that is, whether word is a string consisting entirely of uppercase letters (see http://docs.python.org/3/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange for the use of isupper()). If that is the case, we append word to our collector list uppercase_words.
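The collector pattern described above can be sketched as follows. Note two assumptions: the sample sentence in mytext is a short stand-in that I made up, not the actual Wikipedia passage, and the check is taken to be word.isupper(), which is True only if all letters in the word are uppercase.

```python
# Stand-in sample text (made up for this sketch, not the Wikipedia passage).
mytext = "Monty Python created the show for the BBC and it aired in the UK"

words = mytext.split()

uppercase_words = []              # the collector list starts out empty
for word in words:
    if word.isupper():            # True only for all-uppercase words
        uppercase_words.append(word)

print(uppercase_words)   # ['BBC', 'UK']
```

If you wanted words that merely start with an uppercase letter, you could test word[0].isupper() instead.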
Try it for yourself:
Collect all words that end in "t" in the same text mytext that we just used.
Suppose you wanted to have a list of the form [0, 1, 2, 3, 4], that is, a series of consecutive numbers. Then you can do that as before:
But since this is a kind of data structure that is needed relatively often, Python has a shortcut for this:
range(n) yields a range (something similar to a list) that starts at 0 and ends at n-1 (not n!). This is like with list slices: Remember that my_list[1:4] gave you the part of the list that started at index 1 and ended at index 3.
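A small sketch of this, using list() to make the contents of the range visible:

```python
numbers = [0, 1, 2, 3, 4]

print(range(5))                     # range(0, 5)
print(list(range(5)))               # [0, 1, 2, 3, 4]
print(list(range(5)) == numbers)    # True
```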
You can also use range() with two parameters instead of one.
range(j, k) yields the numbers from j to k-1:
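For example, with start and end values picked for illustration:

```python
print(list(range(3, 8)))   # [3, 4, 5, 6, 7]
print(list(range(1, 4)))   # [1, 2, 3]
```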
And what does this do? And why?
You can use range() to count to ten:
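One way to do this is sketched below. Because the end value is not included, the range has to go up to 11. The variable name number is my own choice.

```python
# Prints the numbers 1 through 10, one per line.
for number in range(1, 11):
    print(number)
```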
Try it for yourself:
How can you use range() to sum up the numbers from 1 to 20?
You can also use range() to access the position of items on a list.
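A sketch of this idea, assuming a list mylist with three sample items and a loop variable named index:

```python
mylist = ["ngram", "isogram", "cladogram"]

# range(len(mylist)) covers the indices 0, 1, and 2.
for index in range(len(mylist)):
    print(index, mylist[index])
```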
Here is what this does: range(len(mylist)) is range(3), which corresponds to the list [0, 1, 2]. The for-loop iterates over the numbers 0, 1, and 2. For each of them, it prints the number, then the list item with that index.
A more complex task
When we process text, the first step is almost always to break it up into
words, which we can then count, label, or otherwise process. The Python
string function split() gets us most of the way there. It splits up text
on whitespace. But the result is not perfect:
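To see the problem, here is a sketch with a sample sentence. The sentence is made up for this illustration; the variable names text and pieces are likewise assumptions.

```python
# Made-up sample sentence with punctuation attached to some words.
text = "Tokenization (sometimes called word segmentation) is a first step in text processing, and it matters."

pieces = text.split()
print(pieces)

# Punctuation stays glued to the neighboring word:
print("(sometimes" in pieces)    # True
print("sometimes" in pieces)     # False
print("processing," in pieces)   # True
```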
Some of the items on the resulting list are words with punctuation attached to them, for example "(sometimes" or "processing,". This is a problem: When we count words, "(sometimes" and "sometimes" count as different strings -- which is not what we want. Instead, we would like punctuation to be separate, like this:
Write a tokenizer that splits text into words and puts punctuation into separate strings. One Python data structure that may be helpful is a collection of punctuation symbols that is available in the string package:
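The collection in question is string.punctuation, a string containing the ASCII punctuation characters. The sketch below only shows how to look at it and test membership; how to use it in a tokenizer is left to you.

```python
import string

# string.punctuation is a single string of punctuation characters.
print(string.punctuation)

# Membership tests work as for any other string.
print("," in string.punctuation)   # True
print("(" in string.punctuation)   # True
print("a" in string.punctuation)   # False
```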
However, the problem does not end there: In some cases, we do not want the punctuation to be split off, because it is part of the word, like in "U.S.A." or in "Dr.". Can you extend your method to deal with (some of) these cases? Your approach need not be perfect (and approaches to processing natural language in some way mostly aren't), just as accurate as you can make it.