Worksheet: Python conditions, lists, and loops
Counting words in a text
Suppose we wanted to count how often each word occurs in the English Wikipedia page for Monty Python. Then a natural language description of the algorithm (the method) could look approximately like this:
Split the text into words.
Start a sheet of counts.
Start at the first word, and for each word, do the following:
If the word is already on the counts sheet, add a dash next to it.
If the word is not on the counts sheet, put it there, and put a single dash next to it.
This algorithm uses:
Different data types:
strings (words), integers (counts), a list of words (the text, after splitting), and a mapping from words to counts, a data type that in Python is called a "dictionary"
Operations on numbers, in this case, adding
Operations on strings, in this case, splitting a text into words and comparing a word to others on the sheet
Conditions: "If the word is already on the counts sheet..."
Repetition: "for each word, do the following"
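The steps of the counting algorithm above can be sketched in Python roughly as follows. This is only a preview under a few assumptions: the function name count_words and the sample sentence are made up for illustration, and the dictionary data type it relies on is not covered in this worksheet.

```python
def count_words(text):
    # Split the text into words (on whitespace).
    words = text.split()
    # Start a sheet of counts: an empty dictionary mapping words to counts.
    counts = {}
    # Start at the first word, and for each word, do the following:
    for word in words:
        if word in counts:
            # The word is already on the sheet: add a dash.
            counts[word] = counts[word] + 1
        else:
            # The word is not on the sheet yet: put it there with one dash.
            counts[word] = 1
    return counts

sample_counts = count_words("the parrot is no more the parrot has ceased to be")
print(sample_counts["parrot"])   # prints 2
print(sample_counts["ceased"])   # prints 1
```

Each line of the natural language description corresponds to one or two lines of code, which is typical for simple algorithms like this one.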
Although this is a very simple algorithm, it uses almost all the main data types and types of commands that we will encounter in this introduction to Python. This worksheet introduces three of them:
Conditions ("if the word is already on the counts sheet...")
Lists (for example, lists of words), and
Loops ("for each word, do the following").
Boolean expressions
Boolean expressions are expressions whose value is either True or False. They play an important role in conditions. You have already encountered some of them when we were discussing strings: "in", "startswith", and "endswith".
>>> "eros" in "rhinoceros"
True
>>> "nose" in "rhinoceros"
False
>>> "truism".endswith("ism")
True
>>> "inconsequential".startswith("pro")
False
Here are some Boolean expressions that compare numbers:
>>> 3.1 >= 2.9
True
>>> 3.1 < 2.9
False
>>> 3.2 - 0.1 == 3.1
True
>>> def celsius_2_fahrenheit(temp):
...     return temp * 9/5 + 32
...
>>> celsius_2_fahrenheit(20) > 70
False
We can compare expressions that yield a number using <, >, <=, and >=. "Equals" is expressed using ==, and "does not equal" is "!=". Note that it's "==" if we want to do a comparison. "=" assigns a value.
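The difference between "=" and "==" is easy to mix up, so here is a small sketch of both side by side. The variable names count and is_three are chosen just for this illustration.

```python
count = 3                 # "=" assigns: the variable count now holds 3
print(count == 3)         # "==" compares: prints True
print(count != 3)         # prints False
print(count == 4)         # prints False; count is still 3
is_three = (count == 3)   # the result of a comparison can itself be assigned
print(is_three)           # prints True
```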
The same operators work for strings as well. What do you think they do?
>>> "rhinoceros" != "rhino"
True
>>> "rhinoceros" == "rhino"
False
>>> "armadillo" < "rhinoceros"
True
>>> "elephant" > "mouse"
False
The type of such expressions is 'bool':
>>> type("elephant" > "mouse")
<class 'bool'>
Conditions
In natural language, you express conditions using "if": "If this word is not yet on our count sheet, put it there". In Python, it is very similar:
>>> if "mad" in "armadillo":
...     print( "yes." )
...
yes.
Note the notation:
"if"
then a Boolean expression
then a colon
then, indented, what to do if the Boolean expression is True.
Or, to put it another way:
if <Boolean>:
    block
This is very similar to the notation for defining your own functions, and in fact you will see the same notation again in other contexts.
Sometimes we have two different things that we want to do, depending on whether a Boolean expression is True or False. For example, the math package function log() gives you an error message if you try to compute the logarithm of a number that is zero or negative. This can be a problem for some formulas, and people often get around this problem by using 0 in their computation whenever the log is undefined.
>>> import math
>>> math.log(0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: math domain error
>>> def log_or_zero(value):
...     if value > 0:
...         return math.log(value)
...     else:
...         return 0.0
...
>>> log_or_zero(0)
0.0
>>> log_or_zero(0.333)
-1.0996127890016931
So the general shape of if...else... commands is
if <Boolean>:
    block
else:
    block
Note the indentation levels!
If you want to use more than two cases, the command is if... elif... elif... else... The following function guesses at a word's part of speech based on its suffix. It guesses "noun" for words ending in "ion" or "ism", "verb" for words ending in "ed", "adjective" if a word ends in "able", and if it has no better idea, it will just guess "noun".
>>> def simple_part_of_speech(word):
...     if word.endswith("ion"):
...         return "N"
...     elif word.endswith("ism"):
...         return "N"
...     elif word.endswith("ed"):
...         return "V"
...     elif word.endswith("able"):
...         return "A"
...     else:
...         return "N"
...
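Here is a short sketch of how this function behaves, written as a script rather than at the prompt. The test words are chosen here for illustration. Python checks the conditions from top to bottom and runs the block of the first one that is True, so later branches are not looked at once an earlier one has matched.

```python
def simple_part_of_speech(word):
    if word.endswith("ion"):
        return "N"
    elif word.endswith("ism"):
        return "N"
    elif word.endswith("ed"):
        return "V"
    elif word.endswith("able"):
        return "A"
    else:
        return "N"

print(simple_part_of_speech("absurdism"))    # prints N
print(simple_part_of_speech("walked"))       # prints V
print(simple_part_of_speech("readable"))     # prints A
print(simple_part_of_speech("rhinoceros"))   # prints N (the fallback guess)
print(simple_part_of_speech("table"))        # prints A: the guess is wrong here
```

The last call shows the limits of guessing from suffixes alone: "table" ends in "able" but is a noun.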
Try it for yourself:
Make a function that takes 2 numbers as arguments and returns the larger one. Use "if" to do this.
Make a function that filters out function words: It should take a string as an argument. If the word is "the", "a", "in", or "and", it should return None (a special value in Python, not "None"!). Otherwise it should return the word unchanged.
Make a function that tries to identify nouns: It should take a string as an argument. If that string starts with an uppercase letter, or if it ends in "ion" or "ism", it should return True, otherwise False.
Sometimes you need to combine multiple Boolean expressions. For example, suppose you wanted to define medium-length words as words with at least 4 and at most 10 letters. Here is a function that returns True for medium-length words, and False for others:
>>> def is_medlength(word):
...     if len(word) >= 4 and len(word) <= 10:
...         return True
...     else:
...         return False
...
>>> is_medlength("Python")
True
>>> is_medlength("to")
False
>>> is_medlength("disestablishmentarianism")
False
The central point is: if len(word) >= 4 and len(word) <= 10
There is also "or", as in: if len(word) < 4 or len(word) > 10: return False
The reserved word "not" flips the value of a Boolean expression:
>>> "apple" == "orange"
False
>>> not("apple" == "orange")
True
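The three reserved words "and", "or", and "not" can express the same test in different ways. The following sketch, under the assumption that we want exactly the medium-length test from above, shows two equivalent versions. The name is_medlength_with_not is made up for this illustration. Because a Boolean expression already has the value True or False, both versions return it directly instead of using if...else.

```python
def is_medlength(word):
    # At least 4 letters and at most 10 letters.
    return len(word) >= 4 and len(word) <= 10

def is_medlength_with_not(word):
    # Same test, phrased as "not (too short or too long)".
    return not (len(word) < 4 or len(word) > 10)

print(is_medlength("Python"), is_medlength_with_not("Python"))   # prints True True
print(is_medlength("to"), is_medlength_with_not("to"))           # prints False False
print(is_medlength("disestablishmentarianism"),
      is_medlength_with_not("disestablishmentarianism"))         # prints False False
```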
Try it for yourself: Write another version of the function that filters out function words. Instead of using if... elif... elif... else, use a single "if" condition, but combine Boolean expressions using "and" or "or".
Lists
A list is a sequence of items, like a shopping list. In Python, you write them with square brackets around them, and with commas between items:
>>> shopping_list = ["eggs", "milk", "broccoli"]
>>> len(shopping_list)
3
You can make lists with only one item, or even no items on them:
>>> unary = [ 123 ]
>>> len(unary)
1
>>> type(unary)
<class 'list'>
>>> type(123)
<class 'int'>
>>> empty = [ ]
>>> len(empty)
0
As you can see, the list [123] with the number 123 as its single item is still a list, and has a different type from the number 123. And a note: It does not matter whether you write [123] or [ 123 ], and whether you write [] or [ ]. They are the same to Python. I like to add spaces for better readability.
You can even have a nested list, a list whose items are lists again:
>>> nested = [ [1], ["a", "b"], []]
What do you think is len(nested) ?
Lists and strings
There are many operations in Python that work for lists like they do for strings. Here are some of the most important ones:
We have used indices to address individual characters on a string, and slices to carve out substrings. (See the Worksheet: First steps in Python.) You can use indices and slices in the same way on lists.
Here is how indices can be used with lists:
>>> mylist = ["acrimonious", "acarus", "caucus"]
>>> mylist[0]
'acrimonious'
>>> mylist[1:3]
['acarus', 'caucus']
>>> mylist[1:2]
['acarus']
>>> mylist[2:]
['caucus']
Now assume that
>>> mylist = ["absurdism", "antiferromagnetism", "bipedalism", "bimetallism"]
Can you access just the word "bipedalism" by using an index on mylist? Can you access just the sublist ["absurdism", "antiferromagnetism"]?
But you can use indices on a list to do something that you cannot do with an index on a string: Change an item on a list. Here is an example:
>>> mylist[1] = "eudaimonism"
>>> mylist
['absurdism', 'eudaimonism', 'bipedalism', 'bimetallism']
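To see the difference between lists and strings in action, here is a small sketch. It uses try...except, which is not covered in this worksheet, only so that the script keeps running after the error. The variable names word and new_word are chosen for this illustration.

```python
mylist = ["absurdism", "antiferromagnetism", "bipedalism", "bimetallism"]
mylist[1] = "eudaimonism"      # fine: lists can be changed in place
print(mylist[1])               # prints eudaimonism

word = "bipedalism"
try:
    word[0] = "B"              # not allowed: strings cannot be changed in place
except TypeError as error:
    print("TypeError:", error)

# To "change" a string, build a new one from slices instead.
new_word = "B" + word[1:]
print(new_word)                # prints Bipedalism
print(word)                    # prints bipedalism (the original is unchanged)
```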
Using "+" on strings concatenates them, for example
>>> "hello " + "world"
'hello world'
What do you think will happen when you do the following?
>>> mylist + ["catastrophism", "catabolism"]
len() works on lists and on strings
You have used in to check for substrings:
>>> "tall" in "bimetallism"
True
You can also use in to check for items on a list:
>>> "absurdism" in mylist
True
>>> "grangerism" in mylist
False
Here is a function that works on lists but not on strings: You can append another item to the end of a list using the function append().
>>> mylist.append("malapropism")
>>> mylist
['absurdism', 'eudaimonism', 'bipedalism', 'bimetallism', 'malapropism']
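One detail about append() that is easy to trip over: it changes the list itself and does not hand back a new list. A minimal sketch, with a shortened list and a variable named result made up for this illustration:

```python
mylist = ["absurdism", "eudaimonism"]
result = mylist.append("malapropism")   # changes mylist itself
print(mylist)    # prints ['absurdism', 'eudaimonism', 'malapropism']
print(result)    # prints None: append() does not return the new list

# A list can also be appended as a single item, which gives a nested list.
mylist.append(["a", "b"])
print(len(mylist))   # prints 4
```

So writing mylist = mylist.append(...) would replace your list with None, which is almost never what you want.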
An important function that we have mentioned before is split(). It works on strings and splits them up into a list of substrings. If given no arguments, it splits on whitespace.
>>> "this is a sentence".split()
['this', 'is', 'a', 'sentence']
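The following sketch shows split() with and without an argument. The example strings and variable names are made up for illustration.

```python
# With no argument, any run of whitespace (spaces, tabs, newlines) separates items.
whitespace_items = "this   is\ta\nsentence".split()
print(whitespace_items)   # prints ['this', 'is', 'a', 'sentence']

# With an argument, the string is split exactly at that separator.
comma_items = "eggs,milk,broccoli".split(",")
print(comma_items)        # prints ['eggs', 'milk', 'broccoli']

# The separator is removed, but everything else stays, including spaces.
spaced_items = "eggs, milk".split(",")
print(spaced_items)       # prints ['eggs', ' milk']
```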
Try it out for yourself:
What are ways to exchange the 3rd item on mylist for "literalism"? Find at least 2 different commands that do this.
Suppose
nested = [ [1], ["a", "b"], []]
How can you determine the length of the list that is the second item on nested (that is, the list ["a", "b"]) ?
Can you change nested to be [[2], ['a', 'b'], []] ?
Use Python to determine how many words there are in the following first lines of a poem (by Lewis Carroll):
"""They told me you had been to her
And mentioned me to him;
She gave me a good character,
But said I could not swim. """
Loops
Up to now, we have only accessed individual items on a list by using their indices. But one of the most natural things to do with a list is to repeat some action for each item on the list, for example: “For each word in the given list of words: print it”.
Here is how to say this in Python:
>>> my_list = [ "ngram", "isogram", "cladogram", "pangram"]
>>> for word in my_list:
...     print( word )
There are three things to note here. First, the reserved word that signals repetition is "for".
Second, the overall shape of the "for" loop is very similar to that of a condition and of a function definition: It uses a colon at the end of the line, and an indented block.
for <variable> in <something>:
    block
Third, “word” is a variable. You could have chosen a different variable name, of course:
>>> for abcd123 in my_list:
...     print( abcd123 )
The variable in the loop is like the variable in a function definition: You don't need to specify its contents beforehand. In fact, whatever was in abcd123 before the loop gets erased:
>>> abcd123 = "hello"
>>> for abcd123 in ["ngram", "isogram", "cladogram", "pangram"]:
...     print( abcd123 )
...
ngram
isogram
cladogram
pangram
>>> abcd123
'pangram'
Typically you will choose as the loop variable one that you haven't used before. In the loop, Python fills it with each item on the list in turn. First, it puts "ngram" in abcd123. This, then, is printed within the block. Then it puts "isogram" into abcd123, and the block is executed with this value of abcd123. In the third execution of the loop, abcd123 is "cladogram", and the fourth time, it is "pangram". Then the list is exhausted, and the loop is done.
Here is another example of a for-loop.
>>> my_list = ["ngram", "isogram", "cladogram", "pangram"]
>>> counter = 0
>>> for whatever in my_list:
...     counter = counter + 1
...
>>> counter
4
This code does the same thing as len(my_list). But it illustrates a general pattern that you will see very often: You initialize a counter (here: to zero), then you iterate over the list, and change the counter. Here is another example, which adds up all the numbers on a list.
>>> numberlist = [345, 52, 1034, 79421]
>>> mysum = 0
>>> for number in numberlist:
...     mysum = mysum + number
...
>>> mysum
80852
This code does the same as sum(numberlist).
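The same pattern (initialize, iterate, update) also works when the thing you keep track of is not a number, and when the update only happens under a condition. As a sketch, here is a loop that finds the longest word on a list. The variable name longest is chosen for this illustration.

```python
my_list = ["ngram", "isogram", "cladogram", "pangram"]
longest = ""                      # initialize with the shortest possible string
for word in my_list:
    if len(word) > len(longest):  # update only if this word beats the current best
        longest = word
print(longest)                    # prints cladogram
```

Note the two indentation levels: the "if" belongs to the block of the "for" loop, and the assignment belongs to the block of the "if".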
Try it for yourself:
Here is a list of words:
[ "candygram", "preprogram", "picogram"]
How many a's do the words on this list contain, taken together?
Let's use again the poem lines we used before:
"""They told me you had been to her
And mentioned me to him;
She gave me a good character,
But said I could not swim. """
Can you count how often the word "me" occurs in this text?
Here is another useful and frequent code pattern. Often, we want to collect results in a list as we go through a for-loop. For example, we may want to collect all uppercase words from a given text. The following text is from the Wikipedia page on Monty Python.
>>> mytext = """The Python programming language by Guido van Rossum is named after the troupe, and Monty Python references are often found in sample code created for that language. Additionally, a 2001 April Fool's Day joke by van Rossum and Larry Wall involving the merger of Python with Perl was dubbed "Parrot" after the Dead Parrot Sketch. The name "Parrot" was later used for a project to develop a virtual machine for running bytecode for interpreted languages such as Perl and Python. Also, the Jet Propulsion Laboratory wrote some spacecraft navigation software in Python, which they dubbed "Monty". There is also a python refactoring tool called bicyclerepair ( [1] ), named after Bicycle Repair Man sketch."""
>>> words = mytext.split()
>>>
>>> uppercase_words = [ ]
>>> for word in words:
...     if word[0].isupper():
...         uppercase_words.append(word)
...
>>> uppercase_words
['The', 'Python', 'Guido', 'Rossum', 'Monty', 'Python', 'Additionally,', 'April', "Fool's", 'Day', 'Rossum', 'Larry', 'Wall', 'Python', 'Perl', 'Dead', 'Parrot', 'Sketch.', 'The', 'Perl', 'Python.', 'Also,', 'Jet', 'Propulsion', 'Laboratory', 'Python,', 'There', 'Bicycle', 'Repair', 'Man']
This code first splits the text into words. It then initializes a list uppercase_words, in which we want to collect results. Initially, that list is empty. We then iterate through all words of our text and check whether they start with an uppercase letter, that is, whether word[0] is a string consisting entirely of uppercase letters (see http://docs.python.org/3/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange for the use of isupper()). If that is the case, we append word to our collector list uppercase_words.
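It matters that the test is applied to word[0] and not to the whole word. The following sketch shows the difference, and also why the quoted word "Parrot" from the text did not end up in uppercase_words. The variable names are made up for this illustration.

```python
first_is_upper = "Python"[0].isupper()
print(first_is_upper)      # prints True: the first character is "P"

whole_is_upper = "Python".isupper()
print(whole_is_upper)      # prints False: not every letter is uppercase

acronym_is_upper = "NLP".isupper()
print(acronym_is_upper)    # prints True

quoted_is_upper = '"Parrot"'[0].isupper()
print(quoted_is_upper)     # prints False: the first character is a quotation mark
```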
Try it for yourself:
Collect all words that end in "t" in the same text mytext that we just used.
Ranges
Suppose you wanted to have a list of the form [0, 1, 2, 3, 4], that is, a series of consecutive numbers. Then you can do that as before:
>>> my_list = [0,1,2,3,4]
But since this is a kind of data structure that is needed relatively often, Python has a shortcut for this:
>>> my_range = range(5)
>>> my_range
range(0, 5)
>>> list(my_range)
[0, 1, 2, 3, 4]
range(n) yields a range (something similar to a list) that starts at 0 and ends at n-1 (not n!). This is like with list slices: Remember that my_list[1:4] gave you the part of the list that started at index 1 and ended at index 3.
You can also use range() with two parameters instead of one.
range(j, k) yields the numbers from j to k-1:
>>> list(range(20, 23))
[20, 21, 22]
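A few quick checks can make the "end point is not included" rule concrete. This is a sketch with a made-up list named letters. It also uses "in" and len() on ranges, which work the same way as they do on lists.

```python
print(list(range(5)) == list(range(0, 5)))   # prints True
print(len(range(20, 23)))                    # prints 3, that is, 23 - 20
print(22 in range(20, 23))                   # prints True
print(23 in range(20, 23))                   # prints False: the end point is excluded

letters = ["a", "b", "c", "d", "e"]
print(letters[1:4])                          # prints ['b', 'c', 'd']
print(list(range(1, 4)))                     # prints [1, 2, 3], the indices of that slice
```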
And what does this do? And why?
>>> list(range(20, 30, 2))
You can use range() to count to ten:
>>> for num in range(1, 11):
...     print( num )
Try it for yourself:
How can you use range() to sum up the numbers from 1 to 20?
You can also use range() to access the position of items on a list.
>>> mylist = [ "coble", "noble", "roble" ]
>>> for i in range(len(mylist)):
...     print( i, mylist[ i ] )
...
0 coble
1 noble
2 roble
Here is what this does: range(len(mylist)) is range(3), which corresponds to the list [0, 1, 2]. The for-loop iterates over the numbers 0, 1, and 2. For each of them, it prints the number, then the list item with that index.
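One situation where you need the positions, and not just the items, is when you want to change the items on a list while looping over it. Here is a sketch that assumes we want to turn every word into uppercase, using the string function upper(). The second list, otherlist, is made up for comparison.

```python
mylist = ["coble", "noble", "roble"]
for i in range(len(mylist)):
    mylist[i] = mylist[i].upper()   # replace the item at position i
print(mylist)                        # prints ['COBLE', 'NOBLE', 'ROBLE']

# Assigning to the loop variable does not change the list.
otherlist = ["coble", "noble", "roble"]
for word in otherlist:
    word = word.upper()             # only the variable word changes
print(otherlist)                     # prints ['coble', 'noble', 'roble']
```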
A more complex task
When we process text, the first step is almost always to break it up into words, which we can then count, label, or otherwise process. The Python string function split() gets us most of the way there. It splits up text on whitespace. But the result is not perfect:
>>> sentence = "Computational linguistics (sometimes also referred to as natural language processing, NLP) is a highly interdisciplinary area."
>>> sentence.split()
['Computational', 'linguistics', '(sometimes', 'also', 'referred', 'to', 'as', 'natural', 'language', 'processing,', 'NLP)', 'is', 'a', 'highly', 'interdisciplinary', 'area.']
Some of the items on the resulting list are words with punctuation attached to them, for example "(sometimes" or "processing,". This is a problem: When we count words, "(sometimes" and "sometimes" count as different strings -- which is not what we want. Instead, we would like punctuation to be separate, like this:
[ 'Computational', 'linguistics', '(', 'sometimes', 'also', 'referred', 'to', 'as', 'natural', 'language', 'processing', ',', 'NLP', ')', 'is', 'a', 'highly', 'interdisciplinary', 'area', '.' ]
Write a tokenizer that splits text into words and puts punctuation into separate strings. One Python data structure that may be helpful is a collection of punctuation symbols that is available in the string package:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
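Since string.punctuation is a string, you can use "in" to test whether a single character is a punctuation symbol. The sketch below is only a hint and not a solution to the task. The variable names token and last_char are made up for this illustration.

```python
import string

comma_is_punct = "," in string.punctuation
print(comma_is_punct)        # prints True

letter_is_punct = "a" in string.punctuation
print(letter_is_punct)       # prints False

token = "processing,"
last_char = token[-1]        # the index -1 addresses the last character
print(last_char in string.punctuation)   # prints True
```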
However, the problem does not end there: In some cases, we do not want the punctuation to be split off, because it is part of the word, like in "U.S.A." or in "Dr.". Can you extend your method to deal with (some of) these cases? Your approach need not be perfect (and approaches to processing natural language in some way mostly aren't), just as accurate as you can make it.