Counting words in a text

Suppose we wanted to count how often each word occurs in the English Wikipedia page for Monty Python. Then a natural language description of the algorithm (the method) could look approximately like this:
This algorithm uses:
Although this is a very simple algorithm, it uses almost all the main data types and types of commands that we will encounter in this introduction to Python. This worksheet introduces three of them:

Boolean expressions

Boolean expressions are expressions whose value is either True or False. They play an important role in conditions. You have already encountered some of them when we were discussing strings: "in", "startswith", and "endswith".

    >>> "eros" in "rhinoceros"
    True
    >>> "nose" in "rhinoceros"
    False
    >>> "truism".endswith("ism")
    True
    >>> "inconsequential".startswith("pro")
    False

Here are some Boolean expressions that compare numbers:

    >>> 3.1 >= 2.9
    True
    >>> 3.1 < 2.9
    False
    >>> 3.2 - 0.1 == 3.1
    True

We can compare expressions that yield a number using <, >, <=, and >=. "Equals" is expressed using ==, and "does not equal" is "!=". Note that it's "==" if we want to do a comparison. "=" assigns a value.
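
The difference between "=" and "==" can be seen in a short sketch. The variable names (favorite, is_rhino, is_mouse) are made up for this illustration.

```python
# "=" assigns: it stores a value under a name.
favorite = "rhinoceros"

# "==" compares: it yields True or False and changes nothing.
is_rhino = (favorite == "rhinoceros")
is_mouse = (favorite == "mouse")

print(is_rhino)             # True
print(is_mouse)             # False
print(favorite != "mouse")  # True
```
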

The same operators work for strings as well. What do you think they do?

    >>> "rhinoceros" != "rhino"
    True
    >>> "rhinoceros" == "rhino"
    False
    >>> "armadillo" < "rhinoceros"
    True
    >>> "elephant" > "mouse"
    False

The type of such expressions is 'bool':

    >>> type("elephant" > "mouse")
    <class 'bool'>

Conditions

In natural language, you express conditions using "if": "If this word is not yet on our count sheet, put it there". In Python, it is very similar:

    >>> if "mad" in "armadillo":
    ...     print( "yes." )
    ...
    yes.

Note the notation: the line starts with "if", followed by a Boolean expression and a colon. The commands that depend on the condition follow on the next lines, as an indented block. Or, to put it another way:

    if <Boolean>:
        block

This is very similar to the notation for defining your own functions, and in fact you will see the same notation again in other contexts.

Sometimes we have two different things that we want to do, depending on whether a Boolean expression is True or False. For example, the math package function log() gives you an error message if you try to compute the logarithm of a number that is zero or negative. This can be a problem for some formulas, and people often get around this problem by using 0 in their computation whenever the log is undefined.

    >>> import math
    >>> math.log(0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: math domain error
    >>> def log_or_zero(value):
    ...     if value > 0:
    ...         return math.log(value)
    ...     else:
    ...         return 0.0
    ...

So the general shape of if...else... commands is

    if <Boolean>:
        block
    else:
        block

If you want to use more than two cases, the command is if... elif... elif... else... The following function guesses at a word's part of speech based on its suffix. It guesses "noun" for words ending in "ion" or "ism", "verb" for words ending in "ed", "adjective" if a word ends in "able", and if it has no better idea, it will just guess "noun".

    >>> def simple_part_of_speech(word):
    ...     if word.endswith("ion"):
    ...         return "N"
    ...     elif word.endswith("ism"):
    ...         return "N"
    ...     elif word.endswith("ed"):
    ...         return "V"
    ...     elif word.endswith("able"):
    ...         return "A"
    ...     else:
    ...         return "N"

Try it for yourself:

Sometimes you need to combine multiple Boolean expressions. For example, suppose you wanted to define medium-length words as words with at least 4 and at most 10 letters. Here is a function that returns True for medium-length words, and False for others:

    >>> def is_medlength(word):
    ...     if len(word) >= 4 and len(word) <= 10:
    ...         return True
    ...     else:
    ...         return False
    ...
    >>> is_medlength("Python")
    True
    >>> is_medlength("to")
    False
    >>> is_medlength("disestablishmentarianism")
    False

The central point is:

    if len(word) >= 4 and len(word) <= 10:

There is also "or", as in:

    if len(word) < 4 or len(word) > 10:
        return False

The reserved word "not" flips the value of a Boolean expression:

    >>> "apple" == "orange"
    False
    >>> not("apple" == "orange")
    True

Try it for yourself: Write another version of the function that filters out function words. Instead of using if...elif...elif...else, use a single "if" condition, but combine Boolean expressions using "and" or "or".

Lists

A list is a sequence of items, like a shopping list. In Python, you write them with square brackets around them, and with commas between items:

    >>> shopping_list = ["eggs", "milk", "broccoli"]
    >>> len(shopping_list)
    3

You can make lists with only one item, or even no items on them:

    >>> unary = [ 123 ]
    >>> len(unary)
    1
    >>> type(unary)
    <class 'list'>
    >>> type(123)
    <class 'int'>
    >>> empty = [ ]
    >>> len(empty)
    0

As you can see, the list [123] with the number 123 as its single item is still a list, and has a different type from the number 123. And a note: It does not matter whether you write [123] or [ 123 ], and whether you write [] or [ ]. They are the same to Python. I like to add spaces for better readability.

You can even have a nested list, a list whose items are lists again:

    >>> nested = [ [1], ["a", "b"], []]

Lists and strings

There are many operations in Python that work for lists like they do for strings. Here are some of the most important ones:
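
A few operations that behave the same way for lists and for strings, as a minimal sketch. The contents of mylist are an assumption, inferred from the output of the append() example that follows. The names mystring and "dualism" are made up for this illustration.

```python
# Assumed list contents, inferred from the append() output in the text.
mylist = ["absurdism", "eudaimonism", "bipedalism", "bimetallism"]
mystring = "absurdism"

# Length
print(len(mylist), len(mystring))   # 4 9

# Indexing starts at 0 for both
print(mylist[0], mystring[0])       # absurdism a

# Slicing: from index 1 up to, but not including, index 3
print(mylist[1:3])                  # ['eudaimonism', 'bipedalism']
print(mystring[1:3])                # bs

# Membership test with "in"
print("bipedalism" in mylist)       # True
print("surd" in mystring)           # True

# Concatenation with "+" builds a new object; mylist itself is unchanged
longer_list = mylist + ["dualism"]
longer_string = mystring + "s"
print(len(longer_list), len(mylist))   # 5 4
```
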

    >>> mylist.append("malapropism")
    >>> mylist
    ['absurdism', 'eudaimonism', 'bipedalism', 'bimetallism', 'malapropism']

An important function that we have mentioned before is split(). It works on strings and splits them up into a list of substrings. If given no arguments, it splits on whitespace.

    >>> "this is a sentence".split()
    ['this', 'is', 'a', 'sentence']

Try it out for yourself:
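
For instance, split() also accepts a separator string as its argument. The following sketch uses made-up example strings to compare the two forms.

```python
# No argument: split on any run of whitespace.
words = "this  is a   sentence".split()
print(words)        # ['this', 'is', 'a', 'sentence']

# With a separator argument: split exactly at that string.
items = "eggs,milk,broccoli".split(",")
print(items)        # ['eggs', 'milk', 'broccoli']

# The result is an ordinary list, so list operations apply.
print(len(items))   # 3
print(items[0])     # eggs
```
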

Loops

Up to now, we have only accessed individual items on a list by using their indices. But one of the most natural things to do with a list is to repeat some action for each item on the list, for example: “For each word in the given list of words: print it”. Here is how to say this in Python:

    >>> my_list = [ "ngram", "isogram", "cladogram", "pangram"]
    >>> for word in my_list:
    ...     print( word )

There are three things to note here. First, the reserved word that signals repetition is "for". Second, the overall shape of the "for" loop is very similar to that of a condition and of a function definition: It uses a colon at the end of the line, and an indented block.

    for <variable> in <something>:
        block

Third, “word” is a variable. You could have chosen a different variable name, of course:

    >>> for abcd123 in my_list:
    ...     print( abcd123 )

The variable in the loop is like the variable in a function definition: You don't need to specify its contents beforehand. In fact, whatever was in abcd123 before the loop gets overwritten:

    >>> abcd123 = "hello"
    >>> for abcd123 in ["ngram", "isogram", "cladogram", "pangram"]:
    ...     print( abcd123 )
    ...
    ngram
    isogram
    cladogram
    pangram
    >>> abcd123
    'pangram'

Typically you will choose as the loop variable one that you haven't used before. In the loop, Python fills it with each item on the list in turn. First, it puts "ngram" in abcd123. This, then, is printed within the block. Then it puts "isogram" into abcd123, and the block is executed with this value of abcd123. In the third execution of the loop, abcd123 is "cladogram", and the fourth time, it is "pangram". Then the list is exhausted, and the loop is done.

Here is another example of a for-loop.

    >>> my_list = ["ngram", "isogram", "cladogram", "pangram"]
    >>> counter = 0
    >>> for whatever in my_list:
    ...     counter = counter + 1
    ...
    >>> counter
    4

This code does the same thing as len(my_list). But it illustrates a general pattern that you will see very often: You initialize a counter (here: to zero), then you iterate over the list, and change the counter. Here is another example, which adds up all the numbers on a list.

    >>> numberlist = [345, 52, 1034, 79421]
    >>> mysum = 0
    >>> for number in numberlist:
    ...     mysum = mysum + number
    ...
    >>> mysum
    80852

This code does the same as sum(numberlist).

Try it for yourself:
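
A possible variation on the counter pattern, as a minimal sketch: count only the items that satisfy a condition. The word list is the one from the examples above. The name long_word_count and the threshold of 7 letters are chosen arbitrarily for illustration.

```python
my_list = ["ngram", "isogram", "cladogram", "pangram"]

# Initialize the counter, then iterate over the list and
# change the counter only when the condition holds.
long_word_count = 0
for word in my_list:
    if len(word) >= 7:
        long_word_count = long_word_count + 1

print(long_word_count)   # 3
```
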

Here is another useful and frequent code pattern. Often, we want to collect results in a list as we go through a for-loop. For example, we may want to collect all uppercase words from a given text. The following text is from the Wikipedia page on Monty Python.

    >>> mytext = """The Python programming language by Guido van Rossum is
named after the troupe, and Monty Python references are often found in
sample code created for that language. Additionally, a 2001 April Fool's
Day joke by van Rossum and Larry Wall involving the merger of Python
with Perl was dubbed "Parrot" after the Dead Parrot Sketch. The name
"Parrot" was later used for a project to develop a virtual machine for
running bytecode for interpreted languages such as Perl and Python.
Also, the Jet Propulsion Laboratory wrote some spacecraft navigation
software in Python, which they dubbed "Monty". There is also a python
refactoring tool called bicyclerepair ( [1] ), named after Bicycle
Repair Man sketch."""
    >>> words = mytext.split()
    >>>
    >>> uppercase_words = [ ]
    >>> for word in words:
    ...     if word[0].isupper():
    ...         uppercase_words.append(word)
    ...
    >>> uppercase_words
    ['The', 'Python', 'Guido', 'Rossum', 'Monty', 'Python', 'Additionally,',
    'April', "Fool's", 'Day', 'Rossum', 'Larry', 'Wall', 'Python', 'Perl',
    'Dead', 'Parrot', 'Sketch.', 'The', 'Perl', 'Python.', 'Also,', 'Jet',
    'Propulsion', 'Laboratory', 'Python,', 'There', 'Bicycle', 'Repair',
    'Man']

This code first splits the text into words. It then initializes a list uppercase_words, in which we want to collect results. Initially, that list is empty. We then iterate through all words of our text and check if they start with an uppercase letter, that is, if word[0] is a string consisting entirely of uppercase letters (see http://docs.python.org/3/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange for the use of isupper()). If that is the case, we append word to our collector list uppercase_words.

Try it for yourself: Collect all words that end in "t" in the same text mytext that we just used.

Ranges

Suppose you wanted to have a list of the form [0, 1, 2, 3, 4], that is, a series of consecutive numbers. Then you can do that as before:

    >>> my_list = [0,1,2,3,4]

But since this is a kind of data structure that is needed relatively often, Python has a shortcut for this:

    >>> my_range = range(5)
    >>> my_range
    range(0, 5)
    >>> list(my_range)
    [0, 1, 2, 3, 4]

range(n) yields a range (something similar to a list) that starts at 0 and ends at n-1 (not n!). This is like with list slices: Remember that my_list[1:4] gave you the part of the list that started at index 1 and ended at index 3. You can also use range() with two parameters instead of one. range(j, k) yields the numbers from j to k-1:

    >>> list(range(20, 23))
    [20, 21, 22]

And what does this do? And why?

    >>> list(range(20, 30, 2))

You can use range() to count to ten:

    >>> for num in range(1, 11):
    ...     print( num )

Try it for yourself: How can you use range() to sum up the numbers from 1 to 20?

You can also use range() to access the position of items on a list.

    >>> mylist = [ "coble", "noble", "roble" ]
    >>> for i in range(len(mylist)):
    ...     print( i, mylist[ i ] )
    ...
    0 coble
    1 noble
    2 roble

Here is what this does: range(len(mylist)) is range(3), which corresponds to the list [0, 1, 2]. The for-loop iterates over the numbers 0, 1, and 2.
For each of them, it prints the number, then the list item with that index.

A more complex task

When we process text, the first step is almost always to break it up into words, which we can then count, label, or otherwise process. The Python string function split() gets us most of the way there. It splits up text on whitespace. But the result is not perfect:

    >>> sentence = "Computational linguistics (sometimes also referred to as natural language processing, NLP) is a highly interdisciplinary area."
    >>> sentence.split()
    ['Computational', 'linguistics', '(sometimes', 'also', 'referred', 'to',
    'as', 'natural', 'language', 'processing,', 'NLP)', 'is', 'a', 'highly',
    'interdisciplinary', 'area.']

Some of the items on the resulting list are words with punctuation attached to them, for example "(sometimes" or "processing,". This is a problem: When we count words, "(sometimes" and "sometimes" count as different strings, which is not what we want. Instead, we would like punctuation to be separate, like this:

    ['Computational', 'linguistics', '(', 'sometimes', 'also', 'referred',
    'to', 'as', 'natural', 'language', 'processing', ',', 'NLP', ')', 'is',
    'a', 'highly', 'interdisciplinary', 'area', '.']

Write a tokenizer that splits text into words and puts punctuation into separate strings. One Python data structure that may be helpful is a collection of punctuation symbols that is available in the string package:

    >>> import string
    >>> string.punctuation
    '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

However, the problem does not end there: In some cases, we do not want the punctuation to be split off, because it is part of the word, like in "U.S.A." or in "Dr.". Can you extend your method to deal with (some of) these cases? Your approach need not be perfect (and approaches to processing natural language in some way mostly aren't), just as accurate as you can make it.
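
One possible starting point for such a tokenizer, as a rough sketch rather than a finished solution. It splits on whitespace and then separates punctuation at the beginning and end of each item. The function name tokenize and the list KNOWN_ABBREVIATIONS are hypothetical names introduced here, and the abbreviation list is only a tiny made-up sample.

```python
import string

# Hypothetical sample of abbreviations that should stay in one piece.
KNOWN_ABBREVIATIONS = ["U.S.A.", "Dr."]

def tokenize(text):
    """Split text on whitespace, then separate punctuation at both ends of each item."""
    tokens = []
    for item in text.split():
        # What remains after removing punctuation from both ends of the item.
        core = item.strip(string.punctuation)
        if item in KNOWN_ABBREVIATIONS or core == item:
            # Known abbreviation, or nothing to split off.
            tokens.append(item)
        elif core == "":
            # The item consists of punctuation only: one token per symbol.
            tokens.extend(item)
        else:
            # Number of punctuation symbols at the start of the item.
            start = len(item) - len(item.lstrip(string.punctuation))
            end = start + len(core)
            tokens.extend(item[:start])   # leading punctuation, one symbol each
            tokens.append(core)
            tokens.extend(item[end:])     # trailing punctuation, one symbol each
    return tokens

sentence = "Computational linguistics (sometimes also referred to as natural language processing, NLP) is a highly interdisciplinary area."
print(tokenize(sentence))
print(tokenize("Dr. Who"))   # ['Dr.', 'Who']
```

This sketch leaves word-internal punctuation alone, and it only recognizes an abbreviation when the item matches a list entry exactly. An item such as "U.S.A.," (abbreviation followed by a comma) is still split, so there is room for improvement.
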