
List and loop problems

Filtering stop words, version 1

Sometimes we need to remove so-called stop words before further processing our text. These are typically high-frequency function words, though some stop word lists also include some high-frequency content words. For example, we may want to eliminate these extremely high-frequency items before counting words or before trying to determine collocations from text.

Assume that there are only three stop words that you want to eliminate:

    "the", "a", "and"

Assume the following input text (from Charles Dack, "Weather and Folk Lore of Peterborough and District"):

This is a continuation of a Paper on the "Survival of Old Customs" in
Peterborough and the neighbourhood which was read at the Royal
Archæological Society's meeting in 1898, with an addition of a few more
old customs, and more particulars of others, to which I have also added
a collection of the quaint Weather and Folk Lore of this district. Being
at a point where four counties are almost within a stone's throw,
Peterborough possesses the traditions of the Counties of Huntingdon,
Cambridge, and Lincoln, as well as Northampton. It is rather difficult
to locate these sayings to one particular County, so I have taken those
current within a radius of about fifteen miles.

Split this text into words. Then write a loop that iterates over the words in the text. If a word is not equal to "the", "a", or "and", print it.
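The general pattern can be sketched on a much shorter text. This is only a minimal illustration: the sample sentence is made up for the sketch and is not the paragraph above, and the comparison is case-sensitive, so a capitalized "The" would not be filtered.

```python
# Short sample sentence (an assumption for this sketch, not the worksheet text)
text = "the cat and the dog saw a bird"

# split() with no arguments splits on any run of whitespace
words = text.split()

for word in words:
    # print the word only if it differs from all three stop words
    if word != "the" and word != "a" and word != "and":
        print(word)
```

With this sample sentence, the loop prints cat, dog, saw, and bird, one per line.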

Filtering stop words, version 2

The problem is the same as in the previous problem. But instead of printing the non-stop words, we want to save them in a new list. Here is a code outline. Please fill in the missing bits.

# store in a variable called words
# the text from the previous problem, split into words

# we start an empty list. During the course of the loop,
# words that are not filtered will be put in this list
filtered_words = [ ]

# main loop
for word in words:
    # if this word is not a stop word, store it in filtered_words
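One possible way to complete an outline like this uses the list method append(). The sketch below runs on a short made-up sentence (an assumption, not the worksheet text), so the exercise with the real paragraph is still yours to do.

```python
# Short sample sentence (an assumption for this sketch), split into words
words = "the cat and the dog saw a bird".split()

# start with an empty list; words that survive the filter are added to it
filtered_words = []

for word in words:
    # if this word is not one of the three stop words, store it
    if word != "the" and word != "a" and word != "and":
        filtered_words.append(word)

print(filtered_words)
```

After the loop, filtered_words holds the remaining words in their original order.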

Filtering stop words, version 3

The problem is still the same as before: We want to eliminate stop words in the first paragraph of "Weather and Folk Lore of Peterborough and District". However, this time we want to use a larger set of stop words.

  • Do a web search for "stop words" and find a collection of stop words that you can use. Download this collection as a file.
  • Use Python file access methods to read this file into a Python list, such that each stop word is an element of the list.
  • Adapt your program from the problem above so that it now does the following:
    For each word in the text, it should test whether the word is in the list of stop words.
    If it is not, then the word should be stored in filtered_words.

Remember the Python construct "in", illustrated here. It will be useful for this problem.

mylist = [1,2,3]
if 2 in mylist:
    print("yes")
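Reading a stop word file and testing membership with "in" might look roughly like the following sketch. It assumes a file with one stop word per line. The file name stopwords.txt, its four-word contents, and the sample sentence are all stand-ins invented for this illustration; a downloaded list may be formatted differently.

```python
import os
import tempfile

# Create a small stand-in stop word file, one word per line.
# (File name and contents are assumptions for this sketch; in the
# exercise you would open the file you downloaded instead.)
path = os.path.join(tempfile.mkdtemp(), "stopwords.txt")
with open(path, "w") as f:
    f.write("the\na\nand\nof\n")

# Read the file into a list, removing the newline at the end of
# each line and skipping blank lines.
stopwords = []
with open(path) as f:
    for line in f:
        line = line.strip()
        if line:
            stopwords.append(line)

# Sample sentence (also an assumption), filtered with "not in"
words = "the cat and the dog saw a bird of prey".split()
filtered_words = []
for word in words:
    if word not in stopwords:
        filtered_words.append(word)

print(filtered_words)
```

The single test "word not in stopwords" replaces the chain of != comparisons, so the loop stays the same no matter how long the stop word list is.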

Removing punctuation, versions 1, 2, and 3

Punctuation often gets in our way if we want to process words. For example, consider the sentence "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29." (This sentence is famous because it is the first sentence in the Wall Street Journal corpus, and everyone uses it as an example sentence.) If we split this sentence into words to process it, the result will, among other things, contain "old,", which is unfortunately a different string from "old" and won't be counted as the same word.
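A quick check makes the problem visible. This small sketch splits the example sentence on whitespace and tests which strings end up in the resulting list.

```python
sentence = ("Pierre Vinken, 61 years old, will join the board "
            "as a nonexecutive director Nov. 29.")

# split on whitespace; punctuation stays attached to the words
words = sentence.split()
print(words)

# the word with its trailing comma is in the list ...
print("old," in words)

# ... but the bare word is not
print("old" in words)
```

The first membership test gives True and the second gives False, because split() never separates punctuation from the word it is attached to.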

There are two things that we might want to do:

  • Discard punctuation completely (which is what we will do here)
  • Separate punctuation from the words, treating it as a separate "word". (This way it is still accessible if we want to segment text into sentences rather than into words.)

Assume for now that there are only two punctuation symbols we care about: full stop and comma.

Version 1: Write a Python program that does the following, again using the paragraph of text from the stop word problems above:

  • Split the text into words.
  • Run a loop over each word of the text. For each word, if it ends with either a full stop or a comma,
    use a string slice to obtain the word without the last character.
    Print the resulting word.

Version 2: Then modify your code such that it again starts with an empty list

stripped_words = [ ]

and, instead of printing the resulting word, stores it in the list stripped_words.
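The slicing idea behind Versions 1 and 2 can be sketched on a single phrase from the text. The sketch uses the string method endswith() for the test, which is one option among several, and it assumes that words without trailing punctuation should be kept unchanged.

```python
# One phrase from the paragraph, split into words
words = "Cambridge, and Lincoln, as well as Northampton.".split()

stripped_words = []

for word in words:
    # if the word ends with a full stop or a comma,
    # slice off the last character
    if word.endswith(".") or word.endswith(","):
        word = word[:-1]
    # words without trailing punctuation are stored unchanged
    stripped_words.append(word)

print(stripped_words)
```

The slice word[:-1] means "everything up to, but not including, the last character". Note that it removes only one character, so a word ending in two punctuation marks would keep the first of them.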

Version 3: Use


to familiarize yourself with the Python string function rstrip(). How can you use it to remove punctuation?
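As a hint at one possible answer: rstrip() accepts an optional string argument and removes any trailing characters that occur in that string. The following is a minimal sketch on one phrase from the text, not the only way to use the method.

```python
# rstrip(".,") removes every trailing character that is a "." or a ","
print("Northampton.".rstrip(".,"))
print("wait...".rstrip(".,"))

# a word without trailing punctuation is returned unchanged
print("Lincoln".rstrip(".,"))

words = "Cambridge, and Lincoln, as well as Northampton.".split()
stripped_words = []
for word in words:
    stripped_words.append(word.rstrip(".,"))

print(stripped_words)
```

Unlike slicing, rstrip() needs no test beforehand: it leaves words without trailing punctuation alone and removes several trailing punctuation characters in one call.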