Sometimes we need to remove so-called stop words -- typically, high-frequency function words; some stop word lists also include some high-frequency content words -- before further processing our text. For example, we may want to eliminate those extremely high-frequency items before counting words or before trying to determine collocations from text.
Assume that there are only three stop words that you want to eliminate:
"the", "a", "and"
Assume the following input text (from Charles Dack, "Weather and Folk Lore of Peterborough and District"):
Split this text into words. Then do a loop that iterates over the words in the sentence. If a word is not equal to "the", "a" or "and", print it.
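One possible shape for this loop is sketched below. The sample sentence is a made-up placeholder, not the Dack paragraph, and the variable names are only illustrative; substitute the text from the exercise.

```python
# Hypothetical sample sentence (an assumption for illustration only).
text = "the wind and the rain beat on a window"

# split() with no arguments breaks the string at whitespace.
words = text.split()

for word in words:
    # Print the word only if it is none of the three stop words.
    if word != "the" and word != "a" and word != "and":
        print(word)
```

Note that the comparison is case-sensitive, so "The" at the start of a sentence would not be filtered out by this sketch.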
This problem is the same as the previous one, but instead of printing the non-stop words, we want to save them in a new list. Here is a code outline. Please fill in the missing bits.
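As a rough sketch of the general pattern (start with an empty list, then append inside the loop), something like the following could work. The sample sentence and the name nonstop_words are assumptions for illustration, not part of the exercise outline.

```python
# Hypothetical sample sentence (an assumption for illustration only).
text = "the wind and the rain beat on a window"

# Start with an empty list and add each word that survives the filter.
nonstop_words = []
for word in text.split():
    if word != "the" and word != "a" and word != "and":
        nonstop_words.append(word)

print(nonstop_words)
```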
The problem is still the same as before: We want to eliminate stop words in the first paragraph of "Weather and Folk Lore of Peterborough and District". However, this time we want to use a larger set of stop words.
Remember the Python construct "in", illustrated here. It will be useful for this problem.
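A small sketch of how "in" and "not in" behave with a list of strings follows. The stop word list and the sample sentence here are invented for illustration; the exercise's own, larger list should be used instead.

```python
# Hypothetical, small stop word list (an assumption for illustration only).
stop_words = ["the", "a", "and", "of", "in", "to", "is"]

# "in" tests membership and evaluates to True or False.
print("and" in stop_words)          # True
print("weather" in stop_words)      # False
print("weather" not in stop_words)  # True

# The same test replaces a long chain of != comparisons inside a loop.
text = "the weather in the fens is a topic of interest"
content_words = []
for word in text.split():
    if word not in stop_words:
        content_words.append(word)

print(content_words)
```

With many stop words, a set (for example set(stop_words)) supports the same "in" test and is usually faster to search than a list.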
Punctuation often gets in our way if we want to process words. For example, consider the sentence "Pierre Vinken, 61 years old, will join the board as
a nonexecutive director Nov. 29." (This sentence is famous because it is the first sentence in the Wall
Street Journal corpus, which everyone uses as an example sentence.) If
we split this sentence into words to process it, it will, among other
things, contain "old,", which is unfortunately a different string from
"old" and won't be counted as the same word.
There are two things that we might want to do:
Assume for now that there are only two punctuation symbols we care about: full stop and comma.
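To see which tokens are affected under this assumption, the example sentence can be split and checked for a trailing full stop or comma. This is only an exploratory sketch; the variable names are illustrative.

```python
sentence = ("Pierre Vinken, 61 years old, will join the board as "
            "a nonexecutive director Nov. 29.")

words = sentence.split()

# The bare word "old" is not among the tokens, because the comma stays attached.
print("old" in words)   # False

# endswith() accepts a tuple of suffixes, so both symbols are checked at once.
affected = [w for w in words if w.endswith((".", ","))]
print(affected)         # ['Vinken,', 'old,', 'Nov.', '29.']
```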
Version 1: Write a Python program that does the following, again with the above paragraph of text:
Version 2: Then modify your code so that it again starts with an empty list
and, instead of printing each resulting word, stores it in the list stripped_words.
Version 3: Use
to familiarize yourself with the Python string function rstrip(). How can you use it to remove punctuation?
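As a hint at one possible answer: rstrip() takes an optional string argument and removes any trailing characters that occur in that string. A minimal sketch, using the Pierre Vinken sentence and the list name stripped_words from Version 2:

```python
# rstrip(chars) removes trailing characters that appear in chars;
# the argument is treated as a set of characters, not as a suffix.
print("old,".rstrip(","))     # old
print("29.".rstrip(".,"))     # 29
print("board".rstrip(".,"))   # board (nothing to remove)

sentence = ("Pierre Vinken, 61 years old, will join the board as "
            "a nonexecutive director Nov. 29.")

stripped_words = []
for word in sentence.split():
    stripped_words.append(word.rstrip(".,"))

print(stripped_words)
```

One caveat of this approach: the abbreviation "Nov." also loses its full stop and becomes "Nov", which may or may not be what is wanted.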