Python worksheet: regular expressions
Regular expressions let you search for patterns in text. While the Python string methods index() and find() find only exact matches, regular expressions are much more flexible.
In this worksheet, we will use Lewis Carroll's Alice in Wonderland as the sample text. On Project Gutenberg, you can find it here.
In Python, regular expressions are provided by the re module.
Exact match
In the simplest case, you can use a regular expression to look for exact matches, like index() and find():
>>> import re
>>> mystring = "procrastinate"
>>> mystring.find("ocra")
2
>>> re.search("ocra", mystring)
<_sre.SRE_Match object; span=(2, 6), match='ocra'>
Searching for "ocra" succeeded, returning some mysterious object (which we will look into more later). If we search for something that is not there, the answer will be None:
>>> re.search("banana", mystring)
>>> type(re.search("banana", mystring))
<class 'NoneType'>
The simplest thing we can do is to use re.search() as a pseudo-Boolean expression: None works like False, and "anything that is not None" works like True.
>>> if re.search("ocra", mystring): print("found")
...
found
>>> if re.search("banana", mystring): print("found")
...
>>>
Now let's search for words in Alice in Wonderland: We want to find all the lines in Alice in Wonderland that contain the string "Hatter". We read the file into a list of strings, one line per string:
with open("/Users/katrinerk/Desktop/pg11.txt") as f:
    # one string per line, with trailing whitespace (including the newline) removed
    alice_lines = [line.rstrip() for line in f]
for line in alice_lines:
    if re.search("Hatter", line): print(line)
Regular expressions match substrings, not words. Here is what happens if we search for lines that contain the string "riddle":
>>> for line in alice_lines:
...     if re.search("riddle", line): print(line)
...
begun asking riddles.--I believe I can guess that,' she added aloud.
'Have you guessed the riddle yet?' the Hatter said, turning to Alice
time,' she said, 'than waste it in asking riddles that have no answers.'
As you can see, we found "riddles" twice, and "riddle" once.
To drive the point home some more, here is another example that looks for a character sequence that is not a word:
for line in alice_lines:
    if re.search("ed them", line): print(line)
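If you do not have the text file at hand, you can try the same loop on a small list of lines. The sketch below is only a stand-in: sample_lines and matching_lines() are names made up for illustration, and the three lines are the ones printed in the "riddle" example above.

```python
import re

# A tiny stand-in for alice_lines: the three lines from the "riddle" example.
sample_lines = [
    "begun asking riddles.--I believe I can guess that,' she added aloud.",
    "'Have you guessed the riddle yet?' the Hatter said, turning to Alice",
    "time,' she said, 'than waste it in asking riddles that have no answers.'",
]

def matching_lines(pattern, lines):
    """Return the lines in which re.search() finds the pattern."""
    return [line for line in lines if re.search(pattern, line)]

riddle_hits = matching_lines("riddle", sample_lines)
hatter_hits = matching_lines("Hatter", sample_lines)

print(len(riddle_hits))   # 3, because "riddles" contains "riddle"
print(hatter_hits)        # only the second line mentions the Hatter
```

Collecting the matching lines in a list, rather than printing them right away, makes it easy to count them or to search them again.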
Beyond exact matches
Now let's start looking at things that we can do with regular expressions but not with the string methods find() and index().
Say we want to find all occurrences of the word "Hatter" in Alice in Wonderland, but we aren't sure whether it is always capitalized or not. Then we can do this:
for line in alice_lines:
    if re.search("[Hh]atter", line): print(line)
[Hh] matches a single letter which can be either H or h.
[aeiuo] matches any lowercase vowel.
[1234567890_abc] matches a digit or an underscore or a or b or c
[....] always matches a single character, and you are specifying all the possibilities of what it could be.
What will the following expression match?
for line in alice_lines:
    if re.search("t[aeiou][aeiou]n", line): print(line)
Here are some shortcuts for bracket expressions that are often needed:
[A-Z] matches any uppercase letter
[a-z] matches any lowercase letter
[0-9] matches any digit
You can combine them, for example in [A-Za-z0-9]
Are there any lines in Alice in Wonderland that contain longer sequences of digits?
for line in alice_lines:
    if re.search("[0-9][0-9][0-9]", line): print(line)
You can also negate a bracket expression by putting a ^ (caret) directly after the opening bracket:
[^aeiou] matches any character that is not a lowercase vowel (what does that encompass?)
[^A-Za-z] matches anything but a letter
for line in alice_lines:
    if re.search("[^aeiou][^aeiou][^aeiou]", line): print(line)
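Here is a small sketch that tries several bracket expressions on a handful of strings instead of the whole book. The word list is made up for illustration; it is not taken from Alice in Wonderland.

```python
import re

# A made-up list of test strings.
words = ["Hatter", "hatter", "tain", "teen", "tan", "x9", "rhythm"]

# [Hh]: either capitalization
either_case = [w for w in words if re.search("[Hh]atter", w)]
# t, then two lowercase vowels, then n
two_vowels = [w for w in words if re.search("t[aeiou][aeiou]n", w)]
# a digit anywhere in the string
with_digit = [w for w in words if re.search("[0-9]", w)]
# three characters in a row that are not lowercase vowels
no_vowel_run = [w for w in words if re.search("[^aeiou][^aeiou][^aeiou]", w)]

print(either_case)    # ['Hatter', 'hatter']
print(two_vowels)     # ['tain', 'teen']
print(with_digit)     # ['x9']
print(no_vowel_run)   # ['rhythm']
```

Note that "tan" is not found by the second expression: each bracket expression stands for exactly one character, so two of them need two vowels.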
Another way of matching a single character
You can also match a single character like this:
\d matches a single digit; for plain ASCII text this is equivalent to [0-9]
\D matches a single character that is not a digit, equivalent to [^0-9]
\s matches a whitespace character, equivalent to [ \t\n\r\f\v] (note the space at the start)
\S matches a non-whitespace character
\w matches an alphanumeric character or an underscore; for plain ASCII text this is equivalent to [A-Za-z0-9_]
\W matches a character that is neither alphanumeric nor an underscore
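To get a feeling for these classes, the following sketch tests a few one-character strings against \d, \s and \w. The sample characters are chosen for illustration only. The patterns are written as raw strings (with an r in front) so that the backslash reaches re unchanged.

```python
import re

# One-character test strings: a digit, a letter, a space, an underscore, punctuation.
samples = ["7", "x", " ", "_", "!"]

results = {}
for s in samples:
    # For each sample: does it match \d? \s? \w?
    results[s] = (
        bool(re.search(r"\d", s)),
        bool(re.search(r"\s", s)),
        bool(re.search(r"\w", s)),
    )
    print(repr(s), results[s])
```

The underscore counts as a \w character, while "!" matches none of the three classes.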
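Let's look for sequences of 3 or more digits again. The r in front of the string makes it a "raw" string, so that Python passes the backslashes on to re unchanged (more on this in the section on anchors):
for line in alice_lines:
    if re.search(r"\d\d\d", line): print(line)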
What does this find?
for line in alice_lines:
    if re.search(r"b\w\w\wed", line): print(line)
Matching "any character"
The period "." matches any single character: letter, digit, punctuation, whitespace, and anything else (by default, the only exception is the newline character). For example, "m..c" will match an occurrence of "m", then 2 characters whatever they may be, then "c".
for line in alice_lines:
    if re.search("b...ed", line): print(line)
If you want to match a literal period, you have to put a backslash ("\") before it:
for line in alice_lines:
    if re.search(r"ous\.", line): print(line)
This is called "escaping" the special character: it makes it non-special. This works with any special character: "\^" matches a literal caret, "\[" matches a literal opening square bracket, and so on.
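The difference between "." and "\." is easy to see on two short strings. This is a minimal sketch with made-up test lines; it also shows re.escape(), a function of the re module that adds the backslashes for you.

```python
import re

# Two made-up test lines.
lines = ["It was curious.", "curiouser and curiouser"]

# "." stands for any character, so "ous." also matches "ouse"
any_char = [l for l in lines if re.search("ous.", l)]
# "\." stands for a literal period only
literal_period = [l for l in lines if re.search(r"ous\.", l)]

print(any_char)        # both lines
print(literal_period)  # ['It was curious.']

# re.escape() puts a backslash before the special characters in a string
escaped_pattern = re.escape("ous.")
print(escaped_pattern)  # ous\.
```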
The "." is especially useful in combination with the "*" and the "+", which we discuss next.
Repetition
"a+" stands for "one or more a's". "+" means "one or more", and it follows the sequence to be repeated.
"*" (star) stands for "zero or more".
For example:
"\s+" will match: one or more whitespace characters
"\(.+\)" will match: an opening parenthesis (escaped), then one or more arbitrary characters, then a closing parenthesis (again, escaped)
"(in)*" will match: zero or more repetitions of the string "in" (here the parentheses are to group: They say that it's the whole of "in" that is being repeated, not just the "n")
What will this match? What could this be describing?
"\w+\s*=\s*\d+"
And how about this?
"[A-Z][a-z]*"
Here is some code that looks for the particle verb "give up" in a line, with arbitrarily many intervening characters:
for line in alice_lines:
    if re.search("give.*up", line): print(line)
Can you do this for some other particle verb?
When would this go wrong: Can you think of cases where we wouldn't find a particle verb, or where we would find something we weren't looking for?
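To see + and * at work on small strings, here is a sketch that checks a few patterns one by one. The helper found() and the test strings are made up for illustration.

```python
import re

def found(pattern, text):
    """Hypothetical helper: True if re.search() finds the pattern in text."""
    return re.search(pattern, text) is not None

print(found(r"\s+", "a b"))         # True
print(found(r"\s+", "ab"))          # False: "+" needs at least one whitespace
print(found(r"\(.+\)", "f(x)"))     # True
print(found(r"\(.+\)", "f()"))      # False: ".+" needs at least one character
print(found("(in)*", "xyz"))        # True: zero repetitions also count

# Without parentheses, "+" applies to the "n" only; with them, to all of "an".
print(found("ban+a", "bannna"))     # True
print(found("b(an)+a", "bannna"))   # False
print(found("b(an)+a", "banana"))   # True
```

The fifth line shows why "*" has to be used with care: a pattern that can match zero characters will be found in any string.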
Both + and * allow arbitrarily many repetitions. But you can also specify the exact number of repetitions:
"a{5}" matches a sequence of 5 a's (but it also matches a sequence of 6 a's, because that will contain a sequence of 5 a's)
"(abc){3}" matches the sequence abcabcabc
Choice and optionality
A single vertical line "|" means "or". So
a|b
matches a single "a" or "b", same as [ab].
mov(es|ing|e|ed)
matches "moves", "moving", "move", and "moved".
for line in alice_lines:
    if re.search("mov(es|ed|e|ing)", line): print(line)
"?" means optionality. For example,
sings?
will match "sing" as well as "sings", and
sing(ing)?
will match "sing" as well as "singing". So like + and *, ? comes after the sequence of characters that it applies to.
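Here is a short sketch of "|" and "?" on small word lists. The lists are made up for illustration, and "colou?r" is an extra example that does not come from the text above.

```python
import re

# "a|b" and "[ab]" find the same single characters
letters = ["a", "b", "c"]
with_bar = [x for x in letters if re.search("a|b", x)]
with_brackets = [x for x in letters if re.search("[ab]", x)]
print(with_bar, with_brackets)   # ['a', 'b'] ['a', 'b']

# "mov" followed by one of four endings
words = ["move", "moves", "moved", "moving", "movie"]
mov_forms = [w for w in words if re.search("mov(es|ing|e|ed)", w)]
print(mov_forms)                 # ['move', 'moves', 'moved', 'moving']

# "?" makes the character before it optional
spellings = ["color", "colour", "colr"]
optional_u = [w for w in spellings if re.search("colou?r", w)]
print(optional_u)                # ['color', 'colour']
```

Keep in mind that re.search() still looks for substrings: "sings?" will also be found in "singing", because "singing" contains "sing".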
Try it for yourself:
Find in Alice in Wonderland:
lines with words with at least 7 letters
lines with words that contain 4 consecutive consonants
all word forms of the lemma "sing"
Make a single regular expression that will match all these different numbers:
123
3.1415
0.5
.5
-10
-543.23934896
Anchors
Anchors don't match any characters; they mark special places in a string: the beginning and end of the string, and the boundaries of words (now, finally, we get to a regular expression character that is not ignorant of what words are!).
"^" matches at the beginning of a string. So
"^123"
will only match strings that begin with "123". Here is how you would look for lines in Alice in Wonderland that start with "The":
for line in alice_lines:
    if re.search("^The", line): print(line)
This will not match any occurrences of "The" that are not at the beginning of the line, as we can see by counting:
>>> counter = 0
>>> for line in alice_lines:
...     if re.search("^The", line): counter = counter + 1
...
>>> counter
81
>>> counter = 0
>>> for line in alice_lines:
...     if re.search("The", line): counter = counter + 1
...
>>> counter
198
So "The" occurs 198 times overall, but only 81 times at the beginning of a line.
"$" matches at the end of a string. So
"123$"
will match strings that end with "123". Here is how we would look for the word "Alice" occurring at the end of a line of Alice in Wonderland:
for line in alice_lines:
    if re.search("Alice$", line): print(line)
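The effect of "^" and "$" can be checked on a few short lines. The lines below are made up for illustration; they are not quotes from the book.

```python
import re

# Made-up test lines.
lines = [
    "The Hatter spoke.",
    "said The Hatter",
    "Then came Alice",
    "Alice was late",
]

starts_with_the = [l for l in lines if re.search("^The", l)]
contains_the = [l for l in lines if re.search("The", l)]
ends_with_alice = [l for l in lines if re.search("Alice$", l)]

print(starts_with_the)   # ['The Hatter spoke.', 'Then came Alice']
print(len(contains_the)) # 3
print(ends_with_alice)   # ['Then came Alice']
```

Note that "^The" also finds "Then": the anchor only fixes where the match starts, not where the word ends.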
There are two more anchors, as promised:
\b matches a word boundary
\B matches anywhere but at a word boundary
A word of caution: Some combinations of a backslash and a letter have a special interpretation in Python strings. For example, \n is a newline, and \b is a backspace (delete a character to the left). We don't want Python to interpret "\b" in a regular expression as a backspace. The way to say that is to put an r for "raw" before your string. (Looks weird, but is correct.) Like this: r"\bsing\b". This will match the word "sing" but not "singing" and also not "cursing".
Looking for lines in Alice in Wonderland that contain the word "sing" produces this result:
>>> for line in alice_lines:
...     if re.search(r"\bsing\b", line): print(line)
...
given by the Queen of Hearts, and I had to sing
'We can do without lobsters, you know. Which shall sing?'
'Oh, YOU sing,' said the Gryphon. 'I've forgotten the words.'
on. 'Or would you like the Mock Turtle to sing you a song?'
with sobs, to sing this:--
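This is interesting: The second line has "sing?' ", the third has "sing,' " and both are recognized as "sing" followed by a word boundary. That is because \b matches at any position where a word character (\w) stands next to a non-word character (\W) or next to the start or end of the string, so punctuation counts as a boundary just like whitespace does.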
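The following sketch shows both points on small strings: what the r in front of the pattern changes, and where \b and \B match. The test strings are chosen for illustration (the first one is the end of a line from the output above).

```python
import re

# Without r, "\b" is a single character (backspace); with r it is a backslash plus "b".
print(len("\b"), len(r"\b"))   # 1 2

tests = ["Which shall sing?'", "singing", "cursing"]

# "sing" as a whole word
whole_word = [t for t in tests if re.search(r"\bsing\b", t)]
print(whole_word)              # ["Which shall sing?'"]

# \B: "sing" that does not start at a word boundary
inside_word = [t for t in tests if re.search(r"\Bsing", t)]
print(inside_word)             # ['cursing']

# Forgetting the r: the pattern now looks for backspace characters
no_raw = [t for t in tests if re.search("\bsing\b", t)]
print(no_raw)                  # []
```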
Try it for yourself:
Find in Alice in Wonderland
lines that contain words of *exactly* 7 letters
lines that contain words with no vowels (you'll find more of these if you don't count "y" as a vowel)
Taking strings apart
So far, we have just checked whether a string matched an expression, without being able to say more specifically which parts of the string matched which parts of the expression. Now we change that.
Remember the mysterious Match object from above?
>>> mystring = "procrastinate"
>>> re.search("ocra", mystring)
<_sre.SRE_Match object; span=(2, 6), match='ocra'>
If you look closely, you see that this Match object says what was matched: the string 'ocra' starting from the 3rd letter and ending before the 7th letter (span = (2, 6)). Here is how you extract this information:
>>> mystring = "procrastinate"
>>> mobj = re.search("ocra", mystring)
>>> mobj.group(0)
'ocra'
>>> mobj.start()
2
>>> mobj.end()
6
So re.search() returns a Match object that has the following methods:
group(0): returns the whole matching substring
start(): returns the start index of the match
end(): returns the end index of the match
You can also do more fine-grained matches by using parentheses: Each pair of parentheses creates a "subgroup" that will be reported separately. Suppose we have a string that contains a phone number:
>>> mystring = "Phone number: 512-123-4567"
Then we can take it apart like this:
>>> mobj = re.search(r"Phone\D*(\d+)-(\d+)-(\d+)", mystring)
>>> mobj.group(0)
'Phone number: 512-123-4567'
>>> mobj.groups()
('512', '123', '4567')
>>> mobj.group(1)
'512'
>>> mobj.group(2)
'123'
>>> mobj.group(3)
'4567'
>>> mobj.start(1)
14
>>> mobj.end(1)
17
So, Match objects also have the following methods:
groups() returns all subgroups (but not the whole matching string)
group(1) returns the substring that matched the first set of parentheses (counting from the left), and accordingly for group(2).
start(1) returns the start index of the substring matching the first set of parentheses, and accordingly for start(2) and so on.
end(1) returns the end index of the substring matching the first set of parentheses
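The same methods work in a script. Here is a minimal sketch on a made-up string with a date in it; the string and the variable names are assumptions for illustration. Since re.search() returns None when nothing is found, the sketch checks for that before calling any Match methods.

```python
import re

# A made-up example string.
text = "Meeting on 2024-03-15"

mobj = re.search(r"(\d+)-(\d+)-(\d+)", text)
if mobj is not None:
    # groups() returns one string per pair of parentheses
    year, month, day = mobj.groups()
    print(mobj.group(0))                # 2024-03-15
    print(year, month, day)             # 2024 03 15
    print(mobj.start(1), mobj.end(1))   # 11 15
else:
    print("no date found")
```

Calling group() or groups() on None raises an error, so it is a good habit to test the result of re.search() first.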
Try it for yourself:
Using alice_lines from above, print all words of 6 letters or more occurring in Alice in Wonderland (not the whole lines, just the matching words)
Greedy matching
Suppose we want to extract a substring starting at "<" and ending at a matching ">". We might try this:
>>> mystring = "<abc><def>"
>>> mobj = re.search("<.*>", mystring)
>>> mobj.group(0)
'<abc><def>'
But this didn't just extract <abc>, it extracted all of <abc><def> even though there was a ">" after the "c". This is because regular expressions do greedy matching: They always match the longest substring that they can. And since the longest substring here that has a "<", then arbitrary characters (which could also be ">") and that ends in a ">" is the whole string, it matches the whole thing.
You can tell your regular expression to match the shortest possible substring instead of the longest one, like this:
>>> mobj = re.search("<.*?>", mystring)
>>> mobj.group(0)
'<abc>'
By using *? instead of just *, you do a non-greedy match. There is also +?, the non-greedy version of +.
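Here is a sketch that puts the greedy and the non-greedy version of + side by side. The last line is an alternative that is not mentioned above: a negated bracket expression that simply forbids ">" inside the match.

```python
import re

mystring = "<abc><def>"

greedy = re.search("<.+>", mystring).group(0)
non_greedy = re.search("<.+?>", mystring).group(0)
# Alternative: anything but ">" between the angle brackets
negated = re.search("<[^>]*>", mystring).group(0)

print(greedy)       # <abc><def>
print(non_greedy)   # <abc>
print(negated)      # <abc>
```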
Other functions
The re module in Python has other functions besides re.search(). You can find the documentation here: https://docs.python.org/3/library/re.html.
In particular, look at
re.split(): split on regular expressions rather than just strings
re.findall(): find all non-overlapping matches in a string and return them in a list
re.sub(): substitutes matches of a regex pattern by a string
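As a first impression of these three functions, here is a short sketch. The example strings are made up for illustration.

```python
import re

# re.split(): split on commas, semicolons and whitespace, in any combination
parts = re.split(r"[,;\s]+", "one, two;three   four")
print(parts)      # ['one', 'two', 'three', 'four']

# re.findall(): all non-overlapping matches, as a list of strings
numbers = re.findall(r"\d+", "3 cats, 12 dogs, 105 birds")
print(numbers)    # ['3', '12', '105']

# re.sub(): replace every run of whitespace by a single space
cleaned = re.sub(r"\s+", " ", "too    many   spaces")
print(cleaned)    # too many spaces
```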
A more complex task
In the "Conditions, Lists and Loops" worksheet I talked about the task of splitting off punctuation. As I said there, Python's split() function leaves punctuation wherever it attaches:
>>> sentence = "Computational linguistics (sometimes also referred to as natural language processing, NLP) is a highly interdisciplinary area."
>>> sentence.split()
['Computational', 'linguistics', '(sometimes', 'also', 'referred', 'to', 'as', 'natural', 'language', 'processing,', 'NLP)', 'is', 'a', 'highly', 'interdisciplinary', 'area.']
But when we count words, "(sometimes" and "sometimes" count as different strings -- which is not what we want. Instead, we would like punctuation to be separate, like this:
[ 'Computational', 'linguistics', '(', 'sometimes', 'also', 'referred', 'to', 'as', 'natural', 'language', 'processing', ',', 'NLP', ')', 'is', 'a', 'highly', 'interdisciplinary', 'area', '.' ]
Here is a first stab at an answer:
import re
import string
sentence = "Computational linguistics (sometimes also referred to as natural language processing, NLP) is a highly interdisciplinary area."
# escape the punctuation characters so that \ and ] are taken literally inside [ ]
punct = re.escape(string.punctuation)
# zero or more punctuation characters, then zero or more arbitrary characters
# ending in a non-punctuation character,
# then zero or more punctuation characters.
# All three pieces extracted using ( )
expr = "([" + punct + "]*)(.*[^" + punct + "])([" + punct + "]*)"
# this will hold the split sentence.
pieces = [ ]
for word in sentence.split():
    m = re.search(expr, word)
    # append all the non-empty pieces to the new sentence
    for piece in m.groups():
        if len(piece) > 0:
            pieces.append(piece)
This produces the following result:
>>> print(pieces)
['Computational', 'linguistics', '(', 'sometimes', 'also', 'referred', 'to', 'as', 'natural', 'language', 'processing', ',', 'NLP', ')', 'is', 'a', 'highly', 'interdisciplinary', 'area', '.']
So, we get this sentence right. But as I said, this is only a first stab at a solution. The result will be suboptimal for:
U.S.A.
Dr.
doesn't
... anything else that you can think of?
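To see what actually happens with these words, here is a sketch that applies the expression to single words. The helper split_word() is made up for illustration, and re.escape() around string.punctuation is a precaution assumed for this sketch. A word that consists only of punctuation gives no match at all, so the helper returns it unchanged instead of calling groups() on None.

```python
import re
import string

# Assumption: re.escape() is a precaution added for this sketch; otherwise
# the expression is built like the one in the worksheet.
punct = re.escape(string.punctuation)
expr = "([" + punct + "]*)(.*[^" + punct + "])([" + punct + "]*)"

def split_word(word):
    """Hypothetical helper: split one word into its non-empty pieces."""
    m = re.search(expr, word)
    if m is None:
        # nothing but punctuation: the middle group cannot match
        return [word]
    return [piece for piece in m.groups() if len(piece) > 0]

for word in ["U.S.A.", "Dr.", "doesn't", "--"]:
    print(word, split_word(word))
# U.S.A. ['U.S.A', '.']
# Dr. ['Dr', '.']
# doesn't ["doesn't"]
# -- ['--']
```

The abbreviations lose their final period, while "doesn't" stays in one piece because the apostrophe sits in the middle of the word.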
Try it for yourself:
Find some words where this method would go wrong.
How can you improve on it?