Python worksheet: regular expressions

Regular expressions let you search for patterns in text. While the Python string methods index() and find() find only exact matches, regular expressions are much more flexible.

In this worksheet, we will use Lewis Carroll's Alice in Wonderland as the sample text. On Project Gutenberg, you can find it here.

In Python, regular expressions are provided by the standard library module re.

Exact match

In the simplest case, you can use regular expressions to look for exact matches, just as with index() and find():

>>> import re

>>> mystring = "procrastinate"

>>> mystring.find("ocra")

2

>>> re.search("ocra", mystring)

<re.Match object; span=(2, 6), match='ocra'>

Searching for "ocra" succeeded, returning some mysterious object (which we will look into more later). If we search for something that is not there, the answer will be None:

>>> re.search("banana", mystring)

>>> type(re.search("banana", mystring))

<class 'NoneType'>

The simplest thing we can do is to use re.search() as a pseudo-Boolean expression: None works like False, and "anything that is not None" works like True.

>>> if re.search("ocra", mystring): print("found")

...

found

>>> if re.search("banana", mystring): print("found")

...

>>>

Now let's search for words in Alice in Wonderland: We want to find all the lines in Alice in Wonderland that contain the string "Hatter". We read the file into a list of strings, one line per string:

f = open("/Users/katrinerk/Desktop/pg11.txt")

alice_lines = f.readlines()

alice_lines = [l.rstrip() for l in alice_lines]

f.close()

for line in alice_lines:

     if re.search("Hatter", line): print( line )

Regular expressions match substrings, not words. Here is what happens if we search for lines that contain the string "riddle":

>>> for line in alice_lines:

...     if re.search("riddle", line): print(line)

...

begun asking riddles.--I believe I can guess that,' she added aloud.

'Have you guessed the riddle yet?' the Hatter said, turning to Alice

time,' she said, 'than waste it in asking riddles that have no answers.'

As you can see, we found "riddles" twice, and "riddle" once.

To drive the point home some more, here is another example that looks for a character sequence that is not a word:

for line in alice_lines:

    if re.search("ed them", line): print(line)

Beyond exact matches

Now let's start looking at things that we can do with regular expressions but not with the string methods find() and index().

Say we want to find all occurrences of the word "Hatter" in Alice in Wonderland, but we aren't sure whether it is always capitalized or not. Then we can do this:

for line in alice_lines:

    if re.search("[Hh]atter", line): print(line)

[Hh] matches a single letter which can be either H or h.

[aeiuo] matches any lowercase vowel.

[1234567890_abc] matches a digit or an underscore or a or b or c

[....] always matches a single character, and you are specifying all the possibilities of what it could be.
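To check this behaviour without loading the whole book, here is a minimal sketch on a few made-up words. The names words and matching exist only for this illustration.

```python
import re

# Made-up test words for this sketch; they are not taken from the book.
words = ["Hatter", "hatter", "matter", "chatter"]

# [Hh] stands for exactly one character, either "H" or "h"
pattern = "[Hh]atter"
matching = [w for w in words if re.search(pattern, w)]

print(matching)  # ['Hatter', 'hatter', 'chatter']
```

"matter" is rejected because neither H nor h comes before "atter". "chatter" is accepted because regular expressions match substrings, and "hatter" is a substring of it.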

What will the following expression match?

for line in alice_lines:

    if re.search("t[aeiou][aeiou]n", line): print( line )

   

Here are some shortcuts for bracket expressions that are often needed:

Are there any lines in Alice in Wonderland that contain longer sequences of digits?

for line in alice_lines:

    if re.search("[0-9][0-9][0-9]", line): print(line)
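The pattern above uses the range notation: [0-9] is short for [0123456789], and [a-z] and [A-Z] work the same way for lowercase and uppercase letters. Here is a small sketch with made-up test strings; the names tests and results are introduced only for this illustration.

```python
import re

# Made-up test strings for this sketch; they are not taken from the book.
tests = ["Chapter 12", "in the year 1865", "no digits here"]

results = []
for text in tests:
    # [0-9] is a range: any one character from "0" to "9"
    three_digits = bool(re.search("[0-9][0-9][0-9]", text))
    # [A-Z][a-z]: an uppercase letter directly followed by a lowercase letter
    cap_then_lower = bool(re.search("[A-Z][a-z]", text))
    results.append((text, three_digits, cap_then_lower))

for row in results:
    print(row)
```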

You can also negate a bracket expression by putting a ^ (caret) directly after the opening bracket:

for line in alice_lines:

    if re.search("[^aeiou][^aeiou][^aeiou]", line): print( line)
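Note that [^aeiou] does not mean "a consonant". It means "any single character that is not a lowercase vowel", and that includes spaces, digits, punctuation, and uppercase letters. A minimal sketch with made-up test strings:

```python
import re

pattern = "[^aeiou][^aeiou][^aeiou]"

# Made-up test strings for this sketch.
results = {}
for text in ["strength", "banana", "go by"]:
    results[text] = bool(re.search(pattern, text))

print(results)
# "strength" matches (for example "str"), "banana" does not,
# and "go by" matches because the space counts as "not a vowel" (" by").
```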

Another way of matching a single letter

You can also match a single letter like this:

Let's look for sequences of 3 or more digits again:

for line in alice_lines:

    if re.search("\d\d\d", line): print( line)

What does this find?

for line in alice_lines:

    if re.search("b\w\w\wed", line): print( line )
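The two loops above use the shorthands \d (any digit) and \w (any letter, digit, or underscore). A third one, \s, matches any whitespace character and shows up later in this worksheet. The sketch below tries them on made-up strings; the variable names are introduced only for this illustration.

```python
import re

# The r prefix (raw string) keeps Python from interpreting the backslashes
# itself; the worksheet explains it in the section on anchors.

three_digits_hit = bool(re.search(r"\d\d\d", "page 123"))   # three digits in a row
three_digits_miss = bool(re.search(r"\d\d\d", "page 12"))   # only two digits
word_chars = bool(re.search(r"\w\w\w", "a_1"))    # letter, underscore, digit all count as \w
whitespace = bool(re.search(r"a\sb", "a b"))      # \s matches the space

print(three_digits_hit, three_digits_miss, word_chars, whitespace)
# True False True True
```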

Matching "any character"

The period "." matches any single character: letter, digit, punctuation, whitespace, or anything else (with one exception: by default it does not match a newline). For example, "m..c" will match an occurrence of "m", then 2 characters whatever they may be, then "c".

for line in alice_lines:

    if re.search("b...ed", line): print(line)

If you want to match a literal period, you have to put a backslash ("\") before it:

for line in alice_lines:

    if re.search("ous\.", line): print(line)

This is called "escaping" the special character: the backslash makes it nonspecial. This works with any special character: "\^" matches a literal caret, "\[" matches a literal opening square bracket, and so on.
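Here is a short sketch of the difference between "." and "\." on made-up strings. It also uses re.escape(), a function of the re module that this worksheet does not otherwise cover: it inserts the backslashes for you when you want a whole string to be taken literally.

```python
import re

# Unescaped ".": any character will do, so "ousl" in "curiously" matches
dot_any = bool(re.search("ous.", "curiously"))

# Escaped "\.": only a literal period will do
dot_literal_miss = bool(re.search(r"ous\.", "curiously"))
dot_literal_hit = bool(re.search(r"ous\.", "curious."))

# re.escape() adds the backslash in front of the period
escaped = re.escape("3.14")
matches_314 = bool(re.search(escaped, "pi is 3.14"))
matches_3x14 = bool(re.search(escaped, "3x14"))

print(dot_any, dot_literal_miss, dot_literal_hit)  # True False True
print(matches_314, matches_3x14)                   # True False
```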

The "." is especially useful in combination with the "*" and the "+", which we discuss next.

Repetition

"a+" stands for "one or more a's". "+" means "one or more", and it follows the sequence to be repeated.

"*" (star) stands for "zero or more".

For example:
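A small sketch of the difference between the two, using made-up test strings (the names results, plus and star are introduced only for this illustration):

```python
import re

# Made-up test strings for this sketch.
results = []
for text in ["bnn", "bann", "baaann"]:
    plus = bool(re.search("ba+n", text))   # needs at least one "a" between "b" and "n"
    star = bool(re.search("ba*n", text))   # also accepts no "a" at all
    results.append((text, plus, star))

for row in results:
    print(row)
# ('bnn', False, True)
# ('bann', True, True)
# ('baaann', True, True)
```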

What will this match? What could this be describing?

"\w+\s*=\s*\d+"

And how about this?

"[A-Z][a-z]*"

Here is some code that looks for the particle verb "give up" in a line, with arbitrarily many intervening characters:

for line in alice_lines:

    if re.search("give.*up", line): print(line)

Can you do this for some other particle verb?

When would this go wrong: Can you think of cases where we wouldn't find a particle verb, or where we would find something we weren't looking for?

Both + and * can stand for arbitrarily many letters. But you can also specify the exact number of repetitions:

"a{5}" matches a sequence of 5 a's (but it also matches a sequence of 6 a's, because that will contain a sequence of 5 a's)

"(abc){3}" matches the sequence abcabcabc

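The sketch below checks both statements on made-up strings. It also shows why the parentheses in "(abc){3}" matter: without them, {3} applies only to the single character just before it.

```python
import re

# Made-up test strings for this sketch.
five_in_four = bool(re.search("a{5}", "aaaa"))          # only 4 a's
five_in_six = bool(re.search("a{5}", "aaaaaa"))         # 6 a's contain 5 a's

group_three = bool(re.search("(abc){3}", "abcabcabc"))  # "abc" three times
group_two = bool(re.search("(abc){3}", "abcabc"))       # only twice

# Without parentheses, {3} applies to the "c" alone: "ab" followed by "ccc"
no_group_hit = bool(re.search("abc{3}", "abccc"))
no_group_miss = bool(re.search("abc{3}", "abcabcabc"))

print(five_in_four, five_in_six, group_three, group_two, no_group_hit, no_group_miss)
# False True True False True False
```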
Choice and optionality

A single vertical line "|" means "or". So

    a|b

matches a single "a" or "b", same as [ab].

    mov(es|ing|e|ed)

matches "moves", "moving", "move", and "moved".

for line in alice_lines:

    if re.search("mov(es|ed|e|ing)", line): print(line)

"?" means optionality. For example,

    sings?

will match "sing" as well as "sings", and

    sing(ing)?

will match "sing" as well as "singing". So like + and *, ? comes after what it applies to: a single character, or a group in parentheses.
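Here is a minimal sketch of "|" and "?" on made-up words. The dictionary names are introduced only for this illustration.

```python
import re

# Made-up test words for this sketch.
mov_results = {w: bool(re.search("mov(es|ing|e|ed)", w))
               for w in ["moves", "moving", "removed", "movie"]}

sing_results = {w: bool(re.search("sings?", w))
                for w in ["sing", "sings", "sin"]}

# With parentheses the whole "ing" is optional; without them only the last "g" is
with_group = bool(re.search("sing(ing)?", "sing"))
without_group = bool(re.search("singing?", "sing"))

print(mov_results)    # "removed" matches too, because "move" is a substring of it
print(sing_results)
print(with_group, without_group)  # True False
```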

Try it for yourself:

Anchors

Anchors don't match any characters; they mark special places in a string: the beginning and end of the string, and the boundaries of words (now, finally, we get to a regular expression element that is not ignorant of what words are!).

"^" matches at the beginning of a string. So

    "^123"

will only match strings that begin with "123". Here is how you would look for lines in Alice in Wonderland that start with "The":

for line in alice_lines:

    if re.search("^The", line): print(line)

This will not match any occurrences of "The" that are not at the beginning of the line, as we can see by counting:

>>> counter = 0

>>> for line in alice_lines:

...     if re.search("^The", line): counter = counter + 1

...

>>> counter

81

>>> counter = 0

>>> for line in alice_lines:

...     if re.search("The", line): counter = counter + 1

...

>>> counter

198

So "The" occurs 198 times overall, but only 81 times at the beginning of a line.

"$" matches at the end of a string. So

    "123$"

will match strings that end with "123". Here is how we would look for the word "Alice" occurring at the end of a line of Alice in Wonderland:

for line in alice_lines:

    if re.search("Alice$", line): print(line)
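To try both anchors without the book file, here is a sketch on three made-up lines. The names lines, starts_with_the and ends_with_alice are introduced only for this illustration.

```python
import re

# Made-up lines for this sketch; they are not taken from the book.
lines = [
    "The Hatter stared at Alice",
    "Alice stared at The Hatter",
    "Then Alice left",
]

starts_with_the = [line for line in lines if re.search("^The", line)]
ends_with_alice = [line for line in lines if re.search("Alice$", line)]

print(starts_with_the)   # first and third line
print(ends_with_alice)   # first line only
```

The third line is found by "^The" because "Then" begins with the characters "The": the anchor fixes the position, but the pattern still matches substrings, not words.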

There are two more anchors, as promised: \b matches at a word boundary, and \B matches at any position that is not a word boundary.

A word of caution: Some combinations of \ plus a letter have special interpretations in Python strings. For example, \n is a newline and \b is a backspace (delete a character to the left). We don't want Python to interpret "\b" in a regular expression as a backspace. The way to say that is to put an r for "raw" before your string. (Looks weird, but is correct.) Like this: r"\bsing\b". This will match the word "sing" but not "singing" and also not "cursing".

Looking for lines in Alice in Wonderland that contain the word "sing" produces this result:

>>> for line in alice_lines:

...     if re.search(r"\bsing\b", line): print(line)

...

given by the Queen of Hearts, and I had to sing

'We can do without lobsters, you know. Which shall sing?'

'Oh, YOU sing,' said the Gryphon. 'I've forgotten the words.'

on. 'Or would you like the Mock Turtle to sing you a song?'

with sobs, to sing this:--

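This is interesting: The second line has "sing?' ", the third has "sing,' " and both are recognized as "sing" followed by a word boundary. So \b treats punctuation as a boundary: it matches wherever a word character (a letter, digit, or underscore) sits next to a character that is not one, or next to the start or end of the string.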
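The sketch below shows what the r prefix changes, and then checks \b on a few made-up strings. The name results is introduced only for this illustration.

```python
import re

# In an ordinary string "\b" is a single character (backspace);
# in a raw string it stays as backslash + "b", which is what re needs.
plain_length = len("\b")
raw_length = len(r"\b")
print(plain_length, raw_length)   # 1 2

# Made-up test strings for this sketch.
results = {}
for text in ["sing", "singing", "cursing", "YOU sing,' said"]:
    results[text] = bool(re.search(r"\bsing\b", text))

print(results)
```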

Try it for yourself:

Find in Alice in Wonderland

Taking strings apart

So far, we have just checked whether a string matched an expression, without being able to say more specifically which parts of the string matched which parts of the expression. Now we change that.

Remember the mysterious Match object from above?

>>> mystring = "procrastinate"

>>> re.search("ocra", mystring)

<re.Match object; span=(2, 6), match='ocra'>

If you look closely, you see that this Match object says what was matched: the string 'ocra' starting from the 3rd letter and ending before the 7th letter (span = (2, 6)). Here is how you extract this information:

>>> mystring = "procrastinate"

>>> mobj = re.search("ocra", mystring)

>>> mobj.group(0)

'ocra'

>>> mobj.start()

2

>>> mobj.end()

6

So re.search() returns a Match object that has the following methods:

You can also do more fine-grained matches by using parentheses: Each pair of parentheses creates a "subgroup" that will be reported separately. Suppose we have a string that contains a phone number:

>>> mystring = "Phone number: 512-123-4567"

Then we can take it apart like this:

>>> mobj = re.search(r"Phone\D*(\d+)-(\d+)-(\d+)", mystring)

>>> mobj.group(0)

'Phone number: 512-123-4567'

>>> mobj.groups()

('512', '123', '4567')

>>> mobj.group(1)

'512'

>>> mobj.group(2)

'123'

>>> mobj.group(3)

'4567'

>>> mobj.start(1)

14

>>> mobj.end(1)

17

So, Match objects also have the following methods:
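As a recap of the calls used in the session above, here is a sketch that runs them in a script rather than at the prompt. The if test guards against re.search() returning None. The names parts and second_by_slicing are introduced only for this illustration.

```python
import re

mystring = "Phone number: 512-123-4567"
mobj = re.search(r"Phone\D*(\d+)-(\d+)-(\d+)", mystring)

if mobj:  # None would mean "no match"
    whole = mobj.group(0)     # the complete match
    parts = mobj.groups()     # all parenthesized subgroups as a tuple
    second = mobj.group(2)    # one subgroup by number, counting from 1
    # start(n) and end(n) give the position of subgroup n in the original string
    second_by_slicing = mystring[mobj.start(2):mobj.end(2)]

    print(whole)
    print(parts)
    print(second, mobj.start(2), mobj.end(2), second_by_slicing)
```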

Try it for yourself:

Using alice_lines from above, print all words of 6 letters or more occurring in Alice in Wonderland (not the whole lines, just the matching words)

Greedy matching

Suppose we want to extract a substring starting at "<" and ending at a matching ">". We might try this:

>>> mystring = "<abc><def>"

>>> mobj = re.search("<.*>", mystring)

>>> mobj.group(0)

'<abc><def>'

But this didn't just extract <abc>, it extracted all of <abc><def> even though there was a ">" after the "c". This is because the repetition operators * and + are greedy: they match as many characters as they can. The longest stretch that starts with "<", continues with arbitrary characters (which may include ">"), and ends in ">" is the whole string, so that is what gets matched.

You can tell your regular expression to match the shortest possible substring instead of the longest one, like this:

>>> mobj = re.search("<.*?>", mystring)

>>> mobj.group(0)

'<abc>'

By using *? instead of just *, you do a non-greedy match. There is also +?, the non-greedy version of +.
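Here is a sketch comparing the greedy and non-greedy forms of + on the same string. The third pattern is an alternative that needs no "?": it uses a negated bracket expression so that the repetition cannot run past the first ">".

```python
import re

mystring = "<abc><def>"

greedy = re.search("<.+>", mystring).group(0)         # as much as possible
non_greedy = re.search("<.+?>", mystring).group(0)    # as little as possible
negated = re.search("<[^>]*>", mystring).group(0)     # "anything but >" between the brackets

print(greedy)      # <abc><def>
print(non_greedy)  # <abc>
print(negated)     # <abc>
```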

Other functions

The re package in Python has other functions besides re.search(). You can find the documentation here: https://docs.python.org/3/library/re.html.

In particular, look at
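Three functions from that documentation page that come up often are re.findall(), re.sub() and re.split(). Picking these three is an assumption made for this sketch, and the test sentence is made up.

```python
import re

# Made-up test sentence for this sketch.
text = "Alice met the Hatter, the Hare and the Dormouse."

# re.findall() returns all non-overlapping matches as a list of strings
capitalized = re.findall(r"[A-Z][a-z]+", text)

# re.sub() replaces every match with the given replacement string
replaced = re.sub(r"\bthe\b", "a", text)

# re.split() splits the string wherever the pattern matches
pieces = re.split(r",?\s+", text)

print(capitalized)
print(replaced)
print(pieces)
```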

A more complex task

In the "Conditions, Lists and Loops" worksheet I talked about the task of splitting off punctuation. As I said there, Python's split() function leaves punctuation wherever it attaches:

>>> sentence = "Computational linguistics (sometimes also referred to as natural language processing, NLP) is a highly interdisciplinary area."

>>> sentence.split()

['Computational', 'linguistics', '(sometimes', 'also', 'referred', 'to', 'as', 'natural', 'language', 'processing,', 'NLP)', 'is', 'a', 'highly', 'interdisciplinary', 'area.']

But when we count words, "(sometimes" and "sometimes" count as different strings -- which is not what we want. Instead, we would like punctuation to be separate, like this:

[ 'Computational', 'linguistics', '(', 'sometimes', 'also', 'referred', 'to', 'as', 'natural', 'language', 'processing', ',', 'NLP', ')', 'is', 'a', 'highly', 'interdisciplinary', 'area', '.' ]

Here is a first stab at an answer:

import re

import string

sentence = "Computational linguistics (sometimes also referred to as natural language processing, NLP) is a highly interdisciplinary area."

# zero or more punctuation characters, then zero or more arbitrary characters

# ending in a non-punctuation character,

# then zero or more punctuation characters.

# All three pieces extracted using ( )

# re.escape() protects characters such as ], ^ and - that are special inside [ ]

punct = re.escape(string.punctuation)

expr = "([" + punct + "]*)(.*[^" + punct + "])([" + punct + "]*)"


# this will hold the split sentence.

pieces = [ ]

for word in sentence.split():

    m = re.search(expr, word)

    # append all the non-empty pieces to the new sentence

    for piece in m.groups():

        if len(piece) > 0:

            pieces.append(piece)


This produces the following result:

>>> print(pieces)

['Computational', 'linguistics', '(', 'sometimes', 'also', 'referred', 'to', 'as', 'natural', 'language', 'processing', ',', 'NLP', ')', 'is', 'a', 'highly', 'interdisciplinary', 'area', '.']

So, we get this sentence right. But as I said, this is only a first stab at a solution. The result will be suboptimal for:
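Two cases that are easy to check are sketched below with made-up tokens. The expression is rebuilt here with re.escape(), which is a choice made for this sketch. A token with no non-punctuation character at all cannot match the middle group, so re.search() returns None, and calling m.groups() on it in the loop above would raise an error. A token followed by several punctuation marks keeps them glued together as one piece.

```python
import re
import string

# Same three-part expression as above; re.escape() is used here so that
# characters like ] and ^ are safe inside the square brackets.
punct = re.escape(string.punctuation)
expr = "([" + punct + "]*)(.*[^" + punct + "])([" + punct + "]*)"

# Made-up test tokens for this sketch.
results = {}
for word in ["area.", "--", "U.S.A.)"]:
    m = re.search(expr, word)
    # Guard against None before asking for the groups
    results[word] = m.groups() if m else None

print(results)
```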

Try it for yourself: