Regular Expressions Module

Special Sequences

We’ve already learned about character ranges for matching digits and alphanumeric characters. These ranges are so common that there is a special shorthand for them.

We can use \d to match a digit character:

>>> re.search(r'\d', greeting)
>>> re.search(r'\d', '$100')
<_sre.SRE_Match object; span=(1, 2), match='1'>

So here’s another way to match a string containing one or more digit characters only:

>>> re.search(r'^\d+$', "$100")
>>> re.search(r'^\d+$', "100")
<_sre.SRE_Match object; span=(0, 3), match='100'>

The \d in these patterns is a special sequence.

Special sequences look like escape characters in Python strings. They consist of a backslash and another character that denotes what the sequence represents.

Let’s look at a couple of others.

A capital D sequence matches non-digits:

>>> re.search(r'\D', '100')
>>> re.search(r'\D', '$100')
<_sre.SRE_Match object; span=(0, 1), match='$'>

So \d and \D are essentially shorthands for the digit ranges we’ve already seen:

>>> re.search(r'\D', '100')
>>> re.search(r'[^0-9]', '100')
>>> re.search(r'\d', '100')
<_sre.SRE_Match object; span=(0, 1), match='1'>
>>> re.search(r'[0-9]', '100')
<_sre.SRE_Match object; span=(0, 1), match='1'>
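
The word "essentially" matters here. With ordinary string patterns in Python 3, \d also matches decimal digits from other scripts, while [0-9] matches only the ten ASCII digits. Here is a minimal sketch of the difference, assuming an Arabic-Indic digit as the test character (chosen only for illustration) and using the re.ASCII flag, passed through the flags argument covered under Case Sensitivity:

```python
import re

arabic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

# \d matches any Unicode decimal digit by default for str patterns
unicode_match = re.search(r'\d', arabic_three)

# [0-9] only ever matches the ten ASCII digits
range_match = re.search(r'[0-9]', arabic_three)

# The re.ASCII flag restricts \d to the same ten ASCII digits
ascii_match = re.search(r'\d', arabic_three, re.ASCII)

print(unicode_match is not None)  # True
print(range_match)                # None
print(ascii_match)                # None
```

For plain ASCII input like '100' the two spellings behave the same.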

Another common special sequence is \w:

>>> re.search(r'^\w*$', "$hell0")
>>> re.search(r'^\w*$', "hello")
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'^\w*$', "hello_there")
<_sre.SRE_Match object; span=(0, 11), match='hello_there'>
>>> re.search(r'^\w*$', "hello there")
>>> re.search(r'^\w*$', "hello40")
<_sre.SRE_Match object; span=(0, 7), match='hello40'>
>>> re.search(r'^[A-Za-z0-9_]*$', "hello40")
<_sre.SRE_Match object; span=(0, 7), match='hello40'>
>>> re.search(r'^[A-Za-z0-9_]*$', "hello40_")
<_sre.SRE_Match object; span=(0, 8), match='hello40_'>
>>> re.search(r'^[A-Za-z0-9_]*$', "hello40 ")

The \w sequence matches alphanumeric characters (aka “word” characters). These are letters, digits, or underscores.
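
In Python 3, "letters" is broader than A-Z for string patterns: \w also accepts accented and other non-ASCII letters, which the explicit character class does not. A small sketch, assuming the word "café" as an illustrative input:

```python
import re

word = 'caf\u00e9'  # "café", ending in a non-ASCII letter

# \w treats the accented letter as a word character
matches_w = re.search(r'^\w+$', word) is not None

# The explicit ASCII class stops at the accented letter
matches_class = re.search(r'^[A-Za-z0-9_]+$', word) is not None

print(matches_w)      # True
print(matches_class)  # False
```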

There’s also \s which matches whitespace characters:

>>> re.search(r'^\s*$', " ")
<_sre.SRE_Match object; span=(0, 1), match=' '>
>>> re.search(r'^\s*$', " \n")
<_sre.SRE_Match object; span=(0, 2), match=' \n'>
>>> re.search(r'^\s*$', " _\n")

We could match two words like this:

>>> re.search(r'^\w+\s+\w+$', 'hithere')
>>> re.search(r'^\w+\s+\w+$', 'hi there')
<_sre.SRE_Match object; span=(0, 8), match='hi there'>
>>> re.search(r'^\w+\s+\w+$', 'hi there, Trey')
>>> re.search(r'^\w+\s+\w+$', 'hi there Trey')

This \s sequence matches newlines, tabs, and spaces. It also matches less common characters like vertical tabs, form feeds, and carriage returns.
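
To check that list for ourselves, we can test each character against \s in turn. This is only a sketch: the dictionary of names is made up for readability and is not part of the re module.

```python
import re

# Each name maps to one whitespace character that \s is expected to match
whitespace_chars = {
    'space': ' ',
    'tab': '\t',
    'newline': '\n',
    'carriage return': '\r',
    'vertical tab': '\v',
    'form feed': '\f',
}

# Record whether \s matched each single character
results = {
    name: re.search(r'\s', char) is not None
    for name, char in whitespace_chars.items()
}

for name, matched in results.items():
    print(f'{name}: {matched}')
```

Every entry should print True.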

Just like the digit special sequence, the word and space sequences have opposites. Capital W matches non-word characters and capital S matches non-whitespace characters:

>>> re.search(r'^\w+\W+\w+$', 'hi there')
<_sre.SRE_Match object; span=(0, 8), match='hi there'>
>>> re.search(r'^\w+\W+\w+$', 'hi*there')
<_sre.SRE_Match object; span=(0, 8), match='hi*there'>
>>> re.search(r'^\S+\s+\S+$', '_ $#@!')
<_sre.SRE_Match object; span=(0, 6), match='_ $#@!'>

Word Boundaries

The \b special sequence is used for denoting word boundaries. This is an anchor and, just like the ^ and $ anchors we’ve already seen, it doesn’t consume a character. It matches a location where a word character (\w) sits next to a non-word character (\W), or next to the start or end of the string. Whitespace creates such a boundary, and so does punctuation.

>>> re.search(r'\bhello\b', 'hello')
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'\bhello\b', 'hello_there')
>>> re.search(r'\bhello\b', 'hello there')
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'\bhello\b', 'oh hello there')
<_sre.SRE_Match object; span=(3, 8), match='hello'>
>>> re.search(r'\bhello\b', 'ohhello there')
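
A word boundary falls between a word character and a non-word character, so punctuation produces one just as a space does, while digits and underscores do not. The sketch below uses made-up sample strings to show both cases:

```python
import re

pattern = r'\bhello\b'

# Punctuation is not a word character, so it creates a boundary just like a space does
with_boundary = ['hello, world', '"hello"', 'oh-hello-there']

# Digits, underscores, and letters are word characters, so no boundary sits next to them
without_boundary = ['hello2', '_hello', 'ohhello']

for text in with_boundary + without_boundary:
    print(repr(text), re.search(pattern, text) is not None)
```

The first three strings should report True and the last three False.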

Remember how we always put an r before our regular expression strings to make them raw strings?

If we forget to do that when matching a word boundary, bad things happen:

>>> re.search('\bhello\b', 'hello')
>>> '\b'
'\x08'
>>> '\\b'
'\\b'
>>> r'\b'
'\\b'

The \b escape character represents a backspace. This character can be used to remove characters from the terminal and redraw them.

If we want to represent a word boundary in our regular expressions, we need to put \b in a raw string. Otherwise we’d have to double-escape it (\\b) so Python doesn’t think we’re talking about a backspace.
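
The three spellings can be compared side by side. In this sketch the variable names are arbitrary; the point is that the raw string and the double-escaped string are the same pattern, while the unescaped one silently contains backspace characters:

```python
import re

escaped = '\\bhello\\b'   # regular string with doubled backslashes
raw = r'\bhello\b'        # raw string, backslashes left alone
broken = '\bhello\b'      # regular string: each \b becomes one backspace character

print(escaped == raw)          # True
print(len(raw), len(broken))   # 9 7

# Only the real word-boundary pattern finds the word
print(re.search(raw, 'hello there') is not None)  # True
print(re.search(broken, 'hello there'))           # None
```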

Repeats

Let’s write a regular expression that validates United States ZIP codes (shortened format only):

>>> re.search(r'^\d\d\d\d\d$', '90210')
<_sre.SRE_Match object; span=(0, 5), match='90210'>
>>> re.search(r'^\d\d\d\d\d$', '123456')
>>> re.search(r'^\d\d\d\d\d$', '1234')
>>> re.search(r'^\d\d\d\d\d$', '10001')
<_sre.SRE_Match object; span=(0, 5), match='10001'>

This regular expression matches 5 consecutive digits. We have a shortcut for matching 1 or more consecutive characters:

>>> re.search(r'^\d+$', '1234')
<_sre.SRE_Match object; span=(0, 4), match='1234'>

There’s also a shortcut for matching a particular number of consecutive characters:

>>> re.search(r'^\d{5}$', '1234')
>>> re.search(r'^\d{5}$', '12345')
<_sre.SRE_Match object; span=(0, 5), match='12345'>
>>> re.search(r'^\d{5}$', '123456')
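
Wrapped in a function, this pattern becomes a small validator. The helper name is_zip_code is hypothetical, not something from the exercises. One detail worth knowing when validating: $ also matches just before a single trailing newline, so this sketch compares the anchored search with re.fullmatch, which requires the whole string to match:

```python
import re

def is_zip_code(text):
    """Return True if text looks like a 5-digit ZIP code (hypothetical helper)."""
    return re.search(r'^\d{5}$', text) is not None

print(is_zip_code('90210'))    # True
print(is_zip_code('1234'))     # False
print(is_zip_code('123456'))   # False

# $ tolerates one trailing newline, so this still passes
print(is_zip_code('12345\n'))  # True

# re.fullmatch must consume the entire string, newline included
strict = re.fullmatch(r'\d{5}', '12345\n')
print(strict)                  # None
```

Whether the trailing newline should be accepted depends on where the input comes from, so treat the stricter form as an option and not a requirement.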

The same curly-brace syntax also accepts a range of repetition counts.

Words 3 to 5 letters long:

>>> re.search(r'^\w{3,5}$', 'hi')
>>> re.search(r'^\w{3,5}$', 'cat')
<_sre.SRE_Match object; span=(0, 3), match='cat'>
>>> re.search(r'^\w{3,5}$', 'ball')
<_sre.SRE_Match object; span=(0, 4), match='ball'>
>>> re.search(r'^\w{3,5}$', 'hello')
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'^\w{3,5}$', 'mellow')

Words 3 or more letters long:

>>> re.search(r'^\w{3,}$', 'cat')
<_sre.SRE_Match object; span=(0, 3), match='cat'>
>>> re.search(r'^\w{3,}$', 'ball')
<_sre.SRE_Match object; span=(0, 4), match='ball'>
>>> re.search(r'^\w{3,}$', 'hi')

Words 3 or fewer letters long:

>>> re.search(r'^\w{,3}$', 'hi')
<_sre.SRE_Match object; span=(0, 2), match='hi'>
>>> re.search(r'^\w{,3}$', 'cat')
<_sre.SRE_Match object; span=(0, 3), match='cat'>
>>> re.search(r'^\w{,3}$', 'ball')
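
The curly-brace forms relate to the + and * shortcuts: {1,} behaves like + and {0,} behaves like *. A missing lower bound means zero, so {,3} also accepts the empty string. A quick sketch over a few sample strings (chosen arbitrarily) checks both points:

```python
import re

samples = ['', 'a', 'abc', 'abcd']

# {1,} is the long form of +, and {0,} is the long form of *
for text in samples:
    assert bool(re.search(r'^\w{1,}$', text)) == bool(re.search(r'^\w+$', text))
    assert bool(re.search(r'^\w{0,}$', text)) == bool(re.search(r'^\w*$', text))

# {,3} means "from 0 to 3", so the empty string passes too
up_to_three = [text for text in samples if re.search(r'^\w{,3}$', text)]
print(up_to_three)  # ['', 'a', 'abc']
```

If empty input should be rejected, {1,3} states the lower bound explicitly.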

Case Sensitivity

There is an optional third argument that we can provide to Python’s re.search function. This third argument is a flags argument.

One of these flags is the IGNORECASE flag. With it, letters in the pattern match both their lowercase and uppercase forms.

Here’s how we use this flag:

>>> re.search('hello', "Hello there")
>>> re.search('hello', "Hello there", re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 5), match='Hello'>

This works for character classes too:

>>> re.search('[A-Z]', "hi")
>>> re.search('[A-Z]', "hi", re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 1), match='h'>
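
The same flag can be written in two shorter ways: re.I is an alias for re.IGNORECASE, and (?i) at the very start of a pattern turns the flag on inline. This sketch runs all three spellings against the same text, with variable names chosen only for illustration:

```python
import re

text = "Hello there"

long_flag = re.search(r'hello', text, re.IGNORECASE)
short_flag = re.search(r'hello', text, re.I)   # re.I is an alias for re.IGNORECASE
inline_flag = re.search(r'(?i)hello', text)    # inline flag at the start of the pattern

print(long_flag.group(), short_flag.group(), inline_flag.group())  # Hello Hello Hello
```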

The documentation has more information on the flags argument.

Searching

So far we’ve used regular expressions to validate whether something is inside a string or validate whether a string looks a certain way.

What if we want to find out the value that the regular expression actually matched?

The answer lies in the match object that is returned for a successful match.

We can get the matched text by calling the group method on our match object:

>>> sentence = "I'm flying out of SAN right now."
>>> m = re.search(r'\b[A-Z]{3}\b', sentence)
>>> m.group()
'SAN'

You can use the help function to find other features of match objects:

>>> help(m)
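
Among those features, start, end, and span report where the match sits in the original string. Since re.search returns None when nothing matches, it is usually worth checking for that before calling any method. A minimal sketch using the same sentence:

```python
import re

sentence = "I'm flying out of SAN right now."
m = re.search(r'\b[A-Z]{3}\b', sentence)

if m is not None:  # re.search returns None when nothing matches
    print(m.group())                    # SAN
    print(m.span())                     # (start, end) as a tuple
    print(sentence[m.start():m.end()])  # SAN

# A pattern that cannot match gives None, which has no group method
no_match = re.search(r'\b[A-Z]{4}\b', sentence)
print(no_match)  # None
```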

Multi-search

What if we’re searching with a regular expression but we don’t want to stop at the first match?

We can find multiple matches using finditer:

>>> sentence = "I'll be flying from SAN to PDX with a stop in SFO on the way"
>>> airport_matches = re.finditer(r'\b[A-Z]{3}\b', sentence)
>>> for m in airport_matches:
...     print(m.group())
...
SAN
PDX
SFO

We could use a list comprehension to store our matches in a list:

>>> airports = [m.group() for m in re.finditer(r'\b[A-Z]{3}\b', sentence)]
>>> airports
['SAN', 'PDX', 'SFO']
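
Because finditer yields match objects, we can keep more than the text. As an illustration, this sketch pairs each airport code with the index where it starts (the name located is arbitrary):

```python
import re

sentence = "I'll be flying from SAN to PDX with a stop in SFO on the way"

# Keep each airport code together with the index where it starts
located = [(m.group(), m.start()) for m in re.finditer(r'\b[A-Z]{3}\b', sentence)]

for code, start in located:
    # Slicing the sentence at the recorded position gives back the same code
    print(code, start, sentence[start:start + 3] == code)
```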

There’s also a helper function, findall, that does something similar:

>>> airport_codes = re.findall(r'\b[A-Z]{3}\b', sentence)
>>> airport_codes
['SAN', 'PDX', 'SFO']

This findall function does not return match objects. Instead it returns a list containing the full string of each match.
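
One thing to watch for: findall only returns the full matched text when the pattern has no capturing parentheses. Capture groups have not been introduced in this section, so treat the second pattern below as a preview and an assumption about where you might go next. With one group, findall returns what the group captured:

```python
import re

sentence = "I'll be flying from SAN to PDX with a stop in SFO on the way"

# No parentheses in the pattern: findall returns each whole match
whole = re.findall(r'\b[A-Z]{3}\b', sentence)

# One capture group: findall returns only what the group captured
first_letters = re.findall(r'\b([A-Z])[A-Z]{2}\b', sentence)

print(whole)          # ['SAN', 'PDX', 'SFO']
print(first_letters)  # ['S', 'P', 'S']
```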

Validation and Search Exercises

These exercises are all in the validation.py file in the exercises directory. Edit the appropriate function in that file to complete each exercise. To run the tests, from the exercises folder, type python test.py <function_name>, like this:

$ python test.py has_word

Has Word

Create a function has_word that accepts a single word string and a sentence string and returns True if the sentence contains the word (as a word by itself), or False otherwise.

Tip

Modify the has_word function in the validation module.

Your function should work like this:

>>> has_word('help', 'She was a big help when I learned French')
True
>>> has_word('help', 'She helped me learn French')
False

Four Letter Words

Create a function get_4_letter_words that accepts a sentence and returns a list of all four-letter words in it.

Tip

Modify the get_4_letter_words function in the validation module.

Your function should work like this:

>>> get_4_letter_words("She was a big help when I learned French")
['help', 'when']
>>> get_4_letter_words('What is going on here?')
['What', 'here']

Is Email

Create a function is_email that accepts a string and returns True if the string represents a valid email address.

Tip

Modify the is_email function in the validation module.

Your function should work like this:

>>> is_email('123@example.com')
True
>>> is_email('info123@help.example.com')
True
>>> is_email('help+info@help-example.com')
True
>>> is_email('100%@help-example.com')
True
>>> is_email('123@example.c')
False
>>> is_email('123example.com')
False

Is Phone Number

Create a function is_phone_number that accepts a string and returns True if the string represents a valid US-style phone number.

Let’s just concern ourselves with allowing (xxx)yyy-zzzz, (xxx) yyy-zzzz, or xxx-yyy-zzzz.

Tip

Modify the is_phone_number function in the validation module.

Your function should work like this:

>>> is_phone_number('202-762-1401')
True
>>> is_phone_number('(202)762-1401')
True
>>> is_phone_number('(202) 762-1401')
True
>>> is_phone_number('20-2762-1401')
False
>>> is_phone_number('202 762-1401')
True
>>> is_phone_number('2027621401')
True

Get Email

Create a function get_email that accepts a string, searches for an email address and returns the email address from the string. If there is no valid email, it should return None.

Tip

Modify the get_email function in the validation module.

Your function should work like this:

>>> get_email('Send an email to info@example.com for information')
'info@example.com'
>>> get_email('Do not use email of info@example.c.')
>>> get_email('Help is available at info123@help.example.com.')
'info123@help.example.com'
