Regular Expressions Module
Special Sequences
We’ve already learned about character ranges for matching digits and alphanumeric characters. These are so common that there is a special shorthand for them.
We can use \d to match a digit character:
>>> re.search(r'\d', greeting)
>>> re.search(r'\d', '$100')
<_sre.SRE_Match object; span=(1, 2), match='1'>
So here’s another way to match a string containing one or more digit characters only:
>>> re.search(r'^\d+$', "$100")
>>> re.search(r'^\d+$', "100")
<_sre.SRE_Match object; span=(0, 3), match='100'>
This is a special sequence.
Special sequences look like escape characters in Python strings. They consist of a backslash and another character that denotes what the sequence represents.
Let’s look at a couple of others.
A capital D sequence (\D) matches non-digits:
>>> re.search(r'\D', '100')
>>> re.search(r'\D', '$100')
<_sre.SRE_Match object; span=(0, 1), match='$'>
So \d and \D are essentially shorthands for the digit ranges we’ve already seen:
>>> re.search(r'\D', '100')
>>> re.search(r'[^0-9]', '100')
>>> re.search(r'\d', '100')
<_sre.SRE_Match object; span=(0, 1), match='1'>
>>> re.search(r'[0-9]', '100')
<_sre.SRE_Match object; span=(0, 1), match='1'>
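One caveat behind the word "essentially": with ordinary string patterns in Python 3, \d also matches decimal digits from other scripts, while [0-9] does not. The sketch below illustrates this and shows the re.ASCII flag, which restricts \d to 0-9. The sample character is just an illustration and is not taken from the examples above.

```python
import re

# ARABIC-INDIC DIGIT THREE, picked only as an illustration
arabic_three = '\u0663'

# In a str pattern, \d matches any Unicode decimal digit
unicode_match = re.search(r'\d', arabic_three)

# The explicit range only matches the ten ASCII digits
range_match = re.search(r'[0-9]', arabic_three)

# With re.ASCII, \d behaves like [0-9]
ascii_match = re.search(r'\d', arabic_three, re.ASCII)

print(unicode_match is not None)  # True
print(range_match is not None)    # False
print(ascii_match is not None)    # False
```

For plain ASCII input such as '100', the two spellings behave the same.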
Another common special sequence is \w:
>>> re.search(r'^\w*$', "$hell0")
>>> re.search(r'^\w*$', "hello")
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'^\w*$', "hello_there")
<_sre.SRE_Match object; span=(0, 11), match='hello_there'>
>>> re.search(r'^\w*$', "hello there")
>>> re.search(r'^\w*$', "hello40")
<_sre.SRE_Match object; span=(0, 7), match='hello40'>
>>> re.search(r'^[A-Za-z0-9_]*$', "hello40")
<_sre.SRE_Match object; span=(0, 7), match='hello40'>
>>> re.search(r'^[A-Za-z0-9_]*$', "hello40_")
<_sre.SRE_Match object; span=(0, 8), match='hello40_'>
>>> re.search(r'^[A-Za-z0-9_]*$', "hello40 ")
This matches alphanumeric characters (aka “word” characters). These are letters, digits, or underscores.
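The [A-Za-z0-9_] range is a close stand-in for \w, but not an exact one: in Python 3 string patterns, \w also accepts letters outside ASCII. A small sketch of the difference, using an accented word chosen purely as an example:

```python
import re

word = 'caf\u00e9'  # "café", an illustrative non-ASCII word

# \w accepts the accented letter
unicode_word = re.search(r'^\w*$', word)

# The explicit ASCII range does not
range_word = re.search(r'^[A-Za-z0-9_]*$', word)

# With re.ASCII, \w is limited to [A-Za-z0-9_]
ascii_word = re.search(r'^\w*$', word, re.ASCII)

print(unicode_word is not None)  # True
print(range_word is not None)    # False
print(ascii_word is not None)    # False
```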
There’s also \s, which matches whitespace characters:
>>> re.search(r'^\s*$', " ")
<_sre.SRE_Match object; span=(0, 1), match=' '>
>>> re.search(r'^\s*$', " \n")
<_sre.SRE_Match object; span=(0, 2), match=' \n'>
>>> re.search(r'^\s*$', " _\n")
We could match two words like this:
>>> re.search(r'^\w+\s+\w+$', 'hithere')
>>> re.search(r'^\w+\s+\w+$', 'hi there')
<_sre.SRE_Match object; span=(0, 8), match='hi there'>
>>> re.search(r'^\w+\s+\w+$', 'hi there, Trey')
>>> re.search(r'^\w+\s+\w+$', 'hi there Trey')
This \s sequence matches newline characters, tabs, and spaces. It also matches less common whitespace such as vertical tabs, form feeds, and carriage returns.
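To see this concretely, here is a minimal sketch that tests \s against each of those characters one at a time. The dictionary and its labels are made up for the illustration.

```python
import re

# Hypothetical labels mapped to the whitespace characters named above
whitespace_samples = {
    'space': ' ',
    'tab': '\t',
    'newline': '\n',
    'carriage return': '\r',
    'form feed': '\f',
    'vertical tab': '\v',
}

for name, char in whitespace_samples.items():
    matched = re.search(r'\s', char) is not None
    print(f'{name}: {matched}')  # every line prints True

# An underscore is a word character, not whitespace
underscore_match = re.search(r'\s', '_')
print(underscore_match is None)  # True
```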
Just like the digit special sequence, the word and space sequences have opposites. Capital W (\W) matches non-word characters and capital S (\S) matches non-space characters:
>>> re.search(r'^\w+\W+\w+$', 'hi there')
<_sre.SRE_Match object; span=(0, 8), match='hi there'>
>>> re.search(r'^\w+\W+\w+$', 'hi*there')
<_sre.SRE_Match object; span=(0, 8), match='hi*there'>
>>> re.search(r'^\S+\s+\S+$', '_ $#@!')
<_sre.SRE_Match object; span=(0, 6), match='_ $#@!'>
Word Boundaries
The \b special sequence denotes word boundaries. It is an anchor: just like the ^ and $ anchors we’ve already seen, it doesn’t consume a character. It matches at a location between a word character and a non-word character (such as whitespace or punctuation), or at the start or end of the string when a word character sits there.
>>> re.search(r'\bhello\b', 'hello')
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'\bhello\b', 'hello_there')
>>> re.search(r'\bhello\b', 'hello there')
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'\bhello\b', 'oh hello there')
<_sre.SRE_Match object; span=(3, 8), match='hello'>
>>> re.search(r'\bhello\b', 'ohhello there')
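One way to build intuition is to list every position where \b matches. The sketch below does that with re.finditer (covered later in this section) and also checks two edge cases. The sample strings are illustrative.

```python
import re

# Each position is a spot between a word character and a non-word
# character, or the start/end of the string next to a word character.
boundaries = [m.start() for m in re.finditer(r'\b', 'hello there')]
print(boundaries)  # [0, 5, 6, 11]

# Punctuation creates a boundary just like whitespace does
comma_match = re.search(r'\bhello\b', 'hello, world')
print(comma_match is not None)  # True

# An underscore is a word character, so there is no boundary after "hello"
underscore_match = re.search(r'\bhello\b', 'hello_there')
print(underscore_match is None)  # True
```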
Remember how we always put an r before our regular expression strings to make them raw strings?
If we forget to do that when matching a word boundary, bad things happen:
>>> re.search('\bhello\b', 'hello')
>>> '\b'
'\x08'
>>> '\\b'
'\\b'
>>> r'\b'
'\\b'
In a regular Python string, \b is an escape sequence that represents a backspace character. This character can be used to remove characters from the terminal and redraw them.
If we want to represent a word boundary in our regular expressions, we need to put \b in a raw string. That avoids the need to double-escape it so Python doesn’t think we’re talking about a backspace.
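As a quick check, the raw-string form and the double-escaped form produce exactly the same pattern text, so either one works with re.search. The variable names here are only for illustration.

```python
import re

raw_pattern = r'\bhello\b'
escaped_pattern = '\\bhello\\b'

# Both spellings are the same 9-character string
print(raw_pattern == escaped_pattern)  # True

# '\b' is one backspace character; r'\b' is a backslash followed by "b"
print(len('\b'), len(r'\b'))  # 1 2

match = re.search(escaped_pattern, 'oh hello there')
print(match.group())  # hello
```

The raw string is usually easier to read, which is why it is used throughout these examples.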
Repeats
Let’s write a regular expression that validates United States ZIP codes (shortened format only):
>>> re.search(r'^\d\d\d\d\d$', '90210')
<_sre.SRE_Match object; span=(0, 5), match='90210'>
>>> re.search(r'^\d\d\d\d\d$', '123456')
>>> re.search(r'^\d\d\d\d\d$', '1234')
>>> re.search(r'^\d\d\d\d\d$', '10001')
<_sre.SRE_Match object; span=(0, 5), match='10001'>
This regular expression matches 5 consecutive digits. We have a shortcut for matching 1 or more consecutive characters:
>>> re.search(r'^\d+$', '1234')
<_sre.SRE_Match object; span=(0, 4), match='1234'>
There’s also a shortcut for matching a particular number of consecutive characters:
>>> re.search(r'^\d{5}$', '1234')
>>> re.search(r'^\d{5}$', '12345')
<_sre.SRE_Match object; span=(0, 5), match='12345'>
>>> re.search(r'^\d{5}$', '123456')
We can also match ranges of repetitions with this.
Words 3 to 5 letters long:
>>> re.search(r'^\w{3,5}$', 'hi')
>>> re.search(r'^\w{3,5}$', 'cat')
<_sre.SRE_Match object; span=(0, 3), match='cat'>
>>> re.search(r'^\w{3,5}$', 'ball')
<_sre.SRE_Match object; span=(0, 4), match='ball'>
>>> re.search(r'^\w{3,5}$', 'hello')
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'^\w{3,5}$', 'mellow')
Words 3 or more letters long:
>>> re.search(r'^\w{3,}$', 'cat')
<_sre.SRE_Match object; span=(0, 3), match='cat'>
>>> re.search(r'^\w{3,}$', 'ball')
<_sre.SRE_Match object; span=(0, 4), match='ball'>
>>> re.search(r'^\w{3,}$', 'hi')
Words 3 or fewer letters long:
>>> re.search(r'^\w{,3}$', 'hi')
<_sre.SRE_Match object; span=(0, 2), match='hi'>
>>> re.search(r'^\w{,3}$', 'cat')
<_sre.SRE_Match object; span=(0, 3), match='cat'>
>>> re.search(r'^\w{,3}$', 'ball')
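The + and * shortcuts used earlier can be seen as special cases of this brace syntax. The sketch below compares them side by side. The matches helper and the sample strings are hypothetical, written just for this comparison.

```python
import re

samples = ['', '7', '42', '12345']

def matches(pattern):
    """Return, for each sample, whether the pattern matches it."""
    return [re.search(pattern, s) is not None for s in samples]

# + means "one or more", the same as {1,}
print(matches(r'^\d+$'))     # [False, True, True, True]
print(matches(r'^\d{1,}$'))  # [False, True, True, True]

# * means "zero or more", the same as {0,}
print(matches(r'^\d*$'))     # [True, True, True, True]
print(matches(r'^\d{0,}$'))  # [True, True, True, True]

# An exact count matches only that many repetitions
print(matches(r'^\d{2}$'))   # [False, False, True, False]
```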
Case Sensitivity
There is an optional third argument that we can provide to Python’s re.search function: the flags argument.
One of these flags is the IGNORECASE flag. It makes the pattern match letters regardless of case, so a lowercase letter in the pattern also matches its uppercase form and vice versa.
Here’s how we use this flag:
>>> re.search('hello', "Hello there")
>>> re.search('hello', "Hello there", re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 5), match='Hello'>
This works for character classes too:
>>> re.search('[A-Z]', "hi")
>>> re.search('[A-Z]', "hi", re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 1), match='h'>
The documentation has more information on the flags argument.
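A few other ways to pass flags, as a rough sketch: re.I is a short alias for re.IGNORECASE, the argument can be given by keyword, and several flags can be combined with the | operator. The particular combination with re.ASCII below is only an illustration.

```python
import re

# re.I is a shorter name for the same flag
print(re.I == re.IGNORECASE)  # True

short_form = re.search(r'hello', 'Hello there', re.I)
keyword_form = re.search(r'hello', 'Hello there', flags=re.IGNORECASE)
print(short_form.group(), keyword_form.group())  # Hello Hello

# Multiple flags are combined with |
combined = re.search(r'^[a-z]+$', 'HeLLo', re.IGNORECASE | re.ASCII)
print(combined.group())  # HeLLo
```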
Searching
So far we’ve used regular expressions to validate whether something is inside a string or validate whether a string looks a certain way.
What if we want to find out the value that the regular expression actually matched?
The secret lies in the match object that is returned during a positive match.
We can do this with the group method on our match object:
>>> sentence = "I'm flying out of SAN right now."
>>> m = re.search(r'\b[A-Z]{3}\b', sentence)
>>> m.group()
'SAN'
You can use the help function to find other features of match objects:
>>> help(m)
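For instance, besides group, match objects have start, end, and span methods that report where the match was found. A minimal sketch using the same sentence:

```python
import re

sentence = "I'm flying out of SAN right now."
m = re.search(r'\b[A-Z]{3}\b', sentence)

# re.search returns None when nothing matches, so check before using m
if m:
    print(m.group())           # SAN
    print(m.start(), m.end())  # 18 21
    print(m.span())            # (18, 21)
    # Slicing the original string with the span gives the matched text
    print(sentence[m.start():m.end()])  # SAN
```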
Multi-search
What if we’re searching with a regular expression but we don’t want to stop at the first match?
We can find multiple matches using finditer:
>>> sentence = "I'll be flying from SAN to PDX with a stop in SFO on the way"
>>> airport_matches = re.finditer(r'\b[A-Z]{3}\b', sentence)
>>> for m in airport_matches:
...     print(m.group())
...
SAN
PDX
SFO
We could use a list comprehension to store our matches in a list:
>>> airports = [m.group() for m in re.finditer(r'\b[A-Z]{3}\b', sentence)]
>>> airports
['SAN', 'PDX', 'SFO']
There’s a helper function we can use that does something similar to this though:
>>> airport_codes = re.findall(r'\b[A-Z]{3}\b', sentence)
>>> airport_codes
['SAN', 'PDX', 'SFO']
This findall function does not return match objects. Instead it returns a list of the strings that were matched.
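The practical difference, sketched below: findall is convenient when only the matched text is needed, while finditer keeps the match objects, so details such as the position of each match remain available. The pattern variable and the second sample string are introduced here only for illustration.

```python
import re

sentence = "I'll be flying from SAN to PDX with a stop in SFO on the way"
pattern = r'\b[A-Z]{3}\b'

# findall: a plain list of strings
codes = re.findall(pattern, sentence)
print(codes)  # ['SAN', 'PDX', 'SFO']

# finditer: match objects, so we can also ask where each match starts
positions = [(m.group(), m.start()) for m in re.finditer(pattern, sentence)]
print(positions)

# With no matches, findall returns an empty list rather than None
no_codes = re.findall(pattern, 'no airports here')
print(no_codes)  # []
```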
Validation and Search Exercises
These exercises are all in the validation.py file in the exercises directory.
Edit the appropriate function in that file to complete each exercise.
To run the tests, from the exercises folder, type python test.py <function_name>, like this:
$ python test.py has_word
Has Word
Create a function has_word that accepts a single word string and a sentence string and returns True if the sentence contains the word (as a word by itself), or False otherwise.
Tip
Modify the has_word function in the validation module.
Your function should work like this:
>>> has_word('help', 'She was a big help when I learned French')
True
>>> has_word('help', 'She helped me learn French')
False
Four Letter Words
Create a function get_4_letter_words which accepts a sentence and returns all four-letter words from the given sentence.
Tip
Modify the get_4_letter_words function in the validation module.
Your function should work like this:
>>> get_4_letter_words("She was a big help when I learned French")
['help', 'when']
>>> get_4_letter_words('What is going on here?')
['What', 'here']
Is Email
Create a function is_email that accepts a string and returns True if the string represents a valid email address.
Tip
Modify the is_email function in the validation module.
Your function should work like this:
>>> is_email('123@example.com')
True
>>> is_email('info123@help.example.com')
True
>>> is_email('help+info@help-example.com')
True
>>> is_email('100%@help-example.com')
True
>>> is_email('123@example.c')
False
>>> is_email('123example.com')
False
Is Phone Number
Create a function is_phone_number that accepts a string and returns True if the string represents a valid US-style phone number.
Let’s just concern ourselves with allowing (xxx)yyy-zzzz, (xxx) yyy-zzzz, or xxx-yyy-zzzz.
Tip
Modify the is_phone_number function in the validation module.
Your function should work like this:
>>> is_phone_number('202-762-1401')
True
>>> is_phone_number('(202)762-1401')
True
>>> is_phone_number('(202) 762-1401')
True
>>> is_phone_number('20-2762-1401')
False
>>> is_phone_number('202 762-1401')
True
>>> is_phone_number('2027621401')
True
Get Email
Create a function get_email that accepts a string, searches for an email address, and returns the email address from the string. If there is no valid email, it should return None.
Tip
Modify the get_email function in the validation module.
Your function should work like this:
>>> get_email('Send an email to info@example.com for information')
'info@example.com'
>>> get_email('Do not use email of info@example.c.')
>>> get_email('Help is available at info123@help.example.com.')
'info123@help.example.com'