Alternations & Substitutions
Verbose Flag
We’ll look at one more flag right now.
Let’s take a regular expression that validates UUIDs (universally unique identifiers):
import re

def is_valid_uuid(uuid):
    return bool(re.search(r'^[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}$', uuid, re.IGNORECASE))
That’s not easy to read at all.
We could make that more readable by splitting it over multiple lines:
def is_valid_uuid(uuid):
    uuid_regex = (
        r'^[a-f\d]{8}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{12}'
        r'$'
    )
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE))
That helps some, but there are a lot of extra quotes around each of the strings and it’s still kind of terse. It would be nice if there were a way to space out this regular expression in a multi-line string.
The VERBOSE flag does exactly this.
We’re already using the IGNORECASE flag. To use two flags we need to combine them with the pipe operator (|), also known as the bitwise OR operator:
def is_valid_uuid(uuid):
    uuid_regex = (
        r'^[a-f\d]{8}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{12}'
        r'$'
    )
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE | re.VERBOSE))
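As a small illustration of how combining flags works, here is a minimal sketch. The names combined and hex_word are made up for this example and are not part of the workshop code. It relies on re.I and re.X being the short aliases for re.IGNORECASE and re.VERBOSE.

```python
import re

# Each flag is an integer-like value, so | merges them into a single value.
combined = re.IGNORECASE | re.VERBOSE

# re.I and re.X are short aliases for the same two flags.
print(combined == (re.I | re.X))  # True

# Hypothetical pattern: one or more letters a through f, in any case.
hex_word = re.compile(r"""
    [a-f]+    # one or more letters a through f
""", combined)

print(bool(hex_word.fullmatch("CAFE")))   # True
print(bool(hex_word.fullmatch("cafe!")))  # False
```

Because both flags travel in one value, the same combined object can be passed to re.search, re.sub, or re.compile.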
Now whitespace in our regular expression is ignored everywhere except inside character classes, so we can spread the pattern over a multi-line string:
def is_valid_uuid(uuid):
    uuid_regex = (r'''
        ^
        [a-f\d] {8}
        -
        [a-f\d] {4}
        -
        [a-f\d] {4}
        -
        [a-f\d] {4}
        -
        [a-f\d] {12}
        $
    ''')
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE | re.VERBOSE))
We can also add as many comments to our regular expression as we want. Note that spaces inside a character class are still matched literally in verbose mode, so we keep the brackets free of whitespace:
def is_valid_uuid(uuid):
    uuid_regex = (r'''
        ^               # beginning of string
        [a-f\d] {8}     # 8 hexadecimal digits
        -               # dash character
        [a-f\d] {4}     # 4 hexadecimal digits
        -               # dash character
        [a-f\d] {4}     # 4 hexadecimal digits
        -               # dash character
        [a-f\d] {4}     # 4 hexadecimal digits
        -               # dash character
        [a-f\d] {12}    # 12 hexadecimal digits
        $               # end of string
    ''')
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE | re.VERBOSE))
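Verbose mode has a few whitespace rules that are easy to trip over. The sketch below uses made-up patterns (ignored, in_class, literal) to show them: spaces outside a character class are dropped, spaces inside a character class still match, and a backslash keeps a space or a # character literal.

```python
import re

# Outside a character class, verbose mode drops the space from the pattern.
ignored = re.compile(r"a b", re.VERBOSE)
print(bool(ignored.fullmatch("ab")))    # True
print(bool(ignored.fullmatch("a b")))   # False

# Inside a character class, a space is still a literal space.
in_class = re.compile(r"[ a-f]+", re.VERBOSE)
print(bool(in_class.fullmatch("a b")))  # True

# A backslash keeps a space or a # character literal.
literal = re.compile(r"issue \  \# \d+", re.VERBOSE)
print(bool(literal.fullmatch("issue #42")))  # True
```

This is why a pattern such as a hexadecimal character class should be written without spaces between the brackets, even in verbose mode.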
Unfortunately regular expressions, unlike Python, do not have a concept of variables or functions so we often can’t make our regular expressions as self-documenting as our Python code.
If you wanted to reuse parts of your regular expression you could try using string formatting, but the curly braces ({ and }) in your regular expression will need to be doubled up in order to escape them when using string formatting:
def is_valid_uuid(uuid):
    hex_re = r'[a-f\d]'
    uuid_re = rf'''
        ^                   # beginning of string
        {hex_re} {{8}}      # 8 hexadecimal digits
        -                   # dash character
        {hex_re} {{4}}      # 4 hexadecimal digits
        -                   # dash character
        {hex_re} {{4}}      # 4 hexadecimal digits
        -                   # dash character
        {hex_re} {{4}}      # 4 hexadecimal digits
        -                   # dash character
        {hex_re} {{12}}     # 12 hexadecimal digits
        $                   # end of string
    '''
    return bool(re.search(uuid_re, uuid, re.IGNORECASE | re.VERBOSE))
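To see what the doubled braces turn into, it can help to print the formatted pattern. The following is only a sketch: group_re is a hypothetical helper name, and it swaps in re.fullmatch for the ^ and $ anchors used above, which is a slightly stricter check because $ also matches just before a trailing newline.

```python
import re

hex_re = r"[a-f\d]"

# In an f-string, {{ and }} produce a literal { and }.
group_re = rf"{hex_re}{{4}}"
print(group_re)  # [a-f\d]{4}

# Build the full pattern from the reusable pieces (hypothetical layout).
uuid_re = "-".join([
    rf"{hex_re}{{8}}",
    group_re,
    group_re,
    group_re,
    rf"{hex_re}{{12}}",
])
print(uuid_re)

def is_valid_uuid(uuid):
    # fullmatch requires the whole string to match, so no anchors are needed.
    return re.fullmatch(uuid_re, uuid, re.IGNORECASE) is not None

print(is_valid_uuid("123E4567-e89b-12d3-a456-426614174000"))  # True
print(is_valid_uuid("123e4567-e89b-12d3-a456"))               # False
```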
Alternations
Let’s write a regular expression that matches valid 24 hour times.
We could start by matching two digits followed by a colon and two more digits:
>>> re.search(r'\d{2}:\d{2}', "00:13")
<_sre.SRE_Match object; span=(0, 5), match='00:13'>
This does match all valid 24 hour times but it also matches a lot of invalid times:
>>> re.search(r'\d{2}:\d{2}', "24:13")
<_sre.SRE_Match object; span=(0, 5), match='24:13'>
>>> re.search(r'\d{2}:\d{2}', "24:60")
<_sre.SRE_Match object; span=(0, 5), match='24:60'>
Let’s fix the minutes first. We only want to match minutes where the first digit is 0 through 5. The second digit can be anything:
>>> re.search(r'\d{2}:[0-5]\d', "24:60")
>>> re.search(r'\d{2}:[0-5]\d', "24:13")
<_sre.SRE_Match object; span=(0, 5), match='24:13'>
Now let’s try fixing the hours. We definitely want to narrow our hours down to starting with 0, 1, or 2. This isn’t quite enough though:
>>> re.search(r'[0-2]\d:[0-5]\d', "23:00")
<_sre.SRE_Match object; span=(0, 5), match='23:00'>
>>> re.search(r'[0-2]\d:[0-5]\d', "33:00")
>>> re.search(r'[0-2]\d:[0-5]\d', "24:00")
<_sre.SRE_Match object; span=(0, 5), match='24:00'>
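One way to see exactly which hours slip through is to test every two-digit value. This is a quick sketch, and the variable name wrongly_accepted is made up for the example:

```python
import re

# Collect every two-digit hour above 23 that [0-2]\d still accepts.
wrongly_accepted = [
    f"{hour:02d}"
    for hour in range(100)
    if hour > 23 and re.fullmatch(r"[0-2]\d", f"{hour:02d}")
]
print(wrongly_accepted)  # ['24', '25', '26', '27', '28', '29']
```

So the problem is limited to hours that start with 2: the second digit should only go up to 3 in that case.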
What we really need is a way to combine these two regular expressions:
>>> re.search(r'[01]\d:[0-5]\d', "24:00")
>>> re.search(r'2[0-3]:[0-5]\d', "24:00")
We actually can combine regular expressions in Python if we use the | operator to provide alternatives:
>>> re.search(r'[01]\d:[0-5]\d|2[0-3]:[0-5]\d', "24:00")
Character classes allow us to provide multiple options for a single character match.
The | character allows us to give multiple options for a collection of characters.
We can also use | in groups, so we could simplify that regular expression even further:
>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "24:00")
>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "23:00")
<_sre.SRE_Match object; span=(0, 5), match='23:00'>
>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "23:59")
<_sre.SRE_Match object; span=(0, 5), match='23:59'>
>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "23:60")
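The group matters because | has the lowest precedence of any regular expression operator: without parentheses it splits the entire pattern in two. The sketch below shows this and wraps the pattern in a hypothetical is_valid_time helper. It uses a non-capturing group (?:...) and re.fullmatch, both of which are variations on the examples above rather than something the examples require.

```python
import re

# Without a group, ^ applies only to the first alternative
# and $ applies only to the second one.
ungrouped = re.compile(r"^[01]\d|2[0-3]:[0-5]\d$")
print(ungrouped.search("19 o'clock").group())  # 19

# A non-capturing group limits the alternation to the hours.
TIME_RE = re.compile(r"(?:[01]\d|2[0-3]):[0-5]\d")

def is_valid_time(text):
    """Return True if text is exactly a 24 hour time such as 23:59."""
    return TIME_RE.fullmatch(text) is not None

for candidate in ["00:13", "23:59", "24:00", "23:60", "12:345"]:
    print(candidate, is_valid_time(candidate))
```

Only the first two candidates print True. The last one fails because fullmatch requires the whole string to match, not just a portion of it.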
Basic Substitution
Let’s say we have some text that was written by a LaTeX user who uses two backticks (``) and two apostrophe characters ('') to represent left and right double quotes.
>>> sentence = "This string uses ``smart'' quotes."
We want to convert all of these sets of double backticks and double apostrophes to double quote characters.
We could do something like this:
>>> sentence.replace("``", '"').replace("''", '"')
'This string uses "smart" quotes.'
But we can also use regular expressions to accomplish the same task. For this we’ll use the sub function, which stands for “substitution”:
>>> re.sub(r"``|''", '"', sentence)
'This string uses "smart" quotes.'
The sub function takes three arguments:
The regular expression to match
The replacement string
The string to operate on
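Beyond those three arguments, re.sub also accepts an optional count keyword argument that limits the number of replacements, and the related re.subn function reports how many replacements were made. A short sketch, with a made-up sentence:

```python
import re

sentence = "``One'' and ``two''."

# Replace every match (the default, count=0).
print(re.sub(r"``|''", '"', sentence))           # "One" and "two".

# Replace only the first two matches.
print(re.sub(r"``|''", '"', sentence, count=2))  # "One" and ``two''.

# subn returns the new string together with the number of replacements.
result, replacements = re.subn(r"``|''", '"', sentence)
print(replacements)  # 4
```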
Normalization
Let’s look at a task that wouldn’t have been well-suited to a string replacement.
Let’s make a regular expression that removes spaces after any commas.
We can do this by looking for commas with optional whitespace after them and replacing each match with just a comma:
>>> row = "column 1,column 2, column 3"
>>> re.sub(r',\s*', ',', row)
'column 1,column 2,column 3'
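To make the contrast with plain string replacement concrete, here is a small sketch using a made-up row in which the number of spaces after each comma varies:

```python
import re

row = "a,b,   c, d"

# str.replace only removes one space per comma, so extra spaces survive.
print(repr(row.replace(", ", ",")))     # 'a,b,  c,d'

# The regular expression removes any amount of whitespace after a comma.
print(repr(re.sub(r",\s*", ",", row)))  # 'a,b,c,d'
```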
Substitution Exercises
These exercises are all in the substitution.py file in the exercises directory.
Edit the appropriate function in that file to complete each exercise.
To run the tests, from the exercises folder, type python test.py <function_name>, like this:
$ python test.py normalize_jpeg
Normalize JPEG Extension
Make a function that accepts a JPEG filename and returns a new filename with the extension normalized to lowercase jpg (without an e).
Tip
Modify the normalize_jpeg function in the substitution module.
Hint
Look up how to pass flags to the re.sub function.
Example usage:
>>> normalize_jpeg('avatar.jpeg')
'avatar.jpg'
>>> normalize_jpeg('Avatar.JPEG')
'Avatar.jpg'
>>> normalize_jpeg('AVATAR.Jpg')
'AVATAR.jpg'
Normalize Whitespace
Make a function that replaces all instances of one or more whitespace characters with a single space.
Tip
Modify the normalize_whitespace function in the substitution module.
Example usage:
>>> normalize_whitespace("hello  there")
'hello there'
>>> normalize_whitespace("""Hold fast to dreams
... For if dreams die
... Life is a broken-winged bird
... That cannot fly.
...
... Hold fast to dreams
... For when dreams go
... Life is a barren field
... Frozen with snow.""")
'Hold fast to dreams For if dreams die Life is a broken-winged bird That cannot fly. Hold fast to dreams For when dreams go Life is a barren field Frozen with snow.'
Compress blank lines
Write a function that accepts a string and an integer N and compresses runs of N or more consecutive empty lines into just N empty lines.
Tip
Modify the compress_blank_lines function in the substitution module.
Example usage:
>>> compress_blank_lines("a\n\n\nb", max_blanks=1)
'a\n\nb'
>>> compress_blank_lines("a\n\nb", max_blanks=0)
'a\nb'
>>> compress_blank_lines("a\n\nb", max_blanks=2)
'a\n\nb'
>>> compress_blank_lines("a\n\n\n\nb\n\n\nc", max_blanks=2)
'a\n\n\nb\n\n\nc'
Normalize URL
I own the domain treyhunner.com. I prefer to link to my website as https://treyhunner.com, but I have some links that use http or use a www subdomain.
Write a function that normalizes all www.treyhunner.com and treyhunner.com links to use HTTPS and remove the www subdomain.
Tip
Modify the normalize_domain function in the substitution module.
Example usage:
>>> normalize_domain("http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/")
'https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/'
>>> normalize_domain("https://treyhunner.com/2016/02/how-to-merge-dictionaries-in-python/")
'https://treyhunner.com/2016/02/how-to-merge-dictionaries-in-python/'
>>> normalize_domain("http://www.treyhunner.com/2015/11/counting-things-in-python/")
'https://treyhunner.com/2015/11/counting-things-in-python/'
>>> normalize_domain("http://www.treyhunner.com")
'https://treyhunner.com'
>>> normalize_domain("http://trey.in/give-a-talk")
'http://trey.in/give-a-talk'
Linebreaks
Write a function that accepts a string and converts linebreaks to HTML in the following way:
text is surrounded by paragraphs
text with two or more line breaks between is considered two separate paragraphs
text with a single line break between is separated by a <br>
Tip
Modify the convert_linebreaks function in the substitution module.
Example usage:
>>> convert_linebreaks("hello")
'<p>hello</p>'
>>> convert_linebreaks("hello\nthere")
'<p>hello<br>there</p>'
>>> convert_linebreaks("hello\n\nthere")
'<p>hello</p><p>there</p>'
>>> convert_linebreaks("hello\nthere\n\nworld")
'<p>hello<br>there</p><p>world</p>'
Alternation Exercises
These exercises are all in the alternation.py file in the exercises directory.
Edit the appropriate function in that file to complete each exercise.
To run the tests, from the exercises folder, type python test.py <function_name>, like this:
$ python test.py is_number
Decimal Numbers
Write a function to match decimal numbers.
We want to allow an optional - and we want to match numbers with or without one decimal point.
Tip
Modify the is_number function in the alternation module.
Example usage:
>>> is_number("5")
True
>>> is_number("5.")
True
>>> is_number(".5.")
False
>>> is_number(".5")
True
>>> is_number("01.5")
True
>>> is_number("-123.859")
True
>>> is_number("-123.859.")
False
>>> is_number(".")
False
Abbreviate
Make a function that creates an acronym from a phrase.
Tip
Modify the abbreviate function in the alternation module.
Example usage:
>>> abbreviate('Graphics Interchange Format')
'GIF'
>>> abbreviate('frequently asked questions')
'FAQ'
>>> abbreviate('cascading style sheets')
'CSS'
>>> abbreviate('Joint Photographic Experts Group')
'JPEG'
>>> abbreviate('content management system')
'CMS'
>>> abbreviate('JavaScript Object Notation')
'JSON'
>>> abbreviate('HyperText Markup Language')
'HTML'
Hex Colors
Write a function to match hexadecimal color codes. Hex color codes consist of an octothorpe symbol (#) followed by either 3 or 6 hexadecimal digits (that’s 0 to 9 or a to f).
Tip
Modify the is_hex_color function in the alternation module.
Example usage:
>>> is_hex_color("#639")
True
>>> is_hex_color("#6349")
False
>>> is_hex_color("#63459")
False
>>> is_hex_color("#634569")
True
>>> is_hex_color("#663399")
True
>>> is_hex_color("#000000")
True
>>> is_hex_color("#00")
False
>>> is_hex_color("#FFffFF")
True
>>> is_hex_color("#decaff")
True
>>> is_hex_color("#decafz")
False
Valid Date
Create an is_valid_date function that returns True if given a date in YYYY-MM-DD format.
For this exercise we’re more worried about accepting valid dates than we are about excluding invalid dates.
A regular expression is often used as a first wave of validation. Complete validation of dates should be done in our code outside of regular expressions.
Tip
Create this is_valid_date function in the alternation module.
Example usage:
>>> is_valid_date("2016-01-02")
True
>>> is_valid_date("1900-01-01")
True
>>> is_valid_date("2016-02-99")
False
>>> is_valid_date("20-02-20")
False
>>> is_valid_date("1980-30-05")
False