Alternations & Substitutions

Verbose Flag

We’ll look at one more flag right now.

Let’s take a regular expression that validates UUIDs (universally unique identifiers):

import re

def is_valid_uuid(uuid):
    return bool(re.search(r'^[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}$', uuid, re.IGNORECASE))

That’s not easy to read at all.

We could make that more readable by splitting it over multiple lines:

def is_valid_uuid(uuid):
    uuid_regex = (
        r'^[a-f\d]{8}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{12}'
        r'$'
    )
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE))

That helps some, but there are a lot of extra quotes around each of the strings and it’s still kind of terse. It would be cool if there were a way to space out this regular expression in a multi-line string.

The VERBOSE flag does exactly this.

We’re already using the IGNORECASE flag. To use two flags we need to combine them with the pipe operator (|), also known as the bitwise OR operator:

def is_valid_uuid(uuid):
    uuid_regex = (
        r'^[a-f\d]{8}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{12}'
        r'$'
    )
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE | re.VERBOSE))
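To see what combining flags with | actually does, here is a minimal sketch. The names flags and pattern are illustrative and not part of the example above; re.I and re.X are the short aliases the re module provides for IGNORECASE and VERBOSE.

```python
import re

# Flags are bit flags, so | combines them into a single value.
flags = re.IGNORECASE | re.VERBOSE

# Both individual flags are contained in the combined value.
has_ignorecase = bool(flags & re.IGNORECASE)
has_verbose = bool(flags & re.VERBOSE)

# re.I and re.X are short aliases for the same two flags.
same_as_aliases = flags == (re.I | re.X)

# A compiled pattern remembers its flags, so they are passed only once.
pattern = re.compile(r'^[a-f\d]{8}$', flags)
matched = pattern.search('DEADBEEF') is not None
```

Compiling the pattern once like this is one option if the same flags would otherwise be repeated on every call.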

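Now whitespace in our regular expression is ignored, except inside character classes, where a space is still matched literally. That’s why the spaces below go between the pieces and not inside the square brackets:

def is_valid_uuid(uuid):
    uuid_regex = (r'''
        ^
        [a-f\d] {8}
        -
        [a-f\d] {4}
        -
        [a-f\d] {4}
        -
        [a-f\d] {4}
        -
        [a-f\d] {12}
        $
    ''')
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE | re.VERBOSE))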
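The rules for which whitespace verbose mode ignores can be checked with a few small patterns. This is just a sketch; the variable names outside, inside, and escaped are illustrative.

```python
import re

# Whitespace outside a character class is dropped: 'a b' means 'ab'.
outside = re.search(r'a b', 'ab', re.VERBOSE)

# Whitespace inside a character class is kept: '[a b]' still matches a space.
inside = re.search(r'[a b]', ' ', re.VERBOSE)

# A backslash-escaped space also matches a literal space in verbose mode.
escaped = re.search(r'a\ b', 'a b', re.VERBOSE)
```

All three searches find a match, which is why a character class such as [a-f\d] should be written without padding spaces in a verbose pattern.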

We can also add as many comments to our regular expression as we want:

def is_valid_uuid(uuid):
    uuid_regex = (r'''
        ^               # beginning of string
        [a-f\d] {8}     # 8 hexadecimal digits
        -               # dash character
        [a-f\d] {4}     # 4 hexadecimal digits
        -               # dash character
        [a-f\d] {4}     # 4 hexadecimal digits
        -               # dash character
        [a-f\d] {4}     # 4 hexadecimal digits
        -               # dash character
        [a-f\d] {12}    # 12 hexadecimal digits
        $               # end of string
    ''')
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE | re.VERBOSE))

Unfortunately regular expressions, unlike Python, do not have a concept of variables or functions so we often can’t make our regular expressions as self-documenting as our Python code.

If you wanted to reuse parts of your regular expression you could try using string formatting, but the curly braces ({ and }) in your regular expression will need to be doubled up in order to escape them when using string formatting:

def is_valid_uuid(uuid):
    hex_re = r'[a-f\d]'  # no spaces: whitespace inside a character class is not ignored
    uuid_re = rf'''
        ^               # beginning of string
        {hex_re} {{8}}  # 8 hexadecimal digits
        -               # dash character
        {hex_re} {{4}}  # 4 hexadecimal digits
        -               # dash character
        {hex_re} {{4}}  # 4 hexadecimal digits
        -               # dash character
        {hex_re} {{4}}  # 4 hexadecimal digits
        -               # dash character
        {hex_re} {{12}} # 12 hexadecimal digits
        $               # end of string
    '''
    return bool(re.search(uuid_re, uuid, re.IGNORECASE | re.VERBOSE))
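To see what the doubled braces turn into, the sketch below formats a single piece of the pattern. The name piece is illustrative, and hex_re is written here without padding spaces.

```python
import re

hex_re = r'[a-f\d]'

# In an f-string, {{ and }} produce literal braces,
# while {hex_re} is replaced by the variable's value.
piece = rf'{hex_re}{{8}}'
# piece is now the string '[a-f\d]{8}'

# The formatted piece behaves like the hand-written pattern.
is_match = re.fullmatch(piece, 'DEADBEEF', re.IGNORECASE) is not None
```

Checking the formatted string directly like this can be a quick way to confirm that the braces were escaped as intended.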

Alternations

Let’s write a regular expression that matches valid 24 hour times.

We could start by matching two digits followed by a colon and two more digits:

>>> re.search(r'\d{2}:\d{2}', "00:13")
<_sre.SRE_Match object; span=(0, 5), match='00:13'>

This does match all valid 24 hour times but it also matches a lot of invalid times:

>>> re.search(r'\d{2}:\d{2}', "24:13")
<_sre.SRE_Match object; span=(0, 5), match='24:13'>
>>> re.search(r'\d{2}:\d{2}', "24:60")
<_sre.SRE_Match object; span=(0, 5), match='24:60'>

Let’s fix the minutes first. We only want to match minutes where the first digit is 0 through 5. The second digit can be anything:

>>> re.search(r'\d{2}:[0-5]\d', "24:60")
>>> re.search(r'\d{2}:[0-5]\d', "24:13")
<_sre.SRE_Match object; span=(0, 5), match='24:13'>

Now let’s try fixing the hours. We definitely want to narrow our hours down to starting with 0, 1, or 2. This isn’t quite enough though:

>>> re.search(r'[0-2]\d:[0-5]\d', "23:00")
<_sre.SRE_Match object; span=(0, 5), match='23:00'>
>>> re.search(r'[0-2]\d:[0-5]\d', "33:00")
>>> re.search(r'[0-2]\d:[0-5]\d', "24:00")
<_sre.SRE_Match object; span=(0, 5), match='24:00'>

What we really need is a way to combine these two regular expressions:

>>> re.search(r'[01]\d:[0-5]\d', "24:00")
>>> re.search(r'2[0-3]:[0-5]\d', "24:00")

We actually can combine regular expressions in Python if we use the | operator to provide alternatives:

>>> re.search(r'[01]\d:[0-5]\d|2[0-3]:[0-5]\d', "24:00")

Character classes allow us to provide multiple options for a single character match.

The | character allows us to give multiple options for a collection of characters.
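A small side-by-side sketch of that difference, using made-up sample text (the words and variable names here are illustrative only):

```python
import re

# A character class picks one character from a set.
class_matches = re.findall(r'[cb]at', 'cat bat rat')
# class_matches is ['cat', 'bat']

# Alternation picks between whole sub-patterns, which may differ in length.
alt_matches = re.findall(r'cat|bird', 'cat bird rat')
# alt_matches is ['cat', 'bird']
```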

We can also use | in groups, so we could simplify that regular expression even further:

>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "24:00")
>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "23:00")
<_sre.SRE_Match object; span=(0, 5), match='23:00'>
>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "23:59")
<_sre.SRE_Match object; span=(0, 5), match='23:59'>
>>> re.search(r'([01]\d|2[0-3]):[0-5]\d', "23:60")

Basic Substitution

Let’s say we have some text that was written by a LaTeX user who uses two backticks `` and two apostrophe characters to represent left and right double quotes.

>>> sentence = "This string uses ``smart'' quotes."

We want to convert all of these sets of double backticks and double apostrophes to double quote characters.

We could do something like this:

>>> sentence.replace("``", '"').replace("''", '"')
'This string uses "smart" quotes.'

But we can also use regular expressions to accomplish the same task. For this we’ll use the sub function which stands for “substitution”:

>>> re.sub(r"``|''", '"', sentence)
'This string uses "smart" quotes.'

The sub function takes three required arguments:

The regular expression to match
The replacement string
The string to operate on
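Beyond those three, re.sub also accepts an optional count argument, and the related re.subn function reports how many replacements were made. Neither is used elsewhere in this section; the sketch below only illustrates them with the same sentence, and the variable names are illustrative.

```python
import re

sentence = "This string uses ``smart'' quotes."

# count limits how many matches are replaced (0, the default, means all).
first_only = re.sub(r"``|''", '"', sentence, count=1)

# subn works like sub but also reports how many replacements were made.
replaced, how_many = re.subn(r"``|''", '"', sentence)
```

Here first_only still contains the two trailing apostrophes, and how_many is 2.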

Normalization

Let’s look at a task that wouldn’t have been well-suited to a string replacement.

Let’s make a regular expression that removes spaces after any commas.

We can do this by looking for commas with optional whitespace after them and replacing each match with just a comma:

>>> row = "column 1,column 2, column 3"
>>> re.sub(r',\s*', ',', row)
'column 1,column 2,column 3'
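If the input might also have whitespace before the commas, the same idea extends naturally. This is a sketch under that assumption; the sample row below is made up for illustration.

```python
import re

row = "column 1 , column 2,   column 3 ,column 4"

# \s* on both sides removes whitespace before and after each comma.
normalized = re.sub(r'\s*,\s*', ',', row)
# normalized is 'column 1,column 2,column 3,column 4'
```

A chain of str.replace calls would need one call per possible spacing, which is why a pattern with \s* fits this kind of task better.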

Substitution Exercises

These exercises are all in the substitution.py file in the exercises directory. Edit the appropriate function in that file to complete each exercise. To run the tests, from the exercises folder, type python test.py <function_name>, like this:

$ python test.py normalize_jpeg

Normalize JPEG Extension

Make a function that accepts a JPEG filename and returns a new filename with jpg lowercased without an e.

Tip

Modify the normalize_jpeg function in the substitution module.

Hint

Look up how to pass flags to the re.sub function.

Example usage:

>>> normalize_jpeg('avatar.jpeg')
'avatar.jpg'
>>> normalize_jpeg('Avatar.JPEG')
'Avatar.jpg'
>>> normalize_jpeg('AVATAR.Jpg')
'AVATAR.jpg'

Normalize Whitespace

Make a function that replaces all instances of one or more whitespace characters with a single space.

Tip

Modify the normalize_whitespace function in the substitution module.

Example usage:

>>> normalize_whitespace("hello  there")
'hello there'
>>> normalize_whitespace("""Hold fast to dreams
... For if dreams die
... Life is a broken-winged bird
... That cannot fly.
...
... Hold fast to dreams
... For when dreams go
... Life is a barren field
... Frozen with snow.""")
'Hold fast to dreams For if dreams die Life is a broken-winged bird That cannot fly. Hold fast to dreams For when dreams go Life is a barren field Frozen with snow.'

Compress blank lines

Write a function that accepts a string and an integer N (the max_blanks argument) and compresses runs of more than N consecutive empty lines into just N empty lines.

Tip

Modify the compress_blank_lines function in the substitution module.

Example usage:

>>> compress_blank_lines("a\n\n\nb", max_blanks=1)
'a\n\nb'
>>> compress_blank_lines("a\n\nb", max_blanks=0)
'a\nb'
>>> compress_blank_lines("a\n\nb", max_blanks=2)
'a\n\nb'
>>> compress_blank_lines("a\n\n\n\nb\n\n\nc", max_blanks=2)
'a\n\n\nb\n\n\nc'

Normalize URL

I own the domain treyhunner.com. I prefer to link to my website as https://treyhunner.com, but I have some links that use http or use a www subdomain.

Write a function that normalizes all www.treyhunner.com and treyhunner.com links to use HTTPS and remove the www subdomain.

Tip

Modify the normalize_domain function in the substitution module.

Example usage:

>>> normalize_domain("http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/")
'https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/'
>>> normalize_domain("https://treyhunner.com/2016/02/how-to-merge-dictionaries-in-python/")
'https://treyhunner.com/2016/02/how-to-merge-dictionaries-in-python/'
>>> normalize_domain("http://www.treyhunner.com/2015/11/counting-things-in-python/")
'https://treyhunner.com/2015/11/counting-things-in-python/'
>>> normalize_domain("http://www.treyhunner.com")
'https://treyhunner.com'
>>> normalize_domain("http://trey.in/give-a-talk")
'http://trey.in/give-a-talk'

Linebreaks

Write a function that accepts a string and converts linebreaks to HTML in the following way:

text is surrounded by paragraphs
text with two or more line breaks between is considered two separate paragraphs
text with a single line break between is separated by a <br>

Tip

Modify the convert_linebreaks function in the substitution module.

Example usage:

>>> convert_linebreaks("hello")
'<p>hello</p>'
>>> convert_linebreaks("hello\nthere")
'<p>hello<br>there</p>'
>>> convert_linebreaks("hello\n\nthere")
'<p>hello</p><p>there</p>'
>>> convert_linebreaks("hello\nthere\n\nworld")
'<p>hello<br>there</p><p>world</p>'

Alternation Exercises

These exercises are all in the alternation.py file in the exercises directory. Edit the appropriate function in that file to complete each exercise. To run the tests, from the exercises folder, type python test.py <function_name>, like this:

$ python test.py is_number

Decimal Numbers

Write a function to match decimal numbers.

We want to allow an optional - and we want to match numbers with or without one decimal point.

Tip

Modify the is_number function in the alternation module.

Example usage:

>>> is_number("5")
True
>>> is_number("5.")
True
>>> is_number(".5.")
False
>>> is_number(".5")
True
>>> is_number("01.5")
True
>>> is_number("-123.859")
True
>>> is_number("-123.859.")
False
>>> is_number(".")
False

Abbreviate

Make a function that creates an acronym from a phrase.

Tip

Modify the abbreviate function in the alternation module.

Example usage:

>>> abbreviate('Graphics Interchange Format')
'GIF'
>>> abbreviate('frequently asked questions')
'FAQ'
>>> abbreviate('cascading style sheets')
'CSS'
>>> abbreviate('Joint Photographic Experts Group')
'JPEG'
>>> abbreviate('content management system')
'CMS'
>>> abbreviate('JavaScript Object Notation')
'JSON'
>>> abbreviate('HyperText Markup Language')
'HTML'

Hex Colors

Write a function to match hexadecimal color codes. Hex color codes consist of an octothorpe symbol followed by either 3 or 6 hexadecimal digits (that’s 0 to 9 or a to f).

Tip

Modify the is_hex_color function in the alternation module.

Example usage:

>>> is_hex_color("#639")
True
>>> is_hex_color("#6349")
False
>>> is_hex_color("#63459")
False
>>> is_hex_color("#634569")
True
>>> is_hex_color("#663399")
True
>>> is_hex_color("#000000")
True
>>> is_hex_color("#00")
False
>>> is_hex_color("#FFffFF")
True
>>> is_hex_color("#decaff")
True
>>> is_hex_color("#decafz")
False

Valid Date

Create an is_valid_date function that returns True if given a date in YYYY-MM-DD format.

For this exercise we’re more worried about accepting valid dates than we are about excluding invalid dates.

A regular expression is often used as a first wave of validation. Complete validation of dates should be done in our code outside of regular expressions.

Tip

Create this is_valid_date function in the alternation module.

Example usage:

>>> is_valid_date("2016-01-02")
True
>>> is_valid_date("1900-01-01")
True
>>> is_valid_date("2016-02-99")
False
>>> is_valid_date("20-02-20")
False
>>> is_valid_date("1980-30-05")
False
