Advanced Features

Using Captures in Substitutions

Let’s say we’re working with a document that was created in the US using MM/DD/YYYY format and we want to convert it to YYYY-MM-DD.

This isn’t just a simple replacement of / with - because the order of the numbers changes.

We can solve this by referencing our capturing groups in substitutions. Each group can be referenced with a backslash and the group number (\N).

>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', sentence)
'from 1629-12-22 to 1643-11-14'

These references to capture groups are called back-references.
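
One wrinkle with numbered back-references in a replacement string: if a literal digit has to come right after the group, something like \10 would be read as a reference to group 10. The re module also accepts \g<N> in replacement strings, which avoids the ambiguity. Here is a minimal sketch; the sample string and variable names are made up for illustration:

```python
import re

prices = "item 5, item 12"

# r'\10' would be treated as a reference to group 10 (which doesn't exist
# here), so we write \g<1> to reference group 1 and then add a literal 0.
times_ten = re.sub(r'(\d+)', r'\g<1>0', prices)
print(times_ten)  # item 50, item 120
```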

Using Captures

We can actually use back-references in the regular expression pattern also. Let’s look at an example.

Let’s modify our quotation matcher from earlier to search for either double- or single-quoted strings. Let’s try doing it this way:

>>> re.findall(r'["\'](.*?)["\']', "she said 'not really'")
['not really']

This would also match mismatched quotes though, such as a double quote paired with an apostrophe:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'["\'](.*?)["\']', sentence)
['why?', 'I don']

We need the end quote to be the same as the beginning quote. We can do this with a backreference:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'(["\'])(.*?)\1', sentence)
[('"', 'why?'), ('"', "I don't know")]

The result from this isn’t exactly what we wanted. We’re getting both the quote character and the matching quotation.

Unfortunately we can’t make that first group that matches our quote non-capturing because we need to reference it later in our pattern.

We can retrieve just the quotation by using a list comprehension with findall:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> matches = re.findall(r'(["\'])(.*?)\1', sentence)
>>> [q for _, q in matches]
['why?', "I don't know"]

We could instead use a list comprehension with finditer:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> matches = re.finditer(r'(["\'])(.*?)\1', sentence)
>>> [m.group(2) for m in matches]
['why?', "I don't know"]

Note

We could have also used zip, though this unpacking raises a ValueError when there are no matches:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> _, matches = zip(*re.findall(r'(["\'])(.*?)\1', sentence))
>>> matches
('why?', "I don't know")

Split

Let’s say we have a string of values that are delimited by commas with optional spaces after them for readability. For example:

>>> row = "column 1,column 2, column 3"

We could do something like this to match values followed by a comma and zero or more spaces:

>>> row = "column 1,column 2, column 3"
>>> re.findall(r'(.*),\s*', row)
['column 1,column 2']

That doesn’t work because that . is matching everything including commas. Let’s match non-commas:

>>> re.findall(r'([^,]*),\s*', row)
['column 1', 'column 2']

This doesn’t match column 3 because there’s no comma after it. We can use alternation to match either a comma followed by optional spaces or the end of the string, which gives us pretty much what we want (apart from a trailing empty string):

>>> re.findall(r'([^,]*)(?:,\s*|$)', row)
['column 1', 'column 2', 'column 3', '']

But there’s a simpler way to do this.

Python’s re module has a split function we can use to split a string based on a delimiter specified by a regular expression:

>>> re.split(r',\s*', row)
['column 1', 'column 2', 'column 3']

That’s a lot easier to read.

Note that this is different from regular string splitting: str.split treats its argument as literal text, while re.split treats it as a regular expression:

>>> row.split(', ')
['column 1,column 2', 'column 3']
>>> row.split(',')
['column 1', 'column 2', ' column 3']
>>> row.split(r',\s*')
['column 1,column 2, column 3']
>>> re.split(r',\s*', row)
['column 1', 'column 2', 'column 3']
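
Two other behaviors of re.split are worth knowing about. The sketch below reuses the row string from above; the variable names first, rest, and with_delimiters are just illustrative:

```python
import re

row = "column 1,column 2, column 3"

# maxsplit limits how many splits are made (here: only at the first comma)
first, rest = re.split(r',\s*', row, maxsplit=1)

# Wrapping the delimiter in a capture group keeps the delimiters in the result
with_delimiters = re.split(r'(,\s*)', row)

print(first)            # column 1
print(rest)             # column 2, column 3
print(with_delimiters)  # ['column 1', ',', 'column 2', ', ', 'column 3']
```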

Compiled

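Passing the same pattern string to the re functions again and again makes Python look the pattern up on every call (the re module caches recently compiled patterns, so the cost is usually small), and repeating the pattern inline can also encourage unreadable code.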

Python’s re module has a compile function that allows us to compile a regular expression for later use.

We could use it for searching, even searching multiple times:

>>> TIME_RE = re.compile(r'^([01]\d|2[0-3]):[0-5]\d$')
>>> TIME_RE.search("00:00")
<_sre.SRE_Match object; span=(0, 5), match='00:00'>
>>> TIME_RE.search("00:90")
>>> TIME_RE.search("23:59")
<_sre.SRE_Match object; span=(0, 5), match='23:59'>
>>> TIME_RE.search("29:00")

We can also use it for splitting:

>>> row = "column 1,column 2, column 3"
>>> COMMA_RE = re.compile(r',\s*')
>>> COMMA_RE.split(row)
['column 1', 'column 2', 'column 3']

The object returned from re.compile represents a compiled regular expression pattern:

>>> TIME_RE
re.compile('^([01]\\d|2[0-3]):[0-5]\\d$')
>>> COMMA_RE
re.compile(',\\s*')
>>> type(TIME_RE)
<class '_sre.SRE_Pattern'>

Pretty much all of the regular expression functions in the re module have an equivalent method on this compiled regular expression object.
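
As a rough sketch of how those methods can be combined, here is a variation on TIME_RE without the ^ and $ anchors, so the same compiled pattern can both validate a whole string (with fullmatch) and find times inside a longer string (with finditer). The is_valid_time helper and the log string are made up for this example:

```python
import re

# Same time pattern as before, minus the ^ and $ anchors
TIME_RE = re.compile(r'([01]\d|2[0-3]):[0-5]\d')


def is_valid_time(text):
    """Return True if the whole string is a 24-hour HH:MM time."""
    # fullmatch only succeeds when the entire string matches the pattern
    return TIME_RE.fullmatch(text) is not None


log = "start 09:15, lunch 12:30, end 25:99"

# finditer gives match objects, so group() returns each whole match
# (findall would return only the contents of the hour group)
times = [m.group() for m in TIME_RE.finditer(log)]

print(is_valid_time("23:59"))  # True
print(is_valid_time("29:00"))  # False
print(times)                   # ['09:15', '12:30']
```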

Greediness

What if we want to match all quoted phrases in a string?

We could do something like this:

>>> re.search(r'".*"', 'Maya Angelou said "nothing will work unless you do"')
<_sre.SRE_Match object; span=(18, 51), match='"nothing will work unless you do"'>

This works but it would match too much when there are multiple quoted phrases.

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'"(.*)"', sentence)
['why?" and I say "I don\'t know']

The problem is that repetition in regular expressions is greedy by default.

Whenever we use the *, +, ?, or {n,m} operators to repeat something, the regular expression engine will try to repeat it as many times as possible, and it will backtrack to fewer repetitions only when the rest of the pattern fails to match.

For example:

>>> re.findall('hi*', 'hiiiii')
['hiiiii']
>>> re.findall('hi?', 'hiiiii')
['hi']
>>> re.findall('hi+', 'hiiiii')
['hiiiii']
>>> re.findall('hi{2,}', 'hiiiii')
['hiiiii']
>>> re.findall('hi{1,3}', 'hiiiii')
['hiii']

We can make each of these operators non-greedy by putting a question mark after it:

>>> re.findall('hi*?', 'hiiiii')
['h']
>>> re.findall('hi??', 'hiiiii')
['h']
>>> re.findall('hi+?', 'hiiiii')
['hi']
>>> re.findall('hi{2,}?', 'hiiiii')
['hii']
>>> re.findall('hi{1,3}?', 'hiiiii')
['hi']

That ? might seem a little confusing since we already use a ? to match something 0 or 1 times. This ? is different though: we’re using it to modify these repetitions to be non-greedy so they match as few times as possible.

Let’s use a non-greedy pattern to match only until the next quote character:

>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'"(.*?)"', sentence)
['why?', "I don't know"]
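
A non-greedy quantifier isn't the only way to stop at the next quote. Another common approach is a negated character class that matches anything except a quote character. The comparison below is only a sketch, and the multiline string is a made-up example; one difference is that . doesn't match newlines by default, while [^"] does:

```python
import re

sentence = """You said "why?" and I say "I don't know"."""

# Non-greedy: match as few characters as possible up to the next quote
lazy = re.findall(r'"(.*?)"', sentence)

# Negated character class: match only characters that are not a quote
negated = re.findall(r'"([^"]*)"', sentence)

print(lazy == negated)  # True

# The two differ when a quotation spans more than one line
multiline = 'a "two\nlines" b'
lazy_multiline = re.findall(r'"(.*?)"', multiline)
negated_multiline = re.findall(r'"([^"]*)"', multiline)

print(lazy_multiline)     # []
print(negated_multiline)  # ['two\nlines']
```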

Named Capture Groups

Capture groups are neat but sometimes it can be a little confusing figuring out what the group numbers are.

Sometimes it’s also a little confusing when you’re switching around numeric backreferences and trying to figure out which one is which.

Named capture groups can help us here.

Let’s use these on our date substitution:

>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> re.sub(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', r'\g<year>-\g<month>-\g<day>', sentence)
'from 1629-12-22 to 1643-11-14'

That syntax is a little weird. The ?P after the opening parenthesis allows us to specify a group name in angle brackets (< … >). That group name can be referenced later in the replacement string using \g and angle brackets.

We can also use named groups without substitutions.

>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> m = re.search(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', sentence)
>>> m.groups()
('12', '22', '1629')
>>> m.groupdict()
{'day': '22', 'month': '12', 'year': '1629'}

The groups act just like before, but we can also use groupdict to get a dictionary containing the named groups.

Unfortunately, re.findall doesn’t act any differently with named groups:

>>> re.findall(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', sentence)
[('12', '22', '1629'), ('11', '14', '1643')]

We could use re.finditer to get match objects and use groupdict to get the dictionary for each one though:

>>> matches = re.finditer(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})', sentence)
>>> [m.groupdict() for m in matches]
[{'day': '22', 'month': '12', 'year': '1629'}, {'day': '14', 'month': '11', 'year': '1643'}]
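
Named groups can also be referenced inside the pattern itself with (?P=name), and match objects accept group names in their group method. As a sketch, here is the quotation matcher from earlier rewritten with named groups; the group names quote and text, and the QUOTED_RE constant, are choices made for this example:

```python
import re

sentence = """You said "why?" and I say "I don't know"."""

# (?P=quote) matches whatever text the group named "quote" matched,
# just like \1 did in the numbered version
QUOTED_RE = re.compile(r'(?P<quote>["\'])(?P<text>.*?)(?P=quote)')

# group() accepts a group name as well as a group number
quotations = [m.group('text') for m in QUOTED_RE.finditer(sentence)]

print(quotations)  # ['why?', "I don't know"]
```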

Substitution Functions

What if we want to allow our month/day/year substitution to support 2 digit years?

As humans we are pretty good at knowing how to do this conversion, but in code we need some conditional logic to decide which century a 2 digit year belongs to.

The sub function actually allows us to specify a function instead of a replacement string. If a function is specified, it’ll be called to create the replacement string for each match.

import re

def replace_date(match):
    """Return the matched MM/DD/YY or MM/DD/YYYY date as YYYY-MM-DD."""
    month, day, year = match.groups()
    if len(year) == 2:
        # Treat 2 digit years 00-49 as 20xx and 50-99 as 19xx
        century = '20' if year < '50' else '19'
        year = century + year
    return '-'.join((year, month, day))

DATE_RE = re.compile(r'\b(\d{2})/(\d{2})/(\d{2}|\d{4})\b')

We can now test this out like this:

>>> sentence = "from 12/22/1629 to 11/14/1643"
>>> DATE_RE.sub(replace_date, sentence)
'from 1629-12-22 to 1643-11-14'
>>> DATE_RE.sub(replace_date, "Nevermind (09/24/91) and Lemonade (04/23/16)")
'Nevermind (1991-09-24) and Lemonade (2016-04-23)'

Substitutions don’t usually need functions, but if you need to do a complex substitution it can come in handy.
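
Replacement functions are also handy when the replacement has to be computed from the match. A small sketch, with a made-up double_number function and sample string, that doubles every number in a string:

```python
import re


def double_number(match):
    """Return the matched number doubled, as a string."""
    # match.group() with no arguments returns the whole matched text
    return str(int(match.group()) * 2)


recipe = "3 eggs and 10 grams of salt"
doubled = re.sub(r'\d+', double_number, recipe)

print(doubled)  # 6 eggs and 20 grams of salt
```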

Lookahead

Let’s make a regular expression that finds all words that appear more than once in a string.

For our purposes, we’ll treat a word as one or more “word” characters surrounded by word breaks:

>>> sentence = "Oh what a day, what a lovely day!"
>>> re.findall(r'\b\w+\b', sentence)
['Oh', 'what', 'a', 'day', 'what', 'a', 'lovely', 'day']

To find words that appear twice we could try doing this:

>>> re.findall(r'\b(\w+)\b.*\b\1\b', sentence)
['what']

That finds “what” but it doesn’t find “a” or “day”. The reason is that this match consumes everything from the first “what” through the second “what”, including the first “a” and the first “day”.

findall only runs through a string one time when searching: each new match starts where the previous one ended.

We need a way to find out that the word occurs a second time without actually consuming any more characters. For this we can use a lookahead.

>>> re.findall(r'\b(\w+)\b(?=.*\b\1\b)', sentence)
['what', 'a', 'day']

We’ve used a positive lookahead here. That means that it’ll match successfully if our word is followed by any characters as well as itself later on. The (?=...) doesn’t actually consume any characters though. Let’s talk about what that means.

When we match a character, we consume it, meaning the search for the next match resumes after that character. Here we can see that finding characters followed by x actually consumes the x as well:

>>> re.findall(r'(.)x', 'axxx')
['a', 'x']

So this repeatedly matches any character followed by the letter x. Because both characters are consumed, the first match uses up “ax” and the second uses up “xx”. The first x is also followed by an x, but it is never reported because the first match already consumed it.

If we use a lookahead for the letter x, it won’t be consumed so we’ll properly be matching each letter followed by an x (including other x’s) this way:

>>> re.findall(r'(.)(?=x)', 'axxx')
['a', 'x', 'x']

Note that anchors like ^, $, and \b do not consume characters either.
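
Because a lookahead consumes nothing, it can also be used to find overlapping matches, and a capture group inside the lookahead still records what was seen. The two helpers below are hypothetical names for this sketch. The second one wraps the repeated-word pattern from above and removes duplicates, since a word that appears three times would otherwise be reported twice:

```python
import re


def overlapping_pairs(text):
    """Return every pair of adjacent word characters, including overlaps."""
    # The match itself is empty; the group inside the lookahead captures
    # two word characters without consuming them.
    return re.findall(r'(?=(\w\w))', text)


def repeated_words(text):
    """Return each word that occurs again later in text, without duplicates."""
    found = re.findall(r'\b(\w+)\b(?=.*\b\1\b)', text)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(found))


print(overlapping_pairs("abcd"))    # ['ab', 'bc', 'cd']
print(repeated_words("a b a b a"))  # ['a', 'b']
```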

Negative Lookahead

What if we want to write a regular expression that makes sure our string contains at least two different letters?

>>> re.search(r'[a-z].*[a-z]', 'aa', re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 2), match='aa'>

That doesn’t work because it doesn’t make sure the letters are different.

We need some way to tell the regular expression engine that the second letter should not be the same as the first.

We already know how to write a regular expression that makes sure the two letters are the same:

>>> re.search(r'([a-z]).*\1', 'aa', re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 2), match='aa'>
>>> re.search(r'([a-z]).*\1', 'a a', re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 3), match='a a'>
>>> re.search(r'([a-z]).*\1', 'a b', re.IGNORECASE)

We can use a negative lookahead to make sure the two letters found are different.

>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'aa', re.IGNORECASE)
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'ab', re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 2), match='ab'>
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'a b', re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 3), match='a b'>
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'a a', re.IGNORECASE)
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'a ab', re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 4), match='a ab'>
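
To sanity check a pattern like this, it can help to compare it against a plain Python version of the same rule. This is only a sketch that assumes ASCII text; both function names and the DIFFERENT_LETTERS_RE constant are made up for this example:

```python
import re
import string

DIFFERENT_LETTERS_RE = re.compile(r'([a-z]).*(?!\1)[a-z]', re.IGNORECASE)


def has_two_different_letters(text):
    """Regex version: a letter, then later a letter that differs from it."""
    return DIFFERENT_LETTERS_RE.search(text) is not None


def has_two_different_letters_plain(text):
    """Plain version: collect the distinct ASCII letters, ignoring case."""
    letters = {c.lower() for c in text if c in string.ascii_letters}
    return len(letters) >= 2


# The two versions should agree on the examples used above
for sample in ['aa', 'ab', 'a b', 'a a', 'a ab']:
    print(sample, has_two_different_letters(sample),
          has_two_different_letters_plain(sample))
```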
