Grouping

So far our regular expressions have consisted solely of commands that match individual letters or allow repetition of individual letter matches.

What if we want to act on a group?

For example, what if we want to match US ZIP codes in either their shortened form or their full form?

We’ve already matched shortened ZIP codes:

>>> re.search(r'^\d{5}$', '90210')
<_sre.SRE_Match object; span=(0, 5), match='90210'>

A full ZIP code match looks like this:

>>> re.search(r'^\d{5}-\d{4}$', '90210-4873')
<_sre.SRE_Match object; span=(0, 10), match='90210-4873'>

So far we haven’t seen a way to make that last part optional.

We could try putting a question mark after the - and the repetition:

>>> re.search(r'^\d{5}-?\d{4}?$', '90210-4873')
<_sre.SRE_Match object; span=(0, 10), match='90210-4873'>
>>> re.search(r'^\d{5}-?\d{4}?$', '902104873')
<_sre.SRE_Match object; span=(0, 9), match='902104873'>
>>> re.search(r'^\d{5}-?\d{4}?$', '90210-')

That matches strange things though, like nine digits with no dash (also what’s up with that ? after the repetition count?).
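One way to see what that trailing ? is doing: in Python's re module, a ? placed directly after a quantifier such as {4} makes the quantifier lazy (non-greedy) rather than optional, so \d{4}? still requires exactly four digits. The quick check below is only a sketch; the names attempt, candidates, and results are made up for illustration.

```python
import re

# The attempted pattern from above. "-?" really is optional, but "\d{4}?"
# is not: a "?" right after a quantifier makes it lazy, and a lazy {4}
# still needs exactly four digits.
attempt = re.compile(r'^\d{5}-?\d{4}?$')

candidates = ['90210', '90210-4873', '902104873', '90210-']
results = {text: bool(attempt.search(text)) for text in candidates}

for text, matched in results.items():
    print(f'{text!r}: {matched}')
```

The plain 5-digit form is rejected while the dashless 9-digit form is accepted, which is the opposite of what we wanted.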

To optionally match a number of consecutive character patterns, we can use a group:

>>> re.search(r'^\d{5}(-\d{4})?$', '90210-4873')
<_sre.SRE_Match object; span=(0, 10), match='90210-4873'>
>>> re.search(r'^\d{5}(-\d{4})?$', '90210')
<_sre.SRE_Match object; span=(0, 5), match='90210'>
>>> re.search(r'^\d{5}(-\d{4})?$', '902104873')
>>> re.search(r'^\d{5}(-\d{4})?$', '90210-')

This allows us to match 5 digits followed optionally by a dash and 4 digits. The dash and the 4 digits are quantified as a unit: either both are present or neither is.
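If we wanted to reuse this check, we could wrap the grouped pattern in a small helper. This is a minimal sketch; the names ZIP_PATTERN and is_zip_code are hypothetical and not part of the workshop files.

```python
import re

# Same pattern as above, compiled once so it can be reused.
ZIP_PATTERN = re.compile(r'^\d{5}(-\d{4})?$')


def is_zip_code(text):
    """Return True for a 5-digit ZIP code or a full ZIP+4 code."""
    return ZIP_PATTERN.search(text) is not None


for candidate in ['90210', '90210-4873', '902104873', '90210-']:
    print(candidate, is_zip_code(candidate))
```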

Capture Groups

We’ve already talked about using groups to allow for quantifying a group of character patterns.

There’s actually another purpose for groups though.

Groups also allow capturing characters matched by a group.

Remember how we used the group method to access the matched data? We can pass arguments to that method to access captured groups.

For example, in our ZIP code regular expression, we can get the contents of the first capture group like this:

>>> m = re.search(r'^\d{5}(-\d{4})?$', '90210-4873')
>>> m.group(1)
'-4873'

>>> m.group()
'90210-4873'

If we want to always access just the first 5 digits, we could put those in a group:

>>> m = re.search(r'(^\d{5})(-\d{4})?$', '90210-4873')
>>> m.group(2)
'-4873'
>>> m.group(1)
'90210'

Note that if we access the 0 group that will give us the entire match, just like when we pass no arguments:

>>> m.group(0)
'90210-4873'
>>> m.group()
'90210-4873'
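One detail worth knowing when a capture group is optional: if the group did not take part in the match, group returns None for it rather than an empty string. The sketch below shows this with the two-group pattern from above; the helper name split_zip is an assumption made for illustration.

```python
import re

# Two capture groups: the first 5 digits, and the optional "-1234" part.
ZIP_PATTERN = re.compile(r'(^\d{5})(-\d{4})?$')


def split_zip(text):
    """Return (first_five, extension), or None if text is not a ZIP code.

    extension is None when the optional group did not participate.
    """
    m = ZIP_PATTERN.search(text)
    if m is None:
        return None
    return m.group(1), m.group(2)


print(split_zip('90210-4873'))  # ('90210', '-4873')
print(split_zip('90210'))       # ('90210', None)
print(split_zip('9021'))        # None
```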

Uncapturing Groups

There’s an important caveat to be aware of when using findall. If we have capturing groups in our expression, the full match won’t be returned:

>>> re.findall(r'\d{5}(-\d{4})?', '90210-4873\n12345')
['-4873', '']

If there is a single capturing group, the findall function returns a list of the contents of that group, with an empty string for any match where the group did not participate.

If there are multiple capturing groups, the findall function returns a list of tuples, one per match, each holding the contents of all the groups:

>>> re.findall(r'(\d{5})(-\d{4})?', '90210-4873\n12345')
[('90210', '-4873'), ('12345', '')]
>>> re.findall(r'(\d{5}(-\d{4})?)', '90210-4873\n12345')
[('90210-4873', '-4873'), ('12345', '')]
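If we want to keep the capture group and still get the full match, one option is re.finditer, which yields match objects instead of group contents. A small sketch, with variable names chosen only for illustration:

```python
import re

text = '90210-4873\n12345'
pattern = re.compile(r'\d{5}(-\d{4})?')

# finditer yields match objects, so both the full match and the
# individual groups are available for every match.
full_matches = [m.group() for m in pattern.finditer(text)]
extensions = [m.group(1) for m in pattern.finditer(text)]

print(full_matches)  # ['90210-4873', '12345']
print(extensions)    # ['-4873', None]
```

Note the difference from findall: a group that did not participate shows up as None on a match object, where findall reports it as an empty string.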

Remember that groups have two purposes:

  1. Grouping patterns

  2. Capturing the strings that were matched by the group

What if we want to use parentheses to make a group, but we don’t want to capture in our group?

We can use an uncapturing group (more commonly called a non-capturing group) for this.

Uncapturing groups have a really weird syntax.

You’ll probably want to refer to the cheat sheet when you realize you need to use them. To make an uncapturing group you put ?: (a question mark and a colon) after the opening parenthesis for the group.

>>> re.findall(r'\d{5}(?:-\d{4})?', '90210-4873\n12345')
['90210-4873', '12345']

The reason this syntax is so weird is that the creators wanted to maintain backwards compatibility and this particular syntax was invalid in regular expression parsers up to that point (? makes no sense after ( normally).
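The two kinds of group can also be mixed. In the sketch below (an illustration, not part of the workshop files), the (?:...) group makes the dash-plus-digits part optional without capturing it, while a capturing group inside it picks out just the four digits, leaving the dash behind.

```python
import re

# (?:...) groups "-1234" so that "?" applies to the whole thing;
# the inner (\d{4}) captures only the digits.
pattern = re.compile(r'(\d{5})(?:-(\d{4}))?')

print(pattern.groups)  # 2 (the non-capturing group is not counted)

found = pattern.findall('90210-4873\n12345')
print(found)  # [('90210', '4873'), ('12345', '')]
```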

Hopefully you won’t need this feature often.

More Search Exercises

These exercises are all in the grouping.py file in the exercises directory. Edit the appropriate function in that file to complete each exercise. To run the tests, from the exercises folder, type python test.py <function_name>, like this:

$ python test.py get_extension

Note

Most of these exercises involve searching in a dictionary. You can find the contents of this dictionary file in the dictionary variable within the grouping module.

Get File Extension

Make a function that accepts a full file path and returns the file extension.

Tip

Modify the get_extension function in the grouping module.

Example usage:

>>> get_extension('archive.zip')
'zip'
>>> get_extension('image.jpeg')
'jpeg'
>>> get_extension('index.xhtml')
'xhtml'
>>> get_extension('archive.tar.gz')
'gz'

Hexadecimal Words

Find every word that consists solely of the letters A, B, C, D, E, and F. The input is a variable containing all the words in the file dictionary.txt.

Tip

Modify the hexadecimal function in the grouping module.

Examples: decaf, bead, cab

>>> hexadecimal(dictionary)
['abbe', 'abed', 'accede', 'acceded', 'ace', 'aced', 'ad', 'add', 'added', 'baa', 'baaed', 'babe', 'bad', 'bade', 'be', 'bead', 'beaded', 'bed', 'bedded', 'bee', 'beef', 'beefed', 'cab', 'cabbed', 'cad', 'cafe', 'ceca', 'cede', 'ceded', 'dab', 'dabbed', 'dace', 'dad', 'dead', 'deaf', 'deb', 'decade', 'decaf', 'deed', 'deeded', 'def', 'deface', 'defaced', 'ebb', 'ebbed', 'ed', 'efface', 'effaced', 'fa', 'facade', 'face', 'faced', 'fad', 'fade', 'faded', 'fed', 'fee', 'feed']

Tetravocalic

Find all words that include four consecutive vowels. The input is a variable containing all the words in the file dictionary.txt.

Tip

Modify the tetravocalic function in the grouping module.

>>> tetravocalic(dictionary)
['aqueous', 'aqueously', 'archaeoastronomies', 'archaeoastronomy', 'assegaaied', 'assegaaiing', 'banlieue', 'banlieues', 'beauish', 'bioaeration', 'bioaerations', 'bioaeronautics', 'blooie', 'booai', 'booais', 'braaied', 'braaiing', 'camaieu', 'camaieux', 'cooee', 'cooeed', 'cooeeing', 'cooees', 'dequeue', 'dequeued', 'dequeueing', 'dequeues', 'dequeuing', 'enqueue', 'enqueued', 'enqueueing', 'enqueues', 'enqueuing', 'epigaeous', 'epopoeia', 'epopoeias', 'euoi', 'euouae', 'euouaes', 'flooie', 'forhooie', 'forhooied', 'forhooieing', 'forhooies', 'giaour', 'giaours', 'gooier', 'gooiest', 'guaiac', 'guaiacol', 'guaiacols', 'guaiacs', 'guaiacum', 'guaiacums', 'guaiocum', 'guaiocums', 'homoiousian', 'homoiousians', 'hypoaeolian', 'hypogaeous', 'looie', 'looies', 'louie', 'louies', 'maieutic', 'maieutical', 'maieutics', 'meoued', 'meouing', 'metasequoia', 'metasequoias', 'miaou', 'miaoued', 'miaouing', 'miaous', 'mythopoeia', 'mythopoeias', 'nonaqueous', 'obsequious', 'obsequiously', 'obsequiousness', 'obsequiousnesses', 'onomatopoeia', 'onomatopoeias', 'palaeoanthropic', 'palaeoecologic', 'palaeoecologies', 'palaeoecologist', 'palaeoecology', 'palaeoethnology', 'pharmacopoeia', 'pharmacopoeial', 'pharmacopoeian', 'pharmacopoeias', 'plateaued', 'plateauing', 'prosopopoeia', 'prosopopoeial', 'prosopopoeias', 'queue', 'queued', 'queueing', 'queueings', 'queuer', 'queuers', 'queues', 'queuing', 'queuings', 'radioautograph', 'radioautographic', 'radioautographies', 'radioautographs', 'radioautography', 'radioiodine', 'radioiodines', 'reliquiae', 'rhythmopoeia', 'rhythmopoeias', 'saouari', 'saouaris', 'scarabaeoid', 'scarabaeoids', 'sequoia', 'sequoias', 'subaqueous', 'tenuious', 'terraqueous', 'toeier', 'toeiest', 'zoaea', 'zoaeae', 'zoaeas', 'zoeae', 'zooea', 'zooeae', 'zooeal', 'zooeas', 'zoogloeae', 'zoogloeoid', 'zooier', 'zooiest']

Hexaconsonantal

Find at least one word with 6 consecutive consonants. For this problem treat y as a vowel. The input is a variable containing all the words in the file dictionary.txt.

Tip

Modify the hexaconsonantal function in the grouping module.

>>> re.findall(r"\b.*[^aeiouy\s]{6}.*\b", dictionary)
['bergschrund', 'bergschrunds', 'borschts', 'catchphrase', 'catchphrases', 'crwths', 'eschscholtzia', 'eschscholtzias', 'eschscholzia', 'eschscholzias', 'festschrift', 'festschriften', 'festschrifts', 'grrrls', 'latchstring', 'latchstrings', 'lengthsman', 'lengthsmen', 'sightscreen', 'sightscreens', 'tsktsk', 'tsktsked', 'tsktsking', 'tsktsks', 'watchspring', 'watchsprings', 'watchstrap', 'watchstraps', 'weltschmerz', 'weltschmerzes']
>>> re.findall(r'\b.*[bcdfghjklmnpqrstvwxz]{6}.*\b', dictionary)
['bergschrund', 'bergschrunds', 'borschts', 'catchphrase', 'catchphrases', 'crwths', 'eschscholtzia', 'eschscholtzias', 'eschscholzia', 'eschscholzias', 'festschrift', 'festschriften', 'festschrifts', 'grrrls', 'latchstring', 'latchstrings', 'lengthsman', 'lengthsmen', 'sightscreen', 'sightscreens', 'tsktsk', 'tsktsked', 'tsktsking', 'tsktsks', 'watchspring', 'watchsprings', 'watchstrap', 'watchstraps', 'weltschmerz', 'weltschmerzes']

Crossword Helper

Make a function possible_words that accepts a partial word with underscores representing missing letters and returns a list of all possible matches.

Tip

Modify the possible_words function in the grouping module.

Use your crossword helper function to solve the following:

  1. water tank: CIS____

  2. pastry: ___TE

  3. temporary: __A_S_E__

Repeat Letter

Find every word that contains 5 repeats of a given letter. The inputs are the letter and a variable containing all the words in the file dictionary.txt.

Tip

Modify the five_repeats function in the grouping module.

>>> five_repeats('n', dictionary)
['inconveniencing', 'nondenominational', 'nonindependent', 'nonintervention', 'noninterventions']