Grouping
So far our regular expressions have consisted solely of patterns that match individual characters or repeat individual character matches.
What if we want to act on a group?
For example, what if we want to match US ZIP codes in either their shortened form or their full form?
We’ve already matched shortened ZIP codes:
>>> re.search(r'^\d{5}$', '90210')
<_sre.SRE_Match object; span=(0, 5), match='90210'>
A full ZIP code match looks like this:
>>> re.search(r'^\d{5}-\d{4}$', '90210-4873')
<_sre.SRE_Match object; span=(0, 10), match='90210-4873'>
So far we haven’t seen a way to make that last part optional.
We could try putting a question mark after the - and after the repetition:
>>> re.search(r'^\d{5}-?\d{4}?$', '90210-4873')
<_sre.SRE_Match object; span=(0, 10), match='90210-4873'>
>>> re.search(r'^\d{5}-?\d{4}?$', '902104873')
<_sre.SRE_Match object; span=(0, 9), match='902104873'>
>>> re.search(r'^\d{5}-?\d{4}?$', '90210-')
That matches strange things though: '902104873' has no dash at all, yet it matches (also, what’s up with that ? after the repetition count?).
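As a quick check on that question: in Python’s re module, a ? placed directly after a repetition count such as {4} makes the quantifier lazy rather than optional, so four digits are still required. The sketch below compares the pattern with and without that trailing ? (the variable names are made up for illustration):

```python
import re

with_question_mark = r'^\d{5}-?\d{4}?$'
without_question_mark = r'^\d{5}-?\d{4}$'

samples = ['90210', '90210-', '902104873', '90210-4873']

# For each sample, record whether each pattern matches.
# Both patterns agree on every sample, because {4}? still
# has to consume exactly four digits.
results = [
    (bool(re.search(with_question_mark, s)),
     bool(re.search(without_question_mark, s)))
    for s in samples
]
print(results)
```

So the extra ? changes nothing here, and the plain 5-digit form still fails to match.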
To optionally match a number of consecutive character patterns, we can use a group:
>>> re.search(r'^\d{5}(-\d{4})?$', '90210-4873')
<_sre.SRE_Match object; span=(0, 10), match='90210-4873'>
>>> re.search(r'^\d{5}(-\d{4})?$', '90210')
<_sre.SRE_Match object; span=(0, 5), match='90210'>
>>> re.search(r'^\d{5}(-\d{4})?$', '902104873')
>>> re.search(r'^\d{5}(-\d{4})?$', '90210-')
This allows us to match 5 digits, optionally followed by a dash and 4 digits (if either the dash or the 4 digits is present, both must be).
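If we wanted to reuse this pattern, we could wrap it in a small helper. This is only a sketch: the names ZIP_RE and is_zip_code are made up for illustration and are not part of the exercises.

```python
import re

# Same pattern as above: 5 digits, then an optional group
# holding a dash followed by 4 digits.
ZIP_RE = re.compile(r'^\d{5}(-\d{4})?$')


def is_zip_code(text):
    """Return True if text looks like a short or full US ZIP code."""
    return ZIP_RE.search(text) is not None


print(is_zip_code('90210'))       # short form
print(is_zip_code('90210-4873'))  # full form
print(is_zip_code('902104873'))   # missing dash
print(is_zip_code('90210-'))      # dash without digits
```

The first two calls should print True and the last two False, matching the four searches shown above.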
Capture Groups
We’ve already talked about using groups to allow for quantifying a group of character patterns.
There’s actually another purpose for groups though.
Groups also allow capturing characters matched by a group.
Remember how we used the group method to access the matched data? We can pass arguments to that method to access captured groups.
For example, in our ZIP code regular expression, we can get the first matching group like this:
>>> m = re.search(r'^\d{5}(-\d{4})?$', '90210-4873')
>>> m.group(1)
'-4873'
>>> m.group()
'90210-4873'
If we want to always access just the first 5 digits, we could put those in a group:
>>> m = re.search(r'(^\d{5})(-\d{4})?$', '90210-4873')
>>> m.group(2)
'-4873'
>>> m.group(1)
'90210'
Note that if we access group 0, that will give us the entire match, just like when we pass no arguments:
>>> m.group(0)
'90210-4873'
>>> m.group()
'90210-4873'
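One detail worth knowing: when an optional group does not take part in the match, the group method returns None for it. Here is a minimal sketch using the two-group pattern from above; the function name split_zip is hypothetical.

```python
import re


def split_zip(text):
    """Return (first 5 digits, extension or None), or None if no match."""
    m = re.search(r'(^\d{5})(-\d{4})?$', text)
    if m is None:
        return None
    # group(2) is None when the optional "-1234" part is absent.
    return m.group(1), m.group(2)


print(split_zip('90210-4873'))
print(split_zip('90210'))
print(split_zip('9021'))
```

This should print ('90210', '-4873'), then ('90210', None), then None.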
Uncapturing Groups
There’s an important caveat to be aware of when using findall. If we have capturing groups in our expression, the full match won’t be returned:
>>> re.findall(r'\d{5}(-\d{4})?', '90210-4873\n12345')
['-4873', '']
If there is a single capturing group, the findall function returns a list of that group’s contents.
If there are multiple capturing groups, the findall function returns a list of tuples holding all the group contents:
>>> re.findall(r'(\d{5})(-\d{4})?', '90210-4873\n12345')
[('90210', '-4873'), ('12345', '')]
>>> re.findall(r'(\d{5}(-\d{4})?)', '90210-4873\n12345')
[('90210-4873', '-4873'), ('12345', '')]
Remember that groups have two purposes:
Grouping patterns
Capturing the strings that were matched by the group
What if we want to use parentheses to make a group, but we don’t want to capture in our group?
We can use an uncapturing group (more commonly called a non-capturing group) for this.
Uncapturing groups have a really weird syntax.
You’ll probably want to refer to the cheat sheet when you realize you need to use them. To make an uncapturing group you put ?: (a question mark and a colon) after the opening parenthesis for the group.
>>> re.findall(r'\d{5}(?:-\d{4})?', '90210-4873\n12345')
['90210-4873', '12345']
The reason this syntax is so weird is that the creators wanted to maintain backwards compatibility, and this particular syntax was invalid in regular expression parsers up to that point (? makes no sense after ( normally).
Hopefully you won’t need this feature often.
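If we want to keep a capturing group and still get the full match, another option is re.finditer, which yields match objects instead of strings. A small sketch using the same sample text as above (the variable names are just for illustration):

```python
import re

text = '90210-4873\n12345'
pattern = r'\d{5}(-\d{4})?'

# finditer yields match objects, so group() still returns the
# whole match even though the pattern has a capturing group.
full_matches = [m.group() for m in re.finditer(pattern, text)]

# group(1) is None when the optional group did not take part,
# whereas findall reports an empty string in that case.
extensions = [m.group(1) for m in re.finditer(pattern, text)]

print(full_matches)
print(extensions)
```

This should print ['90210-4873', '12345'] followed by ['-4873', None].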
More Search Exercises
These exercises are all in the grouping.py file in the exercises directory.
Edit the appropriate function in that file to complete each exercise.
To run the tests, from the exercises folder, type python test.py <function_name>, like this:
$ python test.py get_extension
Note
Most of these exercises involve searching in a dictionary.
You can find the contents of this dictionary file in the dictionary variable within the grouping module.
Get File Extension
Make a function that accepts a full file path and returns the file extension.
Tip
Modify the get_extension function in the grouping module.
Example usage:
>>> get_extension('archive.zip')
'zip'
>>> get_extension('image.jpeg')
'jpeg'
>>> get_extension('index.xhtml')
'xhtml'
>>> get_extension('archive.tar.gz')
'gz'
Hexadecimal Words
Find every word that consists solely of the letters A, B, C, D, E, and F.
The input is a variable containing all the words in the file dictionary.txt.
Tip
Modify the hexadecimal function in the grouping module.
Examples: decaf, bead, cab
>>> hexadecimal(dictionary)
['abbe', 'abed', 'accede', 'acceded', 'ace', 'aced', 'ad', 'add', 'added', 'baa', 'baaed', 'babe', 'bad', 'bade', 'be', 'bead', 'beaded', 'bed', 'bedded', 'bee', 'beef', 'beefed', 'cab', 'cabbed', 'cad', 'cafe', 'ceca', 'cede', 'ceded', 'dab', 'dabbed', 'dace', 'dad', 'dead', 'deaf', 'deb', 'decade', 'decaf', 'deed', 'deeded', 'def', 'deface', 'defaced', 'ebb', 'ebbed', 'ed', 'efface', 'effaced', 'fa', 'facade', 'face', 'faced', 'fad', 'fade', 'faded', 'fed', 'fee', 'feed']
Tetravocalic
Find all words that include four consecutive vowels.
The input is a variable containing all the words in the file dictionary.txt.
Tip
Modify the tetravocalic function in the grouping module.
>>> tetravocalic(dictionary)
['aqueous', 'aqueously', 'archaeoastronomies', 'archaeoastronomy', 'assegaaied', 'assegaaiing', 'banlieue', 'banlieues', 'beauish', 'bioaeration', 'bioaerations', 'bioaeronautics', 'blooie', 'booai', 'booais', 'braaied', 'braaiing', 'camaieu', 'camaieux', 'cooee', 'cooeed', 'cooeeing', 'cooees', 'dequeue', 'dequeued', 'dequeueing', 'dequeues', 'dequeuing', 'enqueue', 'enqueued', 'enqueueing', 'enqueues', 'enqueuing', 'epigaeous', 'epopoeia', 'epopoeias', 'euoi', 'euouae', 'euouaes', 'flooie', 'forhooie', 'forhooied', 'forhooieing', 'forhooies', 'giaour', 'giaours', 'gooier', 'gooiest', 'guaiac', 'guaiacol', 'guaiacols', 'guaiacs', 'guaiacum', 'guaiacums', 'guaiocum', 'guaiocums', 'homoiousian', 'homoiousians', 'hypoaeolian', 'hypogaeous', 'looie', 'looies', 'louie', 'louies', 'maieutic', 'maieutical', 'maieutics', 'meoued', 'meouing', 'metasequoia', 'metasequoias', 'miaou', 'miaoued', 'miaouing', 'miaous', 'mythopoeia', 'mythopoeias', 'nonaqueous', 'obsequious', 'obsequiously', 'obsequiousness', 'obsequiousnesses', 'onomatopoeia', 'onomatopoeias', 'palaeoanthropic', 'palaeoecologic', 'palaeoecologies', 'palaeoecologist', 'palaeoecology', 'palaeoethnology', 'pharmacopoeia', 'pharmacopoeial', 'pharmacopoeian', 'pharmacopoeias', 'plateaued', 'plateauing', 'prosopopoeia', 'prosopopoeial', 'prosopopoeias', 'queue', 'queued', 'queueing', 'queueings', 'queuer', 'queuers', 'queues', 'queuing', 'queuings', 'radioautograph', 'radioautographic', 'radioautographies', 'radioautographs', 'radioautography', 'radioiodine', 'radioiodines', 'reliquiae', 'rhythmopoeia', 'rhythmopoeias', 'saouari', 'saouaris', 'scarabaeoid', 'scarabaeoids', 'sequoia', 'sequoias', 'subaqueous', 'tenuious', 'terraqueous', 'toeier', 'toeiest', 'zoaea', 'zoaeae', 'zoaeas', 'zoeae', 'zooea', 'zooeae', 'zooeal', 'zooeas', 'zoogloeae', 'zoogloeoid', 'zooier', 'zooiest']
Hexaconsonantal
Find at least one word with 6 consecutive consonants. For this problem, treat y as a vowel.
The input is a variable containing all the words in the file dictionary.txt.
Tip
Modify the hexaconsonantal function in the grouping module.
>>> re.findall(r"\b.*[^aeiouy\s]{6}.*\b", dictionary)
['bergschrund', 'bergschrunds', 'borschts', 'catchphrase', 'catchphrases', 'crwths', 'eschscholtzia', 'eschscholtzias', 'eschscholzia', 'eschscholzias', 'festschrift', 'festschriften', 'festschrifts', 'grrrls', 'latchstring', 'latchstrings', 'lengthsman', 'lengthsmen', 'sightscreen', 'sightscreens', 'tsktsk', 'tsktsked', 'tsktsking', 'tsktsks', 'watchspring', 'watchsprings', 'watchstrap', 'watchstraps', 'weltschmerz', 'weltschmerzes']
>>> re.findall(r'\b.*[bcdfghjklmnpqrstvwxz]{6}.*\b', dictionary)
['bergschrund', 'bergschrunds', 'borschts', 'catchphrase', 'catchphrases', 'crwths', 'eschscholtzia', 'eschscholtzias', 'eschscholzia', 'eschscholzias', 'festschrift', 'festschriften', 'festschrifts', 'grrrls', 'latchstring', 'latchstrings', 'lengthsman', 'lengthsmen', 'sightscreen', 'sightscreens', 'tsktsk', 'tsktsked', 'tsktsking', 'tsktsks', 'watchspring', 'watchsprings', 'watchstrap', 'watchstraps', 'weltschmerz', 'weltschmerzes']
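To see how the first of those expressions behaves without the full word list, here is a sketch that runs it against a tiny made-up sample. It assumes, judging from the findall calls above, that dictionary is a single string with one word per line; the sample words and the function body are illustrative only.

```python
import re


def hexaconsonantal(dictionary):
    """Return words containing 6 consecutive consonants (y counts as a vowel)."""
    # . does not match a newline, so each match stays on one line.
    # [^aeiouy\s] means "anything that is not a vowel or whitespace".
    return re.findall(r'\b.*[^aeiouy\s]{6}.*\b', dictionary)


# Hypothetical stand-in for the real dictionary variable.
sample = 'catchphrase\nqueue\ncrwths\nbanana'

found = hexaconsonantal(sample)
print(found)
```

With this sample the result should be ['catchphrase', 'crwths'].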
Crossword Helper
Make a function possible_words that accepts a partial word with underscores representing missing letters and returns a list of all possible matches.
Tip
Modify the possible_words function in the grouping module.
Use your crossword helper function to solve the following:
water tank: CIS____
pastry: ___TE
temporary: __A_S_E__
Repeat Letter
Find every word that contains 5 occurrences of a given letter.
The input is a variable containing all the words in the file dictionary.txt.
Tip
Modify the five_repeats function in the grouping module.
The sample output below is what you would expect when letter is 'n':
>>> five_repeats(letter, dictionary)
['inconveniencing', 'nondenominational', 'nonindependent', 'nonintervention', 'noninterventions']