Regular expressions, also known as regex or regexp, are sequences of characters that define a search pattern used mainly for string pattern matching and manipulation. Regex allows developers to check if a string contains a specific pattern or matches a predefined condition.
Mastering regular expressions is an essential skill for any Python developer. Regex knowledge comes in handy for tasks like data wrangling, log parsing, data scraping, validating user inputs, and much more. This comprehensive guide will walk through the fundamentals of regex matching in Python and provide an implementation of a function to check if a string matches a given regular expression.
Table of Contents
- Overview of Regular Expressions
- Matching a String Against a Regex in Python
- Function to Check Regex Match
- Handling Regex Flags
- Matching Multiple Patterns
- Extracting Matching Substrings
- Matching Multiple Groups
- Lookahead and Lookbehind
- Greedy vs Lazy Matching
- Raw String Notation
- Common Regex Tasks and Examples
- Regex Limitations
- Conclusion
Overview of Regular Expressions
A regular expression consists of a combination of literal characters and metacharacters that enable flexible matching of text patterns. Some common metacharacters used in regex include:
- . - Matches any single character except newline
- * - Matches zero or more repetitions of the preceding regex
- + - Matches one or more repetitions of the preceding regex
- ? - Makes the preceding regex optional
- {n,m} - Matches at least n but not more than m repetitions of the preceding regex
- [abc] - Matches a, b, or c
- (regex) - Groups regex and captures the match
For example, the regex a.c will match strings like abc, aac, etc. And a* will match a, aa, aaa, etc., as well as the empty string, since * allows zero repetitions.
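The metacharacters listed above can be tried out directly. The following is a small sketch that uses re.fullmatch(), which requires the whole string to match the pattern; the sample patterns and strings are illustrative choices only.

```python
import re

# (pattern, candidate string) pairs; all samples are illustrative.
samples = [
    (r'a.c', 'abc'),         # . matches exactly one character
    (r'a.c', 'ac'),          # nothing for . to match
    (r'a*', ''),             # * allows zero repetitions
    (r'a+', ''),             # + needs at least one repetition
    (r'colou?r', 'color'),   # ? makes the u optional
    (r'a{2,4}', 'aaaaa'),    # five repetitions exceed the maximum of four
    (r'[abc]', 'b'),         # character set
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text in samples]

for (pattern, text), matched in zip(samples, results):
    print(f'{pattern!r} against {text!r}: {matched}')
```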
Python has built-in support for regular expressions through the re module. We can use re functions like match(), search(), findall(), etc. to perform regex operations on strings.
Matching a String Against a Regex in Python
The re.match() function in Python checks if the beginning of a string matches the given regular expression. It returns a match object if the regex matches at the start of the string, or None if it does not.
import re

pattern = r'foo'
string = 'foo bar'

match = re.match(pattern, string)
if match:
    print('Match found')
else:
    print('No match')
This will print “Match found” since the start of the string ‘foo bar’ matches the pattern ‘foo’.
The match() function matches only at the start of the string. To search for a regex match anywhere in the string, we can use re.search() instead.
match = re.search(pattern, string)
if match:
    print('Match found')
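To make the difference concrete, here is a short sketch comparing re.match(), re.search(), and re.fullmatch() on the same input. The pattern is changed to bar for illustration so that the match is not at the start of the string.

```python
import re

pattern = r'bar'
string = 'foo bar'

at_start = re.match(pattern, string)      # anchored at index 0
anywhere = re.search(pattern, string)     # scans the whole string
whole = re.fullmatch(pattern, string)     # entire string must match

print(at_start)          # None, because the string starts with 'foo'
print(anywhere.span())   # (4, 7)
print(whole)             # None, because 'foo ' is left over
```

re.fullmatch() is usually the right choice when a string has to match a pattern in its entirety, for example when validating input.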
Function to Check Regex Match
We can wrap this matching logic in a function that accepts the regex pattern and input string as arguments and returns a Boolean indicating if the string matches the pattern or not.
import re

def regex_match(pattern, string):
    match = re.search(pattern, string)
    return bool(match)

print(regex_match(r'foo', 'foo bar')) # True
print(regex_match(r'baz', 'foo bar')) # False
This basic implementation searches the entire input string for a match against the regex pattern passed in.
Some key points to note:
- The re.search() method is used here instead of match() to allow matching anywhere in the string instead of just the start.
- The return value of re.search() is converted to a Boolean using bool(). This returns True if a match is found, False otherwise.
- The regex pattern is passed in as a raw string r'somepattern' so that backslashes in the pattern do not have to be escaped twice.
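One thing the basic function does not handle is an invalid pattern: re.search() raises re.error when the pattern cannot be compiled. The sketch below shows one possible way to deal with that. The name regex_match_safe is hypothetical, and returning False for a bad pattern is an assumption that may not suit every caller.

```python
import re

def regex_match_safe(pattern, string):
    """Hypothetical variant of regex_match that treats an invalid
    pattern as 'no match' instead of raising an exception."""
    try:
        return bool(re.search(pattern, string))
    except re.error:
        # The pattern could not be compiled, e.g. an unclosed group.
        return False

print(regex_match_safe(r'foo', 'foo bar'))  # True
print(regex_match_safe(r'(', 'foo bar'))    # False: '(' is not a valid pattern
```

Swallowing the error is convenient when patterns come from user input, but it hides mistakes in hard-coded patterns, so letting the exception propagate is often the better default.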
Handling Regex Flags
Regex patterns can take optional flags that modify the matching behavior. For example:
- re.I - Makes the matching case-insensitive
- re.M - Enables multiline mode where ^ and $ match the start and end of each line instead of the whole string
We can update our function to accept an optional flags parameter and pass it to the re.search() call:
import re

def regex_match(pattern, string, flags=0):
    match = re.search(pattern, string, flags)
    return bool(match)

# Case-insensitive search
print(regex_match(r'foo', 'Food bar', re.I)) # True

# Multiline mode: ^ matches at the start of each line
s = '''foo
BAR
baz'''
print(regex_match(r'^BAR', s, re.M)) # True
Now we can pass re.I, re.M, etc. as the flags parameter to modify the regex matching accordingly.
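Flags can also be combined with the bitwise OR operator |. The sketch below repeats the regex_match function so that it runs on its own, and shows that multiline mode by itself does not make a match case-insensitive.

```python
import re

def regex_match(pattern, string, flags=0):
    return bool(re.search(pattern, string, flags))

text = 'foo\nBAR\nbaz'

# re.M alone: ^ matches at each line start, but 'bar' != 'BAR'
only_multiline = regex_match(r'^bar', text, re.M)

# re.M | re.I: line-start anchoring plus case-insensitive comparison
combined = regex_match(r'^bar', text, re.M | re.I)

print(only_multiline)  # False
print(combined)        # True
```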
Matching Multiple Patterns
Sometimes we may want to check if a string matches any one of several regex patterns. This can be done by checking each pattern one by one:
patt1 = r'foo'
patt2 = r'\d{3}'
patt3 = r'baz'
string = 'foo 123'

if regex_match(patt1, string) or regex_match(patt2, string) or regex_match(patt3, string):
    print('Match found')
else:
    print('No match')
However, a more concise way is to join the patterns using the alternation operator | inside a group:
patt = r'(foo|\d{3}|baz)'

if regex_match(patt, string):
    print('Match found')
else:
    print('No match')
This allows matching against multiple patterns much more concisely in one call.
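When the patterns live in a list, the alternation can be built programmatically. The following is a minimal sketch; the helper names any_pattern_matches and any_literal_present are hypothetical, and both assume a non-empty list. Each pattern is wrapped in a non-capturing group (?:...) so that an alternation inside one pattern cannot leak into its neighbours, and re.escape() is used when the inputs are literal strings rather than regexes.

```python
import re

def any_pattern_matches(patterns, string):
    """Hypothetical helper: True if any regex in `patterns` matches."""
    combined = '|'.join(f'(?:{p})' for p in patterns)
    return re.search(combined, string) is not None

def any_literal_present(literals, string):
    """Hypothetical helper: True if any literal string occurs in `string`.
    re.escape() neutralises metacharacters such as '.' and '+'."""
    combined = '|'.join(re.escape(s) for s in literals)
    return re.search(combined, string) is not None

patterns = [r'foo', r'\d{3}', r'baz']
print(any_pattern_matches(patterns, 'foo 123'))   # True
print(any_pattern_matches(patterns, 'qux 12'))    # False
print(any_literal_present(['a.b'], 'axb'))        # False: '.' is literal here
print(any_literal_present(['a.b'], 'xa.by'))      # True
```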
Extracting Matching Substrings
In addition to checking for a match, we often need to extract the actual matching portion of the string.
The match object returned by re.search() contains information about the match, including the matching substring itself.
We can update our function to return this match object:
import re

def regex_match(pattern, string, flags=0):
    match = re.search(pattern, string, flags)
    return match

match = regex_match(r'\d+', 'foo 123 bar')
if match:
    print(match.group()) # 123
The match.group() method returns the substring that was matched by the pattern.
So our function now returns the match object on success, which allows extracting the matching portion from the input string.
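Besides the matched text, the match object also records where the match was found. A brief sketch using the start(), end(), and span() methods on the same example string:

```python
import re

string = 'foo 123 bar'
match = re.search(r'\d+', string)

if match:
    start, end = match.span()        # same as (match.start(), match.end())
    print(match.group())             # 123
    print(start, end)                # 4 7
    print(string[start:end])         # 123, slicing gives the same text
```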
Matching Multiple Groups
Parentheses in a regex pattern denote capturing groups that allow extracting multiple matched substrings.
For example:
match = regex_match(r'(\w+), (\d+)', 'foo, 123')
if match:
    print(match.group(1)) # foo
    print(match.group(2)) # 123
Here (\w+) and (\d+) are two capturing groups that match word characters and digits respectively.
We can access the strings matched by each group using match.group(n), where n is the group number starting from 1.
Named capturing groups can also be used for better readability:
patt = r'(?P<name>\w+), (?P<num>\d+)'
match = regex_match(patt, 'foo, 123')
if match:
    print(match.group('name')) # foo
    print(match.group('num')) # 123
This allows extracting substrings matched by different regex groups easily.
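When all groups are needed at once, the match object's groups() and groupdict() methods avoid calling group() repeatedly. A short sketch with the same named pattern:

```python
import re

patt = r'(?P<name>\w+), (?P<num>\d+)'
match = re.search(patt, 'foo, 123')

if match:
    all_groups = match.groups()       # tuple of every captured group
    named = match.groupdict()         # dict of named groups only
    print(all_groups)                 # ('foo', '123')
    print(named)                      # {'name': 'foo', 'num': '123'}
```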
Lookahead and Lookbehind
Additional regex capabilities like lookahead and lookbehind assertions can be used to match patterns based on what precedes or follows them, without including those portions in the match.
Positive lookahead (?=...) asserts that the given pattern must follow the current position, without consuming it.
patt = r'\w+(?=, \d+)'
match = regex_match(patt, 'foo, 123')
print(match.group()) # foo
The positive lookahead (?=, \d+) ensures what follows is a comma and digits, but does not include them in the match.
Similarly, positive lookbehind (?<=...) asserts that the given pattern must precede the current position.
patt = r'(?<=\w), \d+'
match = regex_match(patt, 'foo, 123')
print(match.group()) # , 123
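The lookbehind (?<=\w) asserts that a word character comes immediately before the match, but does not include it in the match. Note that Python's re module requires lookbehind patterns to have a fixed width.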
Regex lookarounds like these provide powerful matching capabilities.
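Both assertions also have negative forms, (?!...) and (?<!...), which succeed only when the given pattern is absent. The sketch below uses made-up sample strings to illustrate them.

```python
import re

# Negative lookahead: 'foo' that is NOT followed by 'bar'
ahead = re.search(r'foo(?!bar)', 'foobar foobaz')
print(ahead.span())    # (7, 10): the 'foo' in 'foobaz'

# Negative lookbehind: a number that is NOT preceded by '$'.
# \b stops the engine from matching the tail of '100' instead.
behind = re.search(r'(?<!\$)\b\d+', 'cost $100 or 200 units')
print(behind.group())  # 200
```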
Greedy vs Lazy Matching
The repetition quantifiers *, +, ? and {n,m} perform greedy matching in regex, matching as many repetitions as possible.
Adding a ? after them makes the quantifier lazy, matching as few repetitions as needed.
For example:
# Greedy
patt = r'<.+>'
print(regex_match(patt, '<tag1>foo</tag1>').group()) # <tag1>foo</tag1>
# Lazy
patt = r'<.+?>'
print(regex_match(patt, '<tag1>foo</tag1>').group()) # <tag1>
The greedy .+ matches everything from the first < to the last >, while the lazy .+? matches only up to the first >.
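The difference becomes more visible with re.findall(), which returns every non-overlapping match. The sketch below also shows a negated character class, [^>]+, as an alternative to the lazy quantifier; it is a common idiom rather than something the example above requires.

```python
import re

html = '<tag1>foo</tag1>'

greedy = re.findall(r'<.+>', html)      # one match spanning the whole string
lazy = re.findall(r'<.+?>', html)       # each tag separately
negated = re.findall(r'<[^>]+>', html)  # same result without laziness

print(greedy)   # ['<tag1>foo</tag1>']
print(lazy)     # ['<tag1>', '</tag1>']
print(negated)  # ['<tag1>', '</tag1>']
```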
Raw String Notation
Python’s raw string notation r'' is an important regex tool. Backslashes (\) have special meaning both in regex and as Python string escapes.
Using raw strings avoids having to escape every backslash twice.
For example:
# Without raw string
patt = '\\w+'
# With raw string
patt = r'\w+'
Raw strings simplify working with regex patterns in Python.
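Forgetting the r prefix can also change the meaning of a pattern silently. In a normal Python string, \b is the backspace character, whereas in a regex it is a word boundary. A small sketch of the pitfall:

```python
import re

plain = '\bfoo\b'     # Python turns each \b into a backspace character
raw = r'\bfoo\b'      # the regex engine sees \b as a word boundary

print(len(plain), len(raw))                # 5 7

plain_result = re.search(plain, 'a foo b')
raw_result = re.search(raw, 'a foo b')

print(plain_result)                        # None: no backspaces in the text
print(raw_result.group())                  # foo
```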
Common Regex Tasks and Examples
Some examples of common regex use cases:
Pattern Matching
Check if a string contains digits:
patt = r'\d'
bool(regex_match(patt, 'this string has 123 numbers')) # True
Validate phone numbers (anchored with ^ and $ so that the whole string must match):
phone_patt = r'^\d{3}-\d{3}-\d{4}$'
bool(regex_match(phone_patt, '123-456-7890')) # True
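For validation, the whole string usually has to match, not just a part of it. One way to enforce that, sketched here, is re.fullmatch() on a compiled pattern. The names PHONE_RE and is_valid_phone are hypothetical, and the NNN-NNN-NNNN format is simply the one used in this example.

```python
import re

# Compiled once and reused; the format is an assumption from this example.
PHONE_RE = re.compile(r'\d{3}-\d{3}-\d{4}')

def is_valid_phone(value):
    """Hypothetical validator: True only if the entire string is a phone number."""
    return PHONE_RE.fullmatch(value) is not None

print(is_valid_phone('123-456-7890'))            # True
print(is_valid_phone('call 123-456-7890 now'))   # False
print(bool(PHONE_RE.search('call 123-456-7890 now')))  # True: search finds a substring
```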
Search and Extract
Extract all email IDs from text:
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)
Find prices from product descriptions:
prices = re.findall(r'£\d+\.\d{2}', descriptions)
Replace and Modify
Remove all whitespace from a string:
re.sub(r'\s+', '', text)
Change date formats:
date_str = re.sub(r'(\d{2})-(\d{2})-(\d{4})', r'\3-\1-\2', '12-31-2022')
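Numbered backreferences such as \3-\1-\2 become hard to read as patterns grow. The same rewrite can be sketched with named groups and the \g<name> replacement syntax. The group names and the to_iso helper are illustrative assumptions, and the input is assumed to be in MM-DD-YYYY order.

```python
import re

# Group names are illustrative; input is assumed to be MM-DD-YYYY.
DATE_RE = re.compile(r'(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})')

def to_iso(text):
    """Hypothetical helper: rewrite MM-DD-YYYY dates as YYYY-MM-DD.
    Text that does not look like a date is left unchanged."""
    return DATE_RE.sub(r'\g<year>-\g<month>-\g<day>', text)

print(to_iso('12-31-2022'))       # 2022-12-31
print(to_iso('due 01-15-2023'))   # due 2023-01-15
```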
Regex Limitations
While regular expressions are versatile, some limitations exist:
- Readability: Complex regex can be hard to decipher and maintain. Long patterns should be split into simpler parts.
- Greediness: Greedy quantifiers often match more than intended, requiring lazy modifiers or lookarounds to restrict matches.
- Backtracking: The regex engine backtracks to explore alternative ways of matching, which can cause severe slowdowns on some patterns and inputs.
- Escape sequences: Lots of backslash escaping is needed to represent special characters; raw string literals help avoid this.
- Unicode: Unicode grapheme boundaries and text shaping rules are not handled by default. The re module’s Unicode support is limited.
- Recursion: Python’s re module does not support recursive patterns, so arbitrarily nested structures such as balanced parentheses cannot be matched in general.
Thoughtful design and testing are needed to create efficient regex that are easy to use and maintain.
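One tool that helps with the readability problem is the re.VERBOSE flag, which lets a pattern be spread over several lines with comments. In verbose mode, whitespace in the pattern is ignored and # starts a comment. A minimal sketch, reusing the phone number format from earlier (the name PHONE_RE is illustrative):

```python
import re

PHONE_RE = re.compile(r'''
    \d{3}    # first block of three digits
    -
    \d{3}    # second block of three digits
    -
    \d{4}    # final block of four digits
''', re.VERBOSE)

ok = PHONE_RE.fullmatch('123-456-7890') is not None
bad = PHONE_RE.fullmatch('123 456 7890') is not None

print(ok)    # True
print(bad)   # False: spaces in the input are not ignored, only in the pattern
```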
Conclusion
This guide covered key aspects of implementing regular expression matching in Python. The re module provides excellent regex capabilities for string pattern matching and extraction.
Some takeaways:
- Use raw string notation r'' for regex patterns to avoid excessive escaping
- re.search() matches anywhere in the string, re.match() matches only at the start
- Capture groups and lookarounds provide additional matching control
- Greedy vs lazy quantifiers change how much is matched
- Return match objects to extract substrings instead of just Booleans
- Balance regex power and readability, test thoroughly for edge cases
Regex skills are invaluable for Python developers. Mastering regex matching unlocks powerful text processing capabilities and is a must for any Python programmer’s toolkit.