Python - Regular Expression

Regular expressions are a powerful tool for working with text data in Python. They allow you to search for and match specific patterns of characters within strings.

In Python, regular expressions are supported by the re module. Here are some common functions and methods provided by this module:

  • re.compile(pattern, flags=0): Compiles a regular expression pattern into a regular expression object, which can be used for matching.
  • re.search(pattern, string, flags=0): Searches the string for a match to the regular expression pattern, returning a Match object if a match is found, or None otherwise.
  • re.match(pattern, string, flags=0): Attempts to match the regular expression pattern at the beginning of the string, returning a Match object if a match is found, or None otherwise.
  • re.findall(pattern, string, flags=0): Returns a list of all non-overlapping matches of the regular expression pattern within the string.
  • re.sub(pattern, repl, string, count=0, flags=0): Returns a new string in which occurrences of the regular expression pattern are replaced with repl, which can be a string or a function. If count is nonzero, at most count occurrences are replaced.
  • re.split(pattern, string, maxsplit=0, flags=0): Splits the string into a list of substrings at every occurrence of the regular expression pattern.
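
The functions listed above can be compared side by side on one string. The following is a minimal sketch; the sample text and the variable names are illustrative assumptions, not part of any real data set.

```python
import re

text = "Order 66 shipped on 2024-01-15, order 67 on 2024-02-03."

# Compile once when the same pattern is reused several times.
date_re = re.compile(r"\d{4}-\d{2}-\d{2}")

first = date_re.search(text)          # first match anywhere in the string
at_start = date_re.match(text)        # None: the string does not start with a date
all_dates = date_re.findall(text)     # every non-overlapping match, as strings
masked = date_re.sub("<date>", text)  # replace each match with a fixed string
parts = re.split(r",\s*", text)       # split on a comma plus optional whitespace

print(first.group())  # 2024-01-15
print(at_start)       # None
print(all_dates)      # ['2024-01-15', '2024-02-03']
print(masked)         # Order 66 shipped on <date>, order 67 on <date>.
print(parts)          # ['Order 66 shipped on 2024-01-15', 'order 67 on 2024-02-03.']
```

Note that search() and match() return a Match object or None, so real code should check the result before calling group().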

Regular expressions use special characters and symbols to represent different patterns. Here are some common ones:

  • .: Matches any single character except newline.
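  • ^: Matches the beginning of the string, and also the beginning of each line when re.MULTILINE is set.
  • $: Matches the end of the string or just before a newline at the end of the string, and also the end of each line when re.MULTILINE is set.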
  • *: Matches zero or more occurrences of the preceding character or group.
  • +: Matches one or more occurrences of the preceding character or group.
  • ?: Matches zero or one occurrence of the preceding character or group.
  • {m}: Matches exactly m occurrences of the preceding character or group.
  • {m,n}: Matches between m and n occurrences of the preceding character or group.
  • []: Matches any single character within the brackets.
  • |: Matches either the expression before or after the symbol.
  • (): Groups expressions together.
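
To see several of these symbols at work, the sketch below filters a small word list with different patterns. The word list and variable names are made up for illustration.

```python
import re

samples = ["cat", "cot", "coat", "ct", "cats"]

# ^ and $ anchor the pattern to the whole string; . matches exactly one character.
one_char = [s for s in samples if re.search(r"^c.t$", s)]

# [ao] matches a single 'a' or 'o'; ? makes the trailing 's' optional.
bracket = [s for s in samples if re.search(r"^c[ao]ts?$", s)]

# | chooses between alternatives; () limits its scope to 'a' or 'oa'.
alternation = [s for s in samples if re.search(r"^c(a|oa)t$", s)]

# {1,2} allows one or two characters from the bracket set.
bounded = [s for s in samples if re.search(r"^c[ao]{1,2}t$", s)]

print(one_char)     # ['cat', 'cot']
print(bracket)      # ['cat', 'cot', 'cats']
print(alternation)  # ['cat', 'coat']
print(bounded)      # ['cat', 'cot', 'coat']
```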

Regular expression modifiers, also known as option flags, allow you to modify the behavior of a regular expression. Unlike Perl or JavaScript, Python has no /pattern/flags literal syntax: flags are passed through the flags argument of re.compile() and the other re functions, or written inline at the start of the pattern, as in (?i)hello. Here are some of the most commonly used flags:

  • re.IGNORECASE (re.I): Makes matching case-insensitive. For example, the pattern hello compiled with this flag matches "hello", "Hello", "HELLO", and so on.
  • re.MULTILINE (re.M): Makes ^ and $ match at the beginning and end of each line, not only at the beginning and end of the whole string. For example, the pattern ^hello with this flag matches "hello" at the beginning of any line in a multi-line string.
  • re.DOTALL (re.S): Makes the dot (.) match any character, including newlines. For example, the pattern ^hello.*world$ with this flag matches "hello" followed by any characters (including newlines) and then "world" at the end of the string.
  • re.VERBOSE (re.X): Lets you spread a pattern over several lines and add # comments for readability. Whitespace in the pattern is ignored, so a literal space must be written as \s, as an escaped space, or inside a character class.

Here are some examples of regular expressions with modifiers:

import re
# Ignore case
pattern = re.compile("hello", re.IGNORECASE)
result = pattern.search("Hello, World!")
print(result.group()) # Output: Hello

# Multiline
pattern = re.compile(r"^hello", re.MULTILINE)
# "hello" starts the second line, so ^ only matches there because of re.MULTILINE
result = pattern.search("World\nhello\n")
print(result.group()) # Output: hello

# Dotall
pattern = re.compile(r"^hello.*world$", re.DOTALL)
result = pattern.search("hello\nworld")
print(repr(result.group())) # Output: 'hello\nworld'

# Verbose
pattern = re.compile(r"""
  ^         # Match the beginning of a line
  hello     # Match the word 'hello'
  \s+       # Match one or more whitespace characters
  world     # Match the word 'world'
  $         # Match the end of a line
""", re.VERBOSE)
result = pattern.search("hello     world")
print(result.group()) # Output: hello     world
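
Flags can also be combined. The sketch below, using an invented sample string, applies two flags at once with the bitwise OR operator and then shows what should be the equivalent inline form, where the flags are written as (?im) at the very start of the pattern.

```python
import re

text = "first line\nHELLO there\nhello again"

# Combine flags with | so that matching is case-insensitive AND line-aware.
pattern = re.compile(r"^hello", re.IGNORECASE | re.MULTILINE)
combined = pattern.findall(text)

# Inline flags: (?im) switches on IGNORECASE and MULTILINE for the whole pattern.
inline = re.findall(r"(?im)^hello", text)

print(combined)  # ['HELLO', 'hello']
print(inline)    # ['HELLO', 'hello']
```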

Special character classes are predefined character classes that match a particular type of character in a string. Here are some of the most commonly used special character classes in regular expressions:

  • \d: Matches any decimal digit. With the re.ASCII flag this is equivalent to [0-9]; by default, str patterns also match digits from other scripts.
  • \D: Matches any non-digit character, the complement of \d.
  • \w: Matches any word character: letters, digits, and the underscore. With the re.ASCII flag this is equivalent to [a-zA-Z0-9_].
  • \W: Matches any non-word character, the complement of \w.
  • \s: Matches any whitespace character. With the re.ASCII flag this is equivalent to [ \t\n\r\f\v].
  • \S: Matches any non-whitespace character, the complement of \s.
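
The character classes above can be tried out on a short string. This is a minimal sketch with an invented sample; the last two lines illustrate the difference the re.ASCII flag makes for \d, using the Arabic-Indic digit three (U+0663) as an assumed test character.

```python
import re

text = "Room 42B,\tfloor_3 \n"

digits = re.findall(r"\d+", text)              # runs of digits
words = re.findall(r"\w+", text)               # runs of word characters
non_space = re.findall(r"\S+", text)           # runs of non-whitespace characters
collapsed = re.sub(r"\s+", " ", text).strip()  # squeeze each whitespace run to one space

print(digits)     # ['42', '3']
print(words)      # ['Room', '42B', 'floor_3']
print(non_space)  # ['Room', '42B,', 'floor_3']
print(collapsed)  # Room 42B, floor_3

# By default \d also accepts non-ASCII digits in str patterns; re.ASCII restricts it to 0-9.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_only = re.fullmatch(r"\d", "\u0663", re.ASCII) is not None
print(unicode_digit, ascii_only)  # True False
```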

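Repetition in regular expressions is used to specify how many times a particular pattern or group of characters should occur. The examples below describe which whole strings a pattern matches, as re.fullmatch() would report; re.search() can still find a shorter match inside a longer string. There are several repetition cases that can be used in regular expressions: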

  • The asterisk (*) matches zero or more occurrences of the preceding character or group. For example, the regular expression "ab*" would match "a", "ab", "abb", "abbb", and so on.
  • The plus sign (+) matches one or more occurrences of the preceding character or group. For example, the regular expression "ab+" would match "ab", "abb", "abbb", and so on, but would not match "a".
  • The question mark (?) matches zero or one occurrence of the preceding character or group. For example, the regular expression "ab?" would match "a" and "ab", but would not match "abb".
  • The curly braces ({m}) can be used to specify an exact number of occurrences. For example, the regular expression "a{3}" would match "aaa", but would not match "aa" or "aaaa".
  • The curly braces with a range ({m,n}) can be used to specify a range of occurrences. For example, the regular expression "a{2,4}" would match "aa", "aaa", and "aaaa", but would not match "a" or "aaaaa".
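  • The caret (^) is an anchor rather than a repetition operator, but it is often combined with one. It matches the beginning of the string, and the beginning of each line when re.MULTILINE is set. For example, the regular expression "^a" would match "a" at the beginning of a string.
  • The dollar sign ($) is likewise an anchor. It matches the end of the string, and the end of each line when re.MULTILINE is set. For example, the regular expression "a$" would match "a" at the end of a string.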

These repetition cases can be combined with other regular expression elements to create powerful and flexible patterns for matching text.
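
The repetition examples above can be checked with re.fullmatch(), which succeeds only when the pattern covers the entire string. The helper matches_whole below is a hypothetical name introduced for this sketch, not part of the re module.

```python
import re

def matches_whole(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["a", "ab", "abb", "abbb"]

star = matches_whole(r"ab*", candidates)      # zero or more 'b'
plus = matches_whole(r"ab+", candidates)      # one or more 'b'
optional = matches_whole(r"ab?", candidates)  # zero or one 'b'
exact = matches_whole(r"a{3}", ["aa", "aaa", "aaaa"])
ranged = matches_whole(r"a{2,4}", ["a", "aa", "aaa", "aaaa", "aaaaa"])

print(star)      # ['a', 'ab', 'abb', 'abbb']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['a', 'ab']
print(exact)     # ['aaa']
print(ranged)    # ['aa', 'aaa', 'aaaa']

# re.search() only needs a match somewhere, so it finds 'aaa' inside 'aaaa'.
partial = re.search(r"a{3}", "aaaa").group()
print(partial)   # aaa
```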

Grouping with parentheses is a powerful feature of regular expressions that allows you to capture and manipulate sub-patterns within a larger pattern. When you enclose a sub-pattern within parentheses, it becomes a group.

Here are a few examples of how grouping with parentheses works in Python regular expressions:

Matching repeated substrings: Suppose you want to match a string that contains two adjacent repeated substrings, such as "hellohello" or "worldworld". You can use parentheses to group the first substring and then reference it with a backreference, like this:

import re
pattern = r'(\w+)\1'
text = 'hellohello worldworld'
matches = re.findall(pattern, text)
print(matches)  # Output: ['hello', 'world']

The \w+ pattern matches one or more word characters (letters, digits, or underscores), and the (\w+) parentheses group captures the first substring. The \1 backreference then matches the same sequence of characters as the first group.
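
Backreferences are also usable in the replacement string of re.sub(). As a small sketch on an invented sentence, the pattern below finds a word that is immediately repeated and keeps only one copy. The \b word boundaries are an added assumption of this example: they stop the pattern from matching the "is" inside "This".

```python
import re

text = "This is is a test of the the parser."

# \b(\w+) captures a whole word; \s+\1\b requires the same word to follow it.
doubled = re.compile(r"\b(\w+)\s+\1\b")

found = doubled.findall(text)     # the captured word of each doubled pair
fixed = doubled.sub(r"\1", text)  # keep one copy of each doubled word

print(found)  # ['is', 'the']
print(fixed)  # This is a test of the parser.
```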

Extracting substrings: You can also use parentheses to extract substrings from a larger string. For example, suppose you have a string that contains a person's last name, a comma, the first and middle names, and an email address in parentheses:

import re
# [^)]+ captures everything up to the closing parenthesis, since \w+ cannot match '@' or '.'
pattern = r'(\w+), (\w+) (\w+) \(([^)]+)\)'
text = 'Doe, John Smith ([email protected])'
match = re.search(pattern, text)
if match:
    last_name = match.group(1)
    first_name = match.group(2)
    middle_name = match.group(3)
    email = match.group(4)
    print(last_name, first_name, middle_name, email)

The pattern (\w+), (\w+) (\w+) \(([^)]+)\) matches a last name followed by a comma, a first name, a middle name, and an email address enclosed in parentheses. The escaped \( and \) match literal parentheses, while the unescaped ones form the four capturing groups, which are then accessed using the group() method of the match object.
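
Numbered groups become hard to follow as a pattern grows, so Python also supports named groups written as (?P<name>...). The sketch below is an assumption-laden illustration: the record string, the group names, and the tag in parentheses are all hypothetical.

```python
import re

# Hypothetical record layout, chosen only for illustration.
record = "Doe, John Smith (id-1042)"

pattern = re.compile(
    r"(?P<last>\w+), (?P<first>\w+) (?P<middle>\w+) \((?P<tag>[^)]+)\)"
)

match = pattern.search(record)
if match:
    all_groups = match.groups()    # every group as a tuple, in order
    last_name = match.group("last")  # look a group up by name
    fields = match.groupdict()     # mapping of group name to captured text
    print(all_groups)  # ('Doe', 'John', 'Smith', 'id-1042')
    print(last_name)   # Doe
    print(fields["tag"])  # id-1042
```

Named groups can still be accessed by number, so match.group(1) and match.group("last") return the same text here.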

Alternation: You can use the | operator to match one of two alternative patterns. For example, suppose you want to match a string that contains a URL starting with either "http://" or "https://". You can use parentheses to group the two alternatives so that the | operator applies only to them and not to the rest of the pattern:

import re
pattern = r'(http|https)://\w+\.\w+'
text = 'Visit our website at https://example.com'
match = re.search(pattern, text)
if match:
    protocol = match.group(1)
    print(protocol)

The pattern (http|https)://\w+\.\w+ matches either "http://" or "https://" followed by a simple domain name. The parentheses group the two alternatives, and the | operator specifies the choice between them. Calling match.group(1) returns the alternative that matched, which is "https" in this example.
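
One consequence of capturing groups is worth seeing in code: when a pattern contains a group, re.findall() returns the group's text rather than the whole match. A non-capturing group, written (?:...), groups without capturing. The sketch below uses an invented sentence with reserved example domains.

```python
import re

text = "Mirrors: http://example.org and https://example.com"

# With a capturing group, findall returns only what the group captured.
schemes = re.findall(r"(http|https)://\w+\.\w+", text)

# With a non-capturing group, findall returns each whole match.
urls = re.findall(r"(?:http|https)://\w+\.\w+", text)

# An optional 's' is a shorter way to write the same choice.
urls_short = re.findall(r"https?://\w+\.\w+", text)

print(schemes)     # ['http', 'https']
print(urls)        # ['http://example.org', 'https://example.com']
print(urls_short)  # ['http://example.org', 'https://example.com']
```

These patterns are deliberately simple and only handle a two-part host name; they are not a general URL matcher.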