Skip to main content

Python Pandas: Filter DataFrame Rows Using Regular Expressions (Regex)

Filtering rows in a Pandas DataFrame based on complex string patterns is a common data manipulation task. While simple substring checks can be done with methods like str.startswith() or str.endswith(), regular expressions (regex) offer much more power and flexibility for matching intricate patterns within your string data.

This guide explains how to use the Series.str.contains() method with regular expressions to filter DataFrame rows, including case-insensitive matching and negation.

Why Use Regex for Filtering Pandas Rows?

Regular expressions allow you to define sophisticated search patterns that go beyond simple literal string matching. You can:

  • Match specific character sets (e.g., digits, letters, whitespace).
  • Define positions (start/end of string).
  • Specify repetitions (e.g., one or more occurrences).
  • Use OR conditions within a pattern.
  • Capture groups.

This makes regex invaluable when your filtering criteria are complex.

Example DataFrame:

import pandas as pd
import re # For re.IGNORECASE flag

data = {
'ID': ['X101', 'Y203A', 'X105B', 'Z300', 'Y203C'],
'Product Name': ['Alpha Widget', 'Beta Gadget', 'alpha gizmo', 'Gamma Unit', 'Delta Device'],
'Description': ['High-quality alpha series', 'Advanced beta model with extras', 'Compact gizmo for Alphas', 'Standard gamma component', 'Next-gen delta system']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
ID Product Name Description
0 X101 Alpha Widget High-quality alpha series
1 Y203A Beta Gadget Advanced beta model with extras
2 X105B alpha gizmo Compact gizmo for Alphas
3 Z300 Gamma Unit Standard gamma component
4 Y203C Delta Device Next-gen delta system

The Core Method: Series.str.contains(regex_pattern)

To filter rows using regex, you apply the .str.contains() method to the specific Pandas Series (column) you want to test.

  1. Access the column: df['YourColumnName'].
  2. Use the .str accessor to use string methods.
  3. Call .contains(pattern, case=True, flags=0, na=None, regex=True).
    • pattern: Your regular expression string.
    • case: Boolean, default True (case-sensitive). Set to False for case-insensitive matching.
    • flags: Integer, regex compilation flags (e.g., re.IGNORECASE).
    • na: Value to return for missing (NaN) values in the Series.
    • regex: Boolean, default True. If False, pattern is treated as a literal string.

This method returns a boolean Series (True where the pattern matches, False otherwise). You then use this boolean Series to index the original DataFrame.

Filtering Rows Where a Column Matches a Regex

Example: Rows Starting With a Pattern

Let's select rows where the 'Product Name' starts with "Alpha".

import pandas as pd

df = pd.DataFrame({
'ID': ['X101', 'Y203A', 'X105B', 'Z300', 'Y203C'],
'Product Name': ['Alpha Widget', 'Beta Gadget', 'alpha gizmo', 'Gamma Unit', 'Delta Device'],
'Description': ['High-quality alpha series', 'Advanced beta model with extras', 'Compact gizmo for Alphas', 'Standard gamma component', 'Next-gen delta system']
})


# Regex pattern: '^' anchors to the start of the string
regex_starts_with_Alpha = r'^Alpha'

# Apply .str.contains() to get a boolean mask
mask_starts_with_Alpha = df['Product Name'].str.contains(regex_starts_with_Alpha)
print("Boolean mask for 'starts with Alpha':")
print(mask_starts_with_Alpha)

# ✅ Use the mask to filter the DataFrame
df_starts_with_Alpha = df[mask_starts_with_Alpha]

print("Rows where 'Product Name' starts with 'Alpha':")
print(df_starts_with_Alpha)

Output:

Boolean mask for 'starts with Alpha':
0 True
1 False
2 False
3 False
4 False
Name: Product Name, dtype: bool
Rows where 'Product Name' starts with 'Alpha':
ID Product Name Description
0 X101 Alpha Widget High-quality alpha series

Case-Insensitive Matching (case=False or flags=re.IGNORECASE)

To match "Alpha" or "alpha":

  • Using case=False:

    import pandas as pd

    df = pd.DataFrame({
    'ID': ['X101', 'Y203A', 'X105B', 'Z300', 'Y203C'],
    'Product Name': ['Alpha Widget', 'Beta Gadget', 'alpha gizmo', 'Gamma Unit', 'Delta Device'],
    'Description': ['High-quality alpha series', 'Advanced beta model with extras', 'Compact gizmo for Alphas', 'Standard gamma component', 'Next-gen delta system']
    })

    regex_starts_with_alpha_case_insensitive = r'^alpha' # Pattern is lowercase

    df_starts_alpha_case_false = df[df['Product Name'].str.contains(
    regex_starts_with_alpha_case_insensitive, case=False
    )]
    print("Rows where 'Product Name' starts with 'alpha' (case=False):")
    print(df_starts_alpha_case_false)

    Output:

    Rows where 'Product Name' starts with 'alpha' (case=False):
    ID Product Name Description
    0 X101 Alpha Widget High-quality alpha series
    2 X105B alpha gizmo Compact gizmo for Alphas
  • Using flags=re.IGNORECASE: (Requires import re)

    import pandas as pd
    import re # Import the re module for flags

    df = pd.DataFrame({
    'ID': ['X101', 'Y203A', 'X105B', 'Z300', 'Y203C'],
    'Product Name': ['Alpha Widget', 'Beta Gadget', 'alpha gizmo', 'Gamma Unit', 'Delta Device'],
    'Description': ['High-quality alpha series', 'Advanced beta model with extras', 'Compact gizmo for Alphas', 'Standard gamma component', 'Next-gen delta system']
    })

    regex_starts_with_alpha_any_case = r'^alpha' # Pattern can be any case here too

    df_starts_alpha_re_flag = df[df['Product Name'].str.contains(
    regex_starts_with_alpha_any_case, flags=re.IGNORECASE
    )]
    print("Rows where 'Product Name' starts with 'alpha' (flags=re.IGNORECASE):")
    print(df_starts_alpha_re_flag)

    Output: (Same as with case=False)

    Rows where 'Product Name' starts with 'alpha' (flags=re.IGNORECASE):
    ID Product Name Description
    0 X101 Alpha Widget High-quality alpha series
    2 X105B alpha gizmo Compact gizmo for Alphas
note

Both case=False and flags=re.IGNORECASE achieve case-insensitivity when regex=True (the default).

Filtering Rows Where a Column Does NOT Match a Regex (Negation)

To select rows that do not match the regex pattern, apply the tilde operator (~) to the boolean mask generated by str.contains(). This inverts the boolean Series.

import pandas as pd

df = pd.DataFrame({
'ID': ['X101', 'Y203A', 'X105B', 'Z300', 'Y203C'],
'Product Name': ['Alpha Widget', 'Beta Gadget', 'alpha gizmo', 'Gamma Unit', 'Delta Device'],
'Description': ['High-quality alpha series', 'Advanced beta model with extras', 'Compact gizmo for Alphas', 'Standard gamma component', 'Next-gen delta system']
})

regex_starts_with_Alpha = r'^Alpha'
mask_starts_with_Alpha = df['Product Name'].str.contains(regex_starts_with_Alpha)

# ✅ Use ~ to get rows NOT starting with 'Alpha'
df_not_starts_with_Alpha = df[~mask_starts_with_Alpha]

print("Rows where 'Product Name' does NOT start with 'Alpha':")
print(df_not_starts_with_Alpha)

Output:

Rows where 'Product Name' does NOT start with 'Alpha':
ID Product Name Description
1 Y203A Beta Gadget Advanced beta model with extras
2 X105B alpha gizmo Compact gizmo for Alphas
3 Z300 Gamma Unit Standard gamma component
4 Y203C Delta Device Next-gen delta system

Alternative: Using Built-in String Methods (When Regex is Overkill)

For simple checks like "starts with," "ends with," or "contains literal substring," using dedicated Pandas string methods can be more readable and potentially slightly more performant than a general regex.

import pandas as pd

df = pd.DataFrame({
'ID': ['X101', 'Y203A', 'X105B', 'Z300', 'Y203C'],
'Product Name': ['Alpha Widget', 'Beta Gadget', 'alpha gizmo', 'Gamma Unit', 'Delta Device'],
'Description': ['High-quality alpha series', 'Advanced beta model with extras', 'Compact gizmo for Alphas', 'Standard gamma component', 'Next-gen delta system']
})

# Using str.startswith() - case-sensitive by default
df_startswith_builtin = df[df['Product Name'].str.startswith('Alpha')]
print("Using str.startswith('Alpha'):")
print(df_startswith_builtin)
print()

# Using str.endswith() - case-sensitive by default
# Let's find descriptions ending with "extras"
df_endswith_builtin = df[df['Description'].str.endswith('extras')]
print("Using str.endswith('extras'):")
print(df_endswith_builtin)

Output:

Using str.startswith('Alpha'):
ID Product Name Description
0 X101 Alpha Widget High-quality alpha series

Using str.endswith('extras'):
ID Product Name Description
1 Y203A Beta Gadget Advanced beta model with extras
note
  • Use these simpler methods when your pattern doesn't require the full power of regex.
  • For case-insensitive with built-ins, often convert column to lower/upper first:
    df_startswith_alpha_case_insensitive_builtin = df[df['Product Name'].str.lower().str.startswith('alpha')]
    print("\nUsing str.lower().str.startswith('alpha'):")
    print(df_startswith_alpha_case_insensitive_builtin)

Specific Regex Use Cases

Selecting Rows Containing a Substring Pattern

To find rows where 'Description' contains "model" or "Model":

import pandas as pd
import re

df = pd.DataFrame({
'ID': ['X101', 'Y203A', 'X105B', 'Z300', 'Y203C'],
'Product Name': ['Alpha Widget', 'Beta Gadget', 'alpha gizmo', 'Gamma Unit', 'Delta Device'],
'Description': ['High-quality alpha series', 'Advanced beta model with extras', 'Compact gizmo for Alphas', 'Standard gamma component', 'Next-gen delta system']
})

regex_contains_model = r'model' # Simple regex, effectively a substring search
# For regex, . means any char, * means 0 or more occurrences.
# So r'.*model.*' would also work, but .str.contains is substring by default.

df_contains_model = df[df['Description'].str.contains(regex_contains_model, flags=re.IGNORECASE)]
print("Rows where 'Description' contains 'model' (case-insensitive):")
print(df_contains_model)

Output:

Rows where 'Description' contains 'model' (case-insensitive):
ID Product Name Description
1 Y203A Beta Gadget Advanced beta model with extras

Selecting Rows Ending With a Substring Pattern

To find rows where 'ID' ends with a letter followed by a number (e.g., 'A1', 'B2'):

import pandas as pd

df = pd.DataFrame({
'ID': ['X101', 'Y203A', 'X105B', 'Z300', 'Y203C'],
'Product Name': ['Alpha Widget', 'Beta Gadget', 'alpha gizmo', 'Gamma Unit', 'Delta Device'],
'Description': ['High-quality alpha series', 'Advanced beta model with extras', 'Compact gizmo for Alphas', 'Standard gamma component', 'Next-gen delta system']
})

regex_ends_letter_num = r'[A-Za-z]\d$' # [A-Za-z] for one letter, \d for one digit, $ for end
df_ends_letter_num = df[df['ID'].str.contains(regex_ends_letter_num)]
print("Rows where 'ID' ends with LetterNumber:")
print(df_ends_letter_num)

Output:

Rows where 'ID' ends with LetterNumber:
Empty DataFrame
Columns: [ID, Product Name, Description]
Index: []

Conclusion

Filtering Pandas DataFrame rows using regular expressions is a powerful technique for selecting data based on complex string patterns.

  • The primary method is df['ColumnName'].str.contains(your_regex_pattern).
  • This returns a boolean Series, which is then used for indexing the DataFrame: df[boolean_mask].
  • Use the case=False parameter or flags=re.IGNORECASE (after import re) within str.contains() for case-insensitive matching.
  • Use the tilde operator (~) to negate the boolean mask and select rows that do not match the regex.
  • For simple "starts with," "ends with," or literal substring checks, Pandas built-in string methods like str.startswith(), str.endswith(), or str.contains(literal_string, regex=False) can be more direct and readable alternatives.

By mastering str.contains() with regex, you gain significant control over how you filter and subset your textual data in Pandas.