Python Pandas: How to Drop DataFrame Columns if Name Contains String (Regex/str.contains)
When working with Pandas DataFrames, especially those with many columns or dynamically generated column names, you often need to drop columns based on patterns in their names. A common requirement is to remove all columns whose names contain a specific substring or match a regular expression.
This guide explains how to effectively drop Pandas DataFrame columns if their names contain a given string or match a regex, using DataFrame.filter() with regex and Index.str.contains().
The Goal: Dropping Columns by Name Patterns
Given a Pandas DataFrame, we want to identify and remove columns where the column's name (header) includes a specific substring (e.g., "temp", "_calculated") or matches a more complex regular expression pattern.
Example DataFrame
import pandas as pd
import re # For re.IGNORECASE if needed
data = {
'Record_ID': [101, 102, 103],
'Patient_Name': ['Alice', 'Bob', 'Charlie'],
'Age_Years': [30, 45, 28],
'Vital_Temp_C': [36.5, 37.0, 36.8],
'Vital_Pressure_Systolic': [120, 130, 115],
'Notes_General': ['Stable', 'Observation', 'Stable'],
'Lab_Test_Result_A': [1.5, 2.1, 0.9],
'Lab_Test_Flag_A': [False, True, False]
}
df_original = pd.DataFrame(data)
print("Original DataFrame:")
print(df_original)
Output:
Original DataFrame:
Record_ID Patient_Name Age_Years Vital_Temp_C Vital_Pressure_Systolic \
0 101 Alice 30 36.5 120
1 102 Bob 45 37.0 130
2 103 Charlie 28 36.8 115
Notes_General Lab_Test_Result_A Lab_Test_Flag_A
0 Stable 1.5 False
1 Observation 2.1 True
2 Stable 0.9 False
Let's say we want to drop columns related to "Vital" signs or those containing "Test".
Method 1: Using DataFrame.filter(regex=...) and DataFrame.drop() (Recommended)
The DataFrame.filter() method can subset rows or columns based on labels. When used with the regex parameter and axis=1 (for columns), it selects columns whose names match the regular expression.
Dropping Columns Containing a Single String
To drop all columns whose names contain the substring "Vital":
import pandas as pd
df = pd.DataFrame({
'Record_ID': [101, 102, 103], 'Patient_Name': ['Alice', 'Bob', 'Charlie'],
'Age_Years': [30, 45, 28], 'Vital_Temp_C': [36.5, 37.0, 36.8],
'Vital_Pressure_Systolic': [120, 130, 115], 'Notes_General': ['Stable', 'Observation', 'Stable']
})
# Step 1: Filter to get the names of columns containing 'Vital'
cols_to_drop_filter = df.filter(regex='Vital', axis=1).columns
print(f"Columns to drop (containing 'Vital'): {list(cols_to_drop_filter)}")
# Step 2: Drop these columns
# df.drop(columns=cols_to_drop_filter, inplace=True) # Modifies df directly
df_dropped_vital = df.drop(columns=cols_to_drop_filter) # Returns a new DataFrame
print("DataFrame after dropping columns containing 'Vital':")
print(df_dropped_vital)
Output:
Columns to drop (containing 'Vital'): ['Vital_Temp_C', 'Vital_Pressure_Systolic']
DataFrame after dropping columns containing 'Vital':
Record_ID Patient_Name Age_Years Notes_General
0 101 Alice 30 Stable
1 102 Bob 45 Observation
2 103 Charlie 28 Stable
- df.filter(regex='Vital', axis=1): selects a DataFrame containing only the columns whose names match the regex 'Vital' (which acts as a substring search here).
- .columns: gets the column names from this filtered DataFrame.
- df.drop(columns=...): drops the specified columns. Note that columns=... already implies the column axis; pass axis=1 only when supplying labels positionally, as in df.drop(labels, axis=1). Use inplace=True to modify the original DataFrame directly.
- You can do this in one line: df.drop(list(df.filter(regex='Vital', axis=1)), axis=1, inplace=True)
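Because filter(regex=...) takes a full regular expression, you are not limited to plain substring searches: anchors let you match only the start or end of a column name. A small sketch (using a trimmed-down version of the example DataFrame):

```python
import pandas as pd

df = pd.DataFrame({
    'Record_ID': [101], 'Vital_Temp_C': [36.5],
    'Vital_Pressure_Systolic': [120], 'Notes_General': ['Stable']
})

# '^Vital' matches only column names that START with 'Vital'
df_no_vital = df.drop(columns=df.filter(regex='^Vital', axis=1).columns)
print(list(df_no_vital.columns))  # ['Record_ID', 'Notes_General']

# '_C$' matches only column names that END with '_C'
df_no_c = df.drop(columns=df.filter(regex='_C$', axis=1).columns)
print(list(df_no_c.columns))  # ['Record_ID', 'Vital_Pressure_Systolic', 'Notes_General']
```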
Dropping Columns Containing One of Multiple Strings (OR Logic)
Use the pipe | character in your regex to specify OR logic. To drop columns containing "Vital" OR "Lab":
import pandas as pd
df = pd.DataFrame({
'Record_ID': [101], 'Patient_Name': ['Alice'], 'Age_Years': [30],
'Vital_Temp_C': [36.5], 'Vital_Pressure_Systolic': [120],
'Lab_Test_Result_A': [1.5], 'Lab_Test_Flag_A': [False], 'Notes_General': ['Stable']
})
regex_pattern_multi = 'Vital|Lab' # Match 'Vital' OR 'Lab'
cols_to_drop_multi = df.filter(regex=regex_pattern_multi, axis=1).columns
print(f"Columns to drop (containing 'Vital' or 'Lab'): {list(cols_to_drop_multi)}\n")
df_dropped_multi = df.drop(columns=cols_to_drop_multi)
print("DataFrame after dropping columns containing 'Vital' or 'Lab':")
print(df_dropped_multi)
Output:
Columns to drop (containing 'Vital' or 'Lab'): ['Vital_Temp_C', 'Vital_Pressure_Systolic', 'Lab_Test_Result_A', 'Lab_Test_Flag_A']
DataFrame after dropping columns containing 'Vital' or 'Lab':
Record_ID Patient_Name Age_Years Notes_General
0 101 Alice 30 Stable
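If the substrings to remove live in a list (for example, one built dynamically), you can assemble the OR pattern programmatically. The list name bad_substrings below is hypothetical; re.escape guards against regex metacharacters appearing in a substring:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Record_ID': [101], 'Vital_Temp_C': [36.5],
    'Lab_Test_Result_A': [1.5], 'Notes_General': ['Stable']
})

# Hypothetical list of substrings to remove; re.escape protects against
# regex metacharacters (e.g. '(' or '+') inside a substring.
bad_substrings = ['Vital', 'Lab']
pattern = '|'.join(re.escape(s) for s in bad_substrings)  # 'Vital|Lab'

df_cleaned = df.drop(columns=df.filter(regex=pattern, axis=1).columns)
print(list(df_cleaned.columns))  # ['Record_ID', 'Notes_General']
```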
Method 2: Using Index.str.contains() and DataFrame.loc or drop()
This method involves creating a boolean mask from the DataFrame's column names.
Creating a Boolean Mask for Column Names
The df.columns.str.contains(pattern) method returns a boolean array indicating which column names contain the pattern.
import pandas as pd
df = pd.DataFrame({
'Record_ID': [101], 'Patient_Name': ['Alice'], 'Age_Years': [30],
'Vital_Temp_C': [36.5], 'Vital_Pressure_Systolic': [120], 'Notes_General': ['Stable']
})
substring_to_find = 'name' # Case-sensitive search for 'name'
# Create a boolean mask for column names containing the substring
cols_contain_name_mask = df.columns.str.contains(substring_to_find)
print(f"Boolean mask for columns containing '{substring_to_find}':")
print(cols_contain_name_mask)
Output:
Boolean mask for columns containing 'name':
[False False False False False False]
Selecting Columns to Keep (Inverting the Logic)
To drop columns matching the pattern, you select the columns that do not match. Use the tilde (~) operator to invert the boolean mask.
import pandas as pd
df = pd.DataFrame({
'Record_ID': [101], 'Patient_Name': ['Alice'], 'Age_Years': [30],
'Vital_Temp_C': [36.5], 'Vital_Pressure_Systolic': [120], 'Notes_General': ['Stable']
})
substring_to_find = 'Name' # Example: to match 'Patient_Name'
cols_contain_name_mask = df.columns.str.contains(substring_to_find)
# ✅ Select columns to KEEP (where mask is False, i.e., name does NOT contain 'Name')
df_kept_cols = df.loc[:, ~cols_contain_name_mask]
print(f"DataFrame after keeping columns NOT containing '{substring_to_find}':")
print(df_kept_cols)
Output (Patient_Name column is dropped):
DataFrame after keeping columns NOT containing 'Name':
   Record_ID  Age_Years  Vital_Temp_C  Vital_Pressure_Systolic Notes_General
0        101         30          36.5                      120        Stable
Dropping Columns Based on the Mask
Alternatively, get the names of the columns to drop and use df.drop().
import pandas as pd
df = pd.DataFrame({
'Record_ID': [101], 'Patient_Name': ['Alice'], 'Age_Years': [30],
'Vital_Temp_C': [36.5], 'Vital_Pressure_Systolic': [120], 'Notes_General': ['Stable']
})
substring_to_find = 'Name'
cols_contain_name_mask = df.columns.str.contains(substring_to_find)
# Get actual column names to drop
cols_to_drop_str_contains = df.columns[cols_contain_name_mask]
print(f"Columns to drop using str.contains: {cols_to_drop_str_contains.tolist()}\n")
df_dropped_str_contains = df.drop(columns=cols_to_drop_str_contains)
print(f"DataFrame after dropping columns containing '{substring_to_find}' (using str.contains then drop):")
print(df_dropped_str_contains)
Output (same result as selecting the kept columns with ~mask):
Columns to drop using str.contains: ['Patient_Name']
DataFrame after dropping columns containing 'Name' (using str.contains then drop):
Record_ID Age_Years Vital_Temp_C Vital_Pressure_Systolic Notes_General
0 101 30 36.5 120 Stable
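Note that str.contains() also interprets its pattern as a regular expression by default, so the same OR logic from Method 1 works here too. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Record_ID': [101], 'Vital_Temp_C': [36.5],
    'Lab_Test_Result_A': [1.5], 'Notes_General': ['Stable']
})

# str.contains() treats the pattern as a regex by default,
# so '|' gives OR logic just like with filter(regex=...)
mask = df.columns.str.contains('Vital|Lab')
df_kept = df.loc[:, ~mask]
print(list(df_kept.columns))  # ['Record_ID', 'Notes_General']
```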
Case-Insensitive Matching with str.contains()
Set the case=False parameter in str.contains(). You can also use flags=re.IGNORECASE.
import pandas as pd
import re # For re.IGNORECASE
df = pd.DataFrame({
'Record_ID': [101], 'Patient_Name': ['Alice'], 'Age_Years': [30], 'Notes_General': ['Stable']
})
substring_lower = 'name' # Search for 'name' case-insensitively
# ✅ Case-insensitive mask
cols_contain_name_case_insensitive_mask = df.columns.str.contains(substring_lower, case=False)
# OR: cols_contain_name_case_insensitive_mask = df.columns.str.contains(substring_lower, flags=re.IGNORECASE)
print(f"Boolean mask for columns containing '{substring_lower}' (case-insensitive):")
print(cols_contain_name_case_insensitive_mask)
df_kept_case_insensitive = df.loc[:, ~cols_contain_name_case_insensitive_mask]
print("DataFrame after case-insensitive drop:")
print(df_kept_case_insensitive)
Output:
Boolean mask for columns containing 'name' (case-insensitive):
[False True False False]
DataFrame after case-insensitive drop:
Record_ID Age_Years Notes_General
0 101 30 Stable
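One caveat: because str.contains() defaults to regex matching, column names or search strings containing regex metacharacters (parentheses, brackets, dots) can behave unexpectedly. Passing regex=False makes the search a plain literal substring match. The column name 'Temp (C)' below is a hypothetical example:

```python
import pandas as pd

# Hypothetical column name containing regex metacharacters
df = pd.DataFrame({'Record_ID': [101], 'Temp (C)': [36.5], 'Notes': ['Stable']})

# regex=False treats the pattern as a literal string, so '(' needs no escaping
mask = df.columns.str.contains('(C)', regex=False)
df_kept = df.loc[:, ~mask]
print(list(df_kept.columns))  # ['Record_ID', 'Notes']
```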
Alternative: Keeping Only Columns Whose Names Contain a String
If your goal is the inverse – to keep only the columns whose names contain a certain string – DataFrame.filter(like=..., axis=1)
or DataFrame.filter(regex=..., axis=1)
directly achieves this without needing drop()
.
import pandas as pd
df = pd.DataFrame({
'Record_ID': [101], 'Patient_Name': ['Alice'], 'Age_Years': [30],
'Vital_Temp_C': [36.5], 'Vital_Pressure_Systolic': [120], 'Notes_General': ['Stable']
})
# ✅ Keep only columns whose names contain 'Vital'
df_only_vital_cols = df.filter(like='Vital', axis=1)
# Or using regex: df_only_vital_cols = df.filter(regex='Vital', axis=1)
print("DataFrame with ONLY columns containing 'Vital':")
print(df_only_vital_cols)
Output:
DataFrame with ONLY columns containing 'Vital':
Vital_Temp_C Vital_Pressure_Systolic
0 36.5 120
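The difference between like= and regex= mirrors the regex=False caveat above: like= performs a plain substring match, while regex= interprets the pattern. A quick sketch (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'Temp (C)': [36.5], 'Temp_F': [97.7], 'ID': [1]})

# like= is a literal substring match: '(C)' finds the column 'Temp (C)'
print(list(df.filter(like='(C)', axis=1).columns))   # ['Temp (C)']

# regex= interprets the pattern as a regular expression
print(list(df.filter(regex='Temp', axis=1).columns))  # ['Temp (C)', 'Temp_F']
```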
Conclusion
To drop Pandas DataFrame columns based on whether their names contain a specific string or match a regex:
- Using df.filter(regex=pattern, axis=1) then df.drop(columns=...):
  - Filter to get the list of column names matching your pattern (which can be a simple substring or a complex regex).
  - Pass this list of names to df.drop(columns=...). This is generally clean and leverages regex power effectively.
- Using df.columns.str.contains(pattern) then df.loc[:, ~mask] or df.drop(columns=...):
  - Create a boolean mask from the column names using str.contains().
  - Either select the columns to keep using df.loc[:, ~mask] (inverting the mask), or get the names of the columns to drop with df.columns[mask] and pass them to df.drop().
  - str.contains() has case and flags parameters for case-insensitivity.

Choose the method that feels most readable and maintainable for your use case. The filter(regex=...) approach combined with drop() is often very expressive for this task.