Python Pandas: How to Select DataFrame Columns Starting With a Specific String
When working with Pandas DataFrames, especially those with many columns or dynamically generated names, you often need to select a subset of columns based on a naming pattern. A common requirement is to select all columns whose names begin with a particular prefix or string.
This guide explains several effective methods to select Pandas DataFrame columns that start with a given string, using list comprehensions, Index.str.startswith()
, and DataFrame.filter(regex=...)
.
The Goal: Selecting Columns by Prefix
Given a Pandas DataFrame, we want to create a new DataFrame (or a view) that contains only those columns whose names (headers) begin with a specified string pattern. For example, selecting all columns that start with "Sales_" or "User_".
Example DataFrame
import pandas as pd
data = {
'CustomerID': [101, 102, 103, 104],
'CustomerName': ['Alice WidgetCorp', 'Bob Solutions Ltd', 'Charlie Innovations', 'Diana Services Inc'],
'Region': ['North', 'South', 'North', 'West'],
'Sales_Q1': [1500, 2200, 1800, 2500],
'Sales_Q2': [1700, 2000, 1900, 2800],
'Support_Tickets': [2, 1, 3, 1],
'sales_rep': ['John', 'Jane', 'Mike', 'Anna'] # Lowercase 'sales'
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
CustomerID CustomerName Region Sales_Q1 Sales_Q2 \
0 101 Alice WidgetCorp North 1500 1700
1 102 Bob Solutions Ltd South 2200 2000
2 103 Charlie Innovations North 1800 1900
3 104 Diana Services Inc West 2500 2800
Support_Tickets sales_rep
0 2 John
1 1 Jane
2 3 Mike
3 1 Anna
Let's say we want to select all columns starting with "Customer" or "Sales".
Method 1: Using List Comprehension with str.startswith()
(Pythonic & Clear)
This approach iterates through the column names and uses the built-in string method startswith()
.
Case-Sensitive Selection
import pandas as pd
df_example = pd.DataFrame({
'CustomerID': [101], 'CustomerName': ['Alice'], 'Region': ['North'],
'Sales_Q1': [1500], 'Sales_Q2': [1700], 'Support_Tickets': [2]
})
prefix_to_match = 'Sales'
# ✅ Create a list of column names that start with the prefix
selected_column_names_lc = [
col_name for col_name in df_example.columns
if col_name.startswith(prefix_to_match)
]
print(f"Column names starting with '{prefix_to_match}': {selected_column_names_lc}\n")
# Use this list to select the columns from the DataFrame
df_selected_lc = df_example[selected_column_names_lc]
print(f"DataFrame with columns starting with '{prefix_to_match}':")
print(df_selected_lc)
Output:
Column names starting with 'Sales': ['Sales_Q1', 'Sales_Q2']
DataFrame with columns starting with 'Sales':
Sales_Q1 Sales_Q2
0 1500 1700
Case-Insensitive Selection
To perform a case-insensitive match, convert both the column name and the prefix to a consistent case (e.g., lowercase) before comparison.
import pandas as pd
df_example = pd.DataFrame({
'Sales_Q1': [1500], 'sales_rep': ['John'], 'CustomerID': [101]
})
prefix_lower = 'sales' # Search for 'sales' case-insensitively
selected_column_names_lc_ci = [
col_name for col_name in df_example.columns
if col_name.lower().startswith(prefix_lower)
]
print(f"Column names starting with '{prefix_lower}' (case-insensitive): {selected_column_names_lc_ci}\n")
df_selected_lc_ci = df_example[selected_column_names_lc_ci]
print(f"DataFrame with columns starting with '{prefix_lower}' (case-insensitive):")
print(df_selected_lc_ci)
Output:
Column names starting with 'sales' (case-insensitive): ['Sales_Q1', 'sales_rep']
DataFrame with columns starting with 'sales' (case-insensitive):
Sales_Q1 sales_rep
0 1500 John
Method 2: Using df.columns.str.startswith()
with df.loc
or Direct Indexing (Vectorized)
Pandas Index
objects (like df.columns
) have a .str
accessor that provides vectorized string methods.
Creating a Boolean Mask
df.columns.str.startswith(prefix)
returns a boolean array indicating which column names start with the prefix.
import pandas as pd
df_example = pd.DataFrame({
'CustomerID': [101], 'CustomerName': ['Alice'], 'Sales_Q1': [1500]
})
prefix_to_match = 'Customer'
# ✅ Create a boolean mask for column names
column_mask = df_example.columns.str.startswith(prefix_to_match)
print(f"Boolean mask for columns starting with '{prefix_to_match}':")
print(column_mask)
Output:
Boolean mask for columns starting with 'Customer':
[ True True False]
Applying the Mask with .loc
(Recommended)
Use df.loc[:, column_mask]
to select all rows (:
) and the columns where the mask is True
.
import pandas as pd
df_example = pd.DataFrame({
'CustomerID': [101], 'CustomerName': ['Alice'], 'Sales_Q1': [1500]
})
prefix_to_match = 'Customer'
column_mask = df_example.columns.str.startswith(prefix_to_match)
# ✅ Select columns using .loc and the boolean mask
df_selected_loc = df_example.loc[:, column_mask]
print(f"DataFrame with columns starting with '{prefix_to_match}' (using .loc):")
print(df_selected_loc)
Output:
DataFrame with columns starting with 'Customer' (using .loc):
CustomerID CustomerName
0 101 Alice
Applying the Mask with Direct Bracket Indexing
You can also use the boolean mask directly with df[df.columns[column_mask]]
.
import pandas as pd
df_selected_direct = df_example[df_example.columns[column_mask]]
print(f"DataFrame with columns starting with '{prefix_to_match}' (direct indexing):")
print(df_selected_direct) # Same output as .loc
Output:
DataFrame with columns starting with 'Customer' (using .loc):
CustomerID CustomerName
0 101 Alice
While df.loc[:, column_mask]
is generally preferred for clarity and explicit indexing, df[df.columns[column_mask]]
also works because df.columns[column_mask]
returns an Index of the selected column names.
For case-insensitive matching with str.startswith()
on an Index, you'd need to convert the index to a Series or list first to use .lower()
, or use flags=re.IGNORECASE
if using str.contains()
with a regex pattern (see Method 3).
Method 3: Using DataFrame.filter(regex=...)
(Powerful for Patterns)
The DataFrame.filter()
method can subset rows or columns. Using its regex
parameter with axis=1
allows powerful pattern matching on column names. To match columns starting with a string, use the caret ^
in the regex to anchor the pattern to the beginning of the name.
import pandas as pd
df_example = pd.DataFrame({
'CustomerID': [101], 'CustomerName': ['Alice'], 'Sales_Q1': [1500], 'sales_rep': ['John']
})
prefix_to_match_exact_case = 'Sales'
# Regex: r'^Sales' matches columns that begin with "Sales" (case-sensitive)
df_selected_filter_cs = df_example.filter(regex=r'^' + prefix_to_match_exact_case, axis=1)
print(f"Columns starting with '{prefix_to_match_exact_case}' (filter, case-sensitive):")
print(df_selected_filter_cs)
print()
# For case-insensitive matching with regex, use the (?i) flag in the pattern
prefix_any_case = 'sales'
df_selected_filter_ci = df_example.filter(regex=r'(?i)^' + prefix_any_case, axis=1)
# Or, you can use re.IGNORECASE if constructing pattern carefully:
# import re
# df_selected_filter_ci_re = df.filter(regex=re.compile(r'^' + prefix_any_case, flags=re.IGNORECASE), axis=1)
print(f"Columns starting with '{prefix_any_case}' (filter, case-insensitive):")
print(df_selected_filter_ci)
Output:
Columns starting with 'Sales' (filter, case-sensitive):
Sales_Q1
0 1500
Columns starting with 'sales' (filter, case-insensitive):
Sales_Q1 sales_rep
0 1500 John
axis=1
: Specifies that the filter should be applied to column names.regex=r'^Prefix'
: The^
asserts the start of the string.regex=r'(?i)^Prefix'
: The(?i)
flag within the regex pattern makes the match case-insensitive.
Choosing the Right Method
- List comprehension with
col.startswith(prefix)
: Very Pythonic, clear, and good for simple prefix matching. Easy to adapt for case-insensitivity. df.columns.str.startswith(prefix)
then.loc
: Vectorized and efficient. Good for boolean mask operations. Case-insensitivity is a bit more involved directly on the Index'sstr.startswith
.df.filter(regex=r'^Prefix', axis=1)
: Most powerful if you need more complex starting patterns beyond a simple prefix (e.g., "starts with 'Sales_' followed by a digit"). Handles case-insensitivity well with regex flags.
For simple prefix matching, the list comprehension or df.columns.str.startswith()
are often preferred for readability. For more complex patterns, filter(regex=...)
excels.
Conclusion
Selecting Pandas DataFrame columns that start with a specific string can be achieved in several ways:
- Use a list comprehension iterating over
df.columns
withcol_name.startswith(prefix)
. - Create a boolean mask using
df.columns.str.startswith(prefix)
and apply it withdf.loc[:, mask]
. - Employ
df.filter(regex=r'^' + prefix, axis=1)
for a concise and powerful regex-based solution.
Remember to handle case sensitivity as needed by converting strings to a consistent case or using appropriate flags/parameters (case=False
for str.contains
if used, or re.IGNORECASE
/(?i)
for regex).
These methods provide robust control over selecting columns based on their naming patterns.