Python Pandas: How to Filter DataFrame Rows by Value Counts in a Column
Filtering a Pandas DataFrame based on the frequency or count of values in a specific column is a common data analysis task. For instance, you might want to select only rows where a category appears more than N times, or identify items that are unique (appear only once).
This guide explains several effective methods to filter a Pandas DataFrame based on the value counts of a column, using groupby().filter(), groupby().transform('size'), and map() with value_counts().
The Goal: Filtering Based on Frequency
We want to select rows from a DataFrame where the value in a particular column meets a certain frequency criterion. For example, keep only rows where the value in 'Category' appears at least 3 times in the entire DataFrame.
Example DataFrame
Consider this DataFrame:
import pandas as pd
data = {
'OrderID': [101, 102, 103, 104, 105, 106, 107, 108],
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Webcam'],
'Region': ['North', 'North', 'South', 'West', 'North', 'South', 'North', 'West'],
'Quantity': [1, 2, 1, 1, 2, 3, 1, 1]
}
df_original = pd.DataFrame(data)
print("Original DataFrame:")
print(df_original)
Let's say we want to filter rows based on the count of occurrences in the 'Product' column.
Method 1: Using groupby().filter() (Recommended for Direct Filtering)
The DataFrameGroupBy.filter(func) method allows you to filter out entire groups based on a function that returns True or False for each group.
import pandas as pd
df = pd.DataFrame({
'OrderID': [101, 102, 103, 104, 105, 106, 107, 108],
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Webcam'],
'Region': ['North', 'North', 'South', 'West', 'North', 'South', 'North', 'West'],
'Quantity': [1, 2, 1, 1, 2, 3, 1, 1]
})
# Filter for products that appear more than once
# The lambda receives each group as a sub-DataFrame
# len(group_df) gives the number of rows in that group (i.e., the value count)
df_filtered_by_count = df.groupby('Product').filter(lambda group_df: len(group_df) > 1)
print("DataFrame filtered for products appearing > 1 time (using groupby().filter()):")
print(df_filtered_by_count)
Output:
DataFrame filtered for products appearing > 1 time (using groupby().filter()):
OrderID Product Region Quantity
0 101 Laptop North 1
1 102 Mouse North 2
3 104 Laptop West 1
5 106 Mouse South 3
6 107 Laptop North 1
- df.groupby('Product'): Groups the DataFrame by unique values in the 'Product' column.
- .filter(lambda group_df: len(group_df) > 1): For each product group, len(group_df) calculates its size (count). If the count is greater than 1, all rows belonging to that product group are kept; otherwise, they are filtered out.
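The same pattern handles the opposite criterion just as easily. A short sketch, using the example data above, that keeps only products appearing exactly once (the "unique items" case from the introduction):

```python
import pandas as pd

df = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Webcam'],
})

# Keep only rows whose product occurs exactly once in the whole DataFrame
df_unique = df.groupby('Product').filter(lambda g: len(g) == 1)
print(df_unique)
```

This keeps the Keyboard, Monitor, and Webcam rows, since each of those products appears only once.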
Method 2: Using groupby().transform('size')
This method first calculates the size (count) of each group and "transforms" this count back to align with the original DataFrame's index. Then, it uses this new Series of counts for boolean indexing.
import pandas as pd
df = pd.DataFrame({
'OrderID': [101, 102, 103, 104, 105, 106, 107, 108],
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Webcam'],
'Region': ['North', 'North', 'South', 'West', 'North', 'South', 'North', 'West'],
'Quantity': [1, 2, 1, 1, 2, 3, 1, 1]
})
# Create a Series where each element is the count of its 'Product' group
product_counts = df.groupby('Product')['Product'].transform('size')
print("Product counts transformed back to original index:")
print(product_counts)
print()
# ✅ Use this Series to create a boolean mask for filtering
df_filtered_transform = df[product_counts > 1]
print("DataFrame filtered for products appearing > 1 time (using groupby().transform('size')):")
print(df_filtered_transform)
Output:
Product counts transformed back to original index:
0 3
1 2
2 1
3 3
4 1
5 2
6 3
7 1
Name: Product, dtype: int64
DataFrame filtered for products appearing > 1 time (using groupby().transform('size')):
OrderID Product Region Quantity
0 101 Laptop North 1
1 102 Mouse North 2
3 104 Laptop West 1
5 106 Mouse South 3
6 107 Laptop North 1
- df.groupby('Product')['Product'].transform('size'): Groups by 'Product', then calculates each group's size (count) and aligns that count with every row belonging to the group in the original DataFrame's shape.
- df[product_counts > 1]: Standard boolean indexing using the count Series as a mask.
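One advantage of having the counts as an index-aligned mask is that it combines naturally with other boolean conditions. A sketch, using the example data, that keeps repeated products sold in the 'North' region:

```python
import pandas as pd

df = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Webcam'],
    'Region': ['North', 'North', 'South', 'West', 'North', 'South', 'North', 'West'],
})

counts = df.groupby('Product')['Product'].transform('size')
# Combine the frequency mask with a second condition using & (note the parentheses)
df_combined = df[(counts > 1) & (df['Region'] == 'North')]
print(df_combined)
```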
Method 3: Using Series.map() with value_counts()
This approach involves:
- Calculating the value counts for the target column using Series.value_counts(). This returns a Series whose index is the unique values and whose values are their counts.
- Using Series.map() to map these counts back to each row in the original DataFrame based on the value in the target column.
- Filtering based on these mapped counts.
import pandas as pd
df = pd.DataFrame({
'OrderID': [101, 102, 103, 104, 105, 106, 107, 108],
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Webcam'],
'Region': ['North', 'North', 'South', 'West', 'North', 'South', 'North', 'West'],
'Quantity': [1, 2, 1, 1, 2, 3, 1, 1]
})
# Get value counts for the 'Product' column
product_value_counts = df['Product'].value_counts()
print("Value counts for 'Product':")
print(product_value_counts)
print()
# Map these counts back to the original DataFrame's 'Product' column
# This creates a Series aligned with df's index, containing the count for each product
mapped_counts = df['Product'].map(product_value_counts)
print("Mapped counts:")
print(mapped_counts)
print()
# ✅ Use the mapped counts to filter
df_filtered_map = df[mapped_counts > 1]
print("DataFrame filtered for products appearing > 1 time (using map() and value_counts()):")
print(df_filtered_map)
Output:
Value counts for 'Product':
Product
Laptop 3
Mouse 2
Keyboard 1
Monitor 1
Webcam 1
Name: count, dtype: int64
Mapped counts:
0 3
1 2
2 1
3 3
4 1
5 2
6 3
7 1
Name: Product, dtype: int64
DataFrame filtered for products appearing > 1 time (using map() and value_counts()):
OrderID Product Region Quantity
0 101 Laptop North 1
1 102 Mouse North 2
3 104 Laptop West 1
5 106 Mouse South 3
6 107 Laptop North 1
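A closely related variant, offered here as an extra sketch rather than one of the methods above, first selects the qualifying values from value_counts() and then filters with Series.isin():

```python
import pandas as pd

df = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Webcam'],
})

counts = df['Product'].value_counts()
# Index of products that appear more than once
frequent = counts[counts > 1].index
df_isin = df[df['Product'].isin(frequent)]
print(df_isin)
```

This selects the same rows as the map() approach; some readers find the explicit "list of frequent values" step easier to follow.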
Adding a Value Count Column (For Context or Later Filtering)
If you want to add the count as a new column before filtering (or just for informational purposes), groupby().transform('size') or map() are suitable.
import pandas as pd
df = pd.DataFrame({
'OrderID': [101, 102, 103, 104, 105, 106, 107, 108],
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Webcam'],
'Region': ['North', 'North', 'South', 'West', 'North', 'South', 'North', 'West'],
'Quantity': [1, 2, 1, 1, 2, 3, 1, 1]
})
# Add 'Product_Count' column using transform
df['Product_Count_Transform'] = df.groupby('Product')['Product'].transform('size')
# Add 'Product_Count' column using map and value_counts
# product_value_counts = df['Product'].value_counts()
# df['Product_Count_Map'] = df['Product'].map(product_value_counts)
print("DataFrame with 'Product_Count' column (using transform):")
print(df)
print()
# Now you can filter using this new column
df_filtered_from_new_col = df[df['Product_Count_Transform'] > 1]
print("Filtered based on the new count column:")
print(df_filtered_from_new_col.drop(columns=['Product_Count_Transform'])) # Drop count col for same output as others
Output:
DataFrame with 'Product_Count' column (using transform):
OrderID Product Region Quantity Product_Count_Transform
0 101 Laptop North 1 3
1 102 Mouse North 2 2
2 103 Keyboard South 1 1
3 104 Laptop West 1 3
4 105 Monitor North 2 1
5 106 Mouse South 3 2
6 107 Laptop North 1 3
7 108 Webcam West 1 1
Filtered based on the new count column:
OrderID Product Region Quantity
0 101 Laptop North 1
1 102 Mouse North 2
3 104 Laptop West 1
5 106 Mouse South 3
6 107 Laptop North 1
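The same transform trick extends to counting combinations of columns. A sketch (the 'Pair_Count' column name is illustrative) that counts each (Product, Region) pair:

```python
import pandas as pd

df = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Webcam'],
    'Region': ['North', 'North', 'South', 'West', 'North', 'South', 'North', 'West'],
})

# Count occurrences of each (Product, Region) combination, aligned per row
df['Pair_Count'] = df.groupby(['Product', 'Region'])['Product'].transform('size')
print(df)
```

Only the (Laptop, North) pair occurs more than once in the example data, so filtering on Pair_Count > 1 would keep just those two rows.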
Choosing the Right Method
- df.groupby('col').filter(lambda group: condition_on_group_size): Most direct and often very readable for filtering out groups that don't meet a size criterion. The condition is applied to the group as a whole.
- df[df.groupby('col')['col'].transform('size') > N]: Very effective and Pythonic. It computes the group size for each row first, then uses simple boolean indexing. Often preferred for its clarity when the condition is a simple comparison against a count.
- df[df['col'].map(df['col'].value_counts()) > N]: Works well and can be intuitive, though it requires two steps (value_counts, then map). Performance may be slightly worse than transform for very large DataFrames but is generally good.

All three methods achieve the goal. groupby().transform() is often a strong contender for adding the count back or for direct boolean masking, while groupby().filter() is excellent when the logic for keeping or discarding a group is based on its properties (like size).
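As a quick sanity check on the example data, the three approaches select exactly the same rows (all of them preserve the original row order and index):

```python
import pandas as pd

df = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105, 106, 107, 108],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Laptop', 'Webcam'],
})

a = df.groupby('Product').filter(lambda g: len(g) > 1)
b = df[df.groupby('Product')['Product'].transform('size') > 1]
c = df[df['Product'].map(df['Product'].value_counts()) > 1]

# Identical rows, order, and index => the DataFrames compare equal
assert a.equals(b) and b.equals(c)
print(a['OrderID'].tolist())
```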
Conclusion
Filtering a Pandas DataFrame by the value counts of a column allows you to focus on more frequent or rare occurrences.
- groupby().filter() provides a direct way to filter groups based on their size.
- groupby().transform('size') creates a Series of group sizes aligned with the original DataFrame, enabling easy boolean indexing.
- Series.map() with value_counts() achieves similar results by mapping pre-calculated counts back onto each row.
Choose the method that you find most readable and that best fits the structure of your filtering logic. These techniques are essential tools for data cleaning, exploratory data analysis, and feature engineering in Pandas.