Python Pandas: How to Count Unique Combinations of Two (or More) Columns
Counting the occurrences of unique combinations of values across two or more columns in a Pandas DataFrame is a fundamental operation in data analysis. This helps in understanding the distribution and frequency of specific data patterns. For instance, you might want to count how many times each unique pair of 'Product Category' and 'Region' appears in a sales dataset.
This guide demonstrates common and effective methods to count unique combinations of column values in Pandas, primarily using groupby().size() and DataFrame.value_counts().
The Goal: Counting Co-occurrences of Column Values
Given a Pandas DataFrame, we want to identify all unique combinations of values present in two (or more) specified columns and then count how many times each unique combination appears in the DataFrame. The result is typically a new DataFrame or Series showing these combinations and their corresponding counts.
Example DataFrame
import pandas as pd
import numpy as np # For NaN example
data = {
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
    'Action': ['View', 'Purchase', 'View', 'Purchase', 'View', 'Purchase', 'View', 'View']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
UserID Product_Category Region Action
0 101 Electronics North View
1 102 Books South Purchase
2 101 Electronics North View
3 103 Home West Purchase
4 102 Books North View
5 101 Apparel North Purchase
6 104 Home West View
7 101 Electronics East View
Let's say we want to count the unique combinations of 'Product_Category' and 'Region'.
Method 1: Using DataFrame.groupby() and size() (Recommended)
This is a very flexible and widely used approach.
Getting Counts as a Series (MultiIndex)
Group by the columns of interest, then use .size() to get the number of rows in each group. The result is a Series with a MultiIndex formed by the unique combinations of the grouping columns.
import pandas as pd
df_example = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
})
# Group by 'Product_Category' and 'Region', then get the size of each group
combination_counts_series = df_example.groupby(['Product_Category', 'Region']).size()
print("Counts of unique combinations (Product_Category, Region) as a Series:")
print(combination_counts_series)
Output:
Counts of unique combinations (Product_Category, Region) as a Series:
Product_Category  Region
Apparel           North     1
Books             North     1
                  South     1
Electronics       East      1
                  North     2
Home              West      2
dtype: int64
Converting to a DataFrame with reset_index() and rename()
To transform this Series into a DataFrame with regular columns:
- reset_index(): Converts the MultiIndex levels into columns.
- rename(): Renames the default column name for counts (which is 0) to something more descriptive like 'Count'.
import pandas as pd
df_example = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
})
combination_counts_series = df_example.groupby(['Product_Category', 'Region']).size()
# Convert to DataFrame
combination_counts_df = combination_counts_series.reset_index()
print("After reset_index():")
print(combination_counts_df)  # Column for counts is named 0
print()
# Rename the count column
combination_counts_df = combination_counts_df.rename(columns={0: 'Count'})
print("Counts of unique combinations as a DataFrame:")
print(combination_counts_df)
Output:
After reset_index():
  Product_Category Region  0
0          Apparel  North  1
1            Books  North  1
2            Books  South  1
3      Electronics   East  1
4      Electronics  North  2
5             Home   West  2

Counts of unique combinations as a DataFrame:
  Product_Category Region  Count
0          Apparel  North      1
1            Books  North      1
2            Books  South      1
3      Electronics   East      1
4      Electronics  North      2
5             Home   West      2
Using reset_index(name='Count') directly is a more concise way to achieve the reset and rename in one step.
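For example, a minimal sketch combining both steps (the small df_example here is illustrative):

```python
import pandas as pd

df_example = pd.DataFrame({
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home'],
    'Region': ['North', 'South', 'North', 'West'],
})

# reset_index(name=...) converts the MultiIndex Series to a DataFrame
# and names the count column in a single step
counts = df_example.groupby(['Product_Category', 'Region']).size().reset_index(name='Count')
print(counts)
```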
Using as_index=False in groupby()
If you set as_index=False in the groupby() call, the grouping columns are kept as regular columns and size() produces a DataFrame directly, though the count column (named size) might still need renaming.
import pandas as pd
df_example = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
})
# Use as_index=False
combination_counts_df_as_index_false = df_example.groupby(['Product_Category', 'Region'], as_index=False).size()
print("Counts using groupby(as_index=False).size():")
print(combination_counts_df_as_index_false)
Output:
Counts using groupby(as_index=False).size():
  Product_Category Region  size
0          Apparel  North     1
1            Books  North     1
2            Books  South     1
3      Electronics   East     1
4      Electronics  North     2
5             Home   West     2
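If you prefer a different column name, the size column can be renamed in the same chain; a small sketch (the frame below is illustrative):

```python
import pandas as pd

df_example = pd.DataFrame({
    'Product_Category': ['Electronics', 'Books', 'Electronics'],
    'Region': ['North', 'South', 'North'],
})

# as_index=False keeps the grouping columns as regular columns;
# rename() replaces the default 'size' column name
counts = (
    df_example
    .groupby(['Product_Category', 'Region'], as_index=False)
    .size()
    .rename(columns={'size': 'Count'})
)
print(counts)
```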
Method 2: Using DataFrame.value_counts() (Concise)
The DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True) method is excellent for directly counting unique row occurrences based on a subset of columns.
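As a quick illustration of the normalize parameter (the frame below is made up for the sketch), normalize=True returns relative frequencies instead of raw counts:

```python
import pandas as pd

df_example = pd.DataFrame({
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home'],
    'Region': ['North', 'South', 'North', 'West'],
})

# normalize=True divides each combination's count by the total number
# of counted rows, yielding proportions that sum to 1
proportions = df_example[['Product_Category', 'Region']].value_counts(normalize=True)
print(proportions)
```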
Getting Counts as a Series (MultiIndex)
By default, value_counts() on a subset of columns returns a Series with a MultiIndex.
import pandas as pd
df_example = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
})
# Select the columns for which to count unique combinations
columns_for_value_counts = ['Product_Category', 'Region']
value_counts_series = df_example[columns_for_value_counts].value_counts()
print("Counts of unique combinations using value_counts() (Series):")
print(value_counts_series)
Output (order might vary due to default sorting by count):
Counts of unique combinations using value_counts() (Series):
Product_Category  Region
Electronics       North     2
Home              West      2
Apparel           North     1
Books             North     1
                  South     1
Electronics       East      1
Name: count, dtype: int64
Converting to a DataFrame with reset_index(name='...')
Similar to the groupby().size() result, you can use reset_index() to convert this Series to a DataFrame.
import pandas as pd
df_example = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
})
columns_for_value_counts = ['Product_Category', 'Region']
value_counts_series = df_example[columns_for_value_counts].value_counts()
# Convert to DataFrame and name the count column
value_counts_df = value_counts_series.reset_index(name='Frequency')
print("Counts of unique combinations using value_counts() (DataFrame):")
print(value_counts_df)
Output (order might vary due to default sorting by count):
Counts of unique combinations using value_counts() (DataFrame):
  Product_Category Region  Frequency
0      Electronics  North          2
1             Home   West          2
2          Apparel  North          1
3            Books  North          1
4            Books  South          1
5      Electronics   East          1
Sorting Options
value_counts() sorts the results by count in descending order by default. Use ascending=True to sort in ascending order of counts, or sort=False to leave the results unsorted by count (the order then follows the underlying grouping and may be less predictable for combinations).
# df_example and columns_for_value_counts as defined above
# Sort by count ascending
value_counts_asc = df_example[columns_for_value_counts].value_counts(ascending=True).reset_index(name='Count')
print("Value counts (ascending):")
print(value_counts_asc)
Handling NaN Values (the dropna parameter)
By default, value_counts(dropna=True) excludes combinations where any of the involved columns contain a NaN value. Set dropna=False to include counts of combinations involving NaNs.
import pandas as pd
import numpy as np
df_with_nan = pd.DataFrame({
    'Product_Category': ['Electronics', 'Books', np.nan, 'Electronics', np.nan],
    'Region': ['North', 'South', 'North', 'North', 'West']
})
print("Value counts with NaN (dropna=True, default):")
print(df_with_nan[['Product_Category', 'Region']].value_counts(dropna=True).reset_index(name='Count'))
print()
print("Value counts with NaN (dropna=False):")
print(df_with_nan[['Product_Category', 'Region']].value_counts(dropna=False).reset_index(name='Count'))
Output:
Value counts with NaN (dropna=True, default):
  Product_Category Region  Count
0      Electronics  North      2
1            Books  South      1

Value counts with NaN (dropna=False):
  Product_Category Region  Count
0      Electronics  North      2
1            Books  South      1
2              NaN  North      1
3              NaN   West      1
Choosing the Right Method
df.groupby(cols_list).size():
- Pros: Very flexible and part of the powerful groupby framework; allows further chained aggregations if needed. as_index=False can directly return a DataFrame.
- Cons: Might require an explicit reset_index() and rename() if as_index=True (the default).
df[cols_list].value_counts():
- Pros: Very concise and often more direct for simple unique combination counting. Has convenient ascending and dropna parameters, and names the count column by default (e.g., 'count' or 'proportion').
- Cons: Primarily for counting; less direct if you immediately need other aggregations on the groups.
For simply counting unique combinations of two or more columns, df[cols_list].value_counts().reset_index(name='YourCountName') is often the most succinct and readable solution. groupby().size() is more general if you plan to do more with the groups.
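To illustrate that generality, here is a sketch using groupby with named aggregation to count rows and distinct users per combination in one pass (the Unique_Users column name and the small frame are illustrative):

```python
import pandas as pd

df_example = pd.DataFrame({
    'UserID': [101, 102, 101, 103],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home'],
    'Region': ['North', 'South', 'North', 'West'],
})

# Named aggregation: count the rows in each combination and, in the
# same pass, count the distinct users per combination
summary = df_example.groupby(['Product_Category', 'Region'], as_index=False).agg(
    Count=('UserID', 'size'),
    Unique_Users=('UserID', 'nunique'),
)
print(summary)
```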
Conclusion
Counting the unique combinations of values across multiple columns in a Pandas DataFrame is essential for understanding data distributions and relationships.
- The DataFrame.groupby(['ColA', 'ColB']).size() method, followed by reset_index() and rename(), provides a flexible and robust way to achieve this. Using as_index=False in groupby can simplify it.
- The df[['ColA', 'ColB']].value_counts() method offers a more concise approach specifically tailored for counting unique row occurrences, easily convertible to a DataFrame using reset_index(name='count_col_name').
Both methods are effective. Choose the one that you find clearer and that best fits into your broader data analysis workflow.
Remember to consider how NaN values should be handled, either via the dropna parameter in value_counts() or by preprocessing NaNs before groupby().
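As an alternative to preprocessing, groupby() itself also accepts a dropna parameter (available since pandas 1.1); a minimal sketch:

```python
import pandas as pd
import numpy as np

df_with_nan = pd.DataFrame({
    'Product_Category': ['Electronics', 'Books', np.nan, 'Electronics'],
    'Region': ['North', 'South', 'North', 'North'],
})

# dropna=False keeps groups whose keys contain NaN, mirroring
# value_counts(dropna=False)
counts = (
    df_with_nan
    .groupby(['Product_Category', 'Region'], dropna=False)
    .size()
    .reset_index(name='Count')
)
print(counts)
```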