Python Pandas: How to Count Unique Combinations of Two (or More) Columns

Counting the occurrences of unique combinations of values across two or more columns in a Pandas DataFrame is a fundamental operation in data analysis. This helps in understanding the distribution and frequency of specific data patterns. For instance, you might want to count how many times each unique pair of 'Product Category' and 'Region' appears in a sales dataset.

This guide demonstrates common and effective methods to count unique combinations of column values in Pandas, primarily using groupby().size() and DataFrame.value_counts().

The Goal: Counting Co-occurrences of Column Values

Given a Pandas DataFrame, we want to identify all unique combinations of values present in two (or more) specified columns and then count how many times each unique combination appears in the DataFrame. The result is typically a new DataFrame or Series showing these combinations and their corresponding counts.

Example DataFrame

import pandas as pd
import numpy as np # For NaN example

data = {
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
    'Action': ['View', 'Purchase', 'View', 'Purchase', 'View', 'Purchase', 'View', 'View']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
   UserID Product_Category Region    Action
0     101      Electronics  North      View
1     102            Books  South  Purchase
2     101      Electronics  North      View
3     103             Home   West  Purchase
4     102            Books  North      View
5     101          Apparel  North  Purchase
6     104             Home   West      View
7     101      Electronics   East      View

Let's say we want to count the unique combinations of 'Product_Category' and 'Region'.

Method 1: Using groupby().size() (Flexible)

This is a very flexible and widely used approach.

Getting Counts as a Series (MultiIndex)

Group by the columns of interest and then use .size() to get the number of rows in each group. The result is a Series with a MultiIndex formed by the unique combinations of the grouping columns.

import pandas as pd

df_example = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
})

# Group by 'Product_Category' and 'Region', then get the size of each group
combination_counts_series = df_example.groupby(['Product_Category', 'Region']).size()

print("Counts of unique combinations (Product_Category, Region) as a Series:")
print(combination_counts_series)

Output:

Counts of unique combinations (Product_Category, Region) as a Series:
Product_Category  Region
Apparel           North     1
Books             North     1
                  South     1
Electronics       East      1
                  North     2
Home              West      2
dtype: int64

Converting to a DataFrame with reset_index() and rename()

To transform this Series into a DataFrame with regular columns:

  1. reset_index(): Converts the MultiIndex levels into columns.
  2. rename(): Renames the default column name for counts (which is 0) to something more descriptive like 'count'.
import pandas as pd

df_example = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
})
combination_counts_series = df_example.groupby(['Product_Category', 'Region']).size()

# Convert to DataFrame
combination_counts_df = combination_counts_series.reset_index()
print("After reset_index():")
print(combination_counts_df)  # Column for counts is named 0
print()

# Rename the count column
combination_counts_df = combination_counts_df.rename(columns={0: 'Count'})

print("Counts of unique combinations as a DataFrame:")
print(combination_counts_df)

Output:

After reset_index():
  Product_Category Region  0
0          Apparel  North  1
1            Books  North  1
2            Books  South  1
3      Electronics   East  1
4      Electronics  North  2
5             Home   West  2

Counts of unique combinations as a DataFrame:
  Product_Category Region  Count
0          Apparel  North      1
1            Books  North      1
2            Books  South      1
3      Electronics   East      1
4      Electronics  North      2
5             Home   West      2

Using reset_index(name='Count') directly is a more concise way to achieve the reset and rename in one step.
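As a sketch of that one-step version, using a trimmed-down variant of the example data:

```python
import pandas as pd

df_example = pd.DataFrame({
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home'],
    'Region': ['North', 'South', 'North', 'West'],
})

# reset_index(name='Count') turns the MultiIndex Series into a DataFrame
# and names the count column in a single chained expression
counts = (
    df_example.groupby(['Product_Category', 'Region'])
    .size()
    .reset_index(name='Count')
)
print(counts)
```

This avoids the separate rename(columns={0: 'Count'}) step entirely.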

Using as_index=False in groupby()

If you set as_index=False in the groupby() call, the grouping columns are kept as regular columns and size() produces a DataFrame directly, with the count column named size (rename it if you prefer a different label).

import pandas as pd

df_example = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
})

# Use as_index=False so the grouping columns stay regular columns
combination_counts_df_as_index_false = df_example.groupby(['Product_Category', 'Region'], as_index=False).size()

print("Counts using groupby(as_index=False).size():")
print(combination_counts_df_as_index_false)

Output:

Counts using groupby(as_index=False).size():
  Product_Category Region  size
0          Apparel  North     1
1            Books  North     1
2            Books  South     1
3      Electronics   East     1
4      Electronics  North     2
5             Home   West     2

Method 2: Using DataFrame.value_counts() (Concise)

The DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True) method is excellent for directly counting unique row occurrences based on a subset of columns.

Getting Counts as a Series (MultiIndex)

By default, value_counts() on a subset of columns returns a Series with a MultiIndex.

import pandas as pd

df_example = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
})

# Select the columns for which to count unique combinations
columns_for_value_counts = ['Product_Category', 'Region']
value_counts_series = df_example[columns_for_value_counts].value_counts()

print("Counts of unique combinations using value_counts() (Series):")
print(value_counts_series)

Output (order might vary due to default sorting by count):

Counts of unique combinations using value_counts() (Series):
Product_Category  Region
Electronics       North     2
Home              West      2
Apparel           North     1
Books             North     1
                  South     1
Electronics       East      1
Name: count, dtype: int64

Converting to a DataFrame with reset_index(name='...')

Similar to the groupby().size() result, you can use reset_index() to convert this Series to a DataFrame.

import pandas as pd

df_example = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102, 101, 104, 101],
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
})
columns_for_value_counts = ['Product_Category', 'Region']
value_counts_series = df_example[columns_for_value_counts].value_counts()

# Convert to DataFrame and name the count column
value_counts_df = value_counts_series.reset_index(name='Frequency')

print("Counts of unique combinations using value_counts() (DataFrame):")
print(value_counts_df)

Output:

Counts of unique combinations using value_counts() (DataFrame):
  Product_Category Region  Frequency
0      Electronics  North          2
1             Home   West          2
2          Apparel  North          1
3            Books  North          1
4            Books  South          1
5      Electronics   East          1

Sorting Options

value_counts() sorts the results by count in descending order by default. Use ascending=True to sort in ascending order of counts, or sort=False to keep the order of appearance (though this might be less predictable for combinations).

# Sort by count ascending
value_counts_asc = df_example[columns_for_value_counts].value_counts(ascending=True).reset_index(name='Count')
print("Value counts (ascending):")
print(value_counts_asc)
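The normalize parameter from the signature above is also worth knowing: it returns each combination's share of the rows instead of a raw count (in recent pandas versions the resulting Series is named 'proportion'). A minimal sketch:

```python
import pandas as pd

df_example = pd.DataFrame({
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home'],
    'Region': ['North', 'South', 'North', 'West'],
})

# normalize=True divides each combination's count by the total number of rows
proportions = df_example[['Product_Category', 'Region']].value_counts(normalize=True)
print(proportions)
```

The proportions always sum to 1.0 over the counted (non-NaN) rows, which is handy for reporting percentages.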

Handling NaN Values (dropna parameter)

By default, value_counts(dropna=True) excludes combinations where any of the involved columns have a NaN value. Set dropna=False to include counts of combinations involving NaNs.

import pandas as pd
import numpy as np

df_with_nan = pd.DataFrame({
    'Product_Category': ['Electronics', 'Books', np.nan, 'Electronics', np.nan],
    'Region': ['North', 'South', 'North', 'North', 'West']
})

print("Value counts with NaN (dropna=True, default):")
print(df_with_nan[['Product_Category', 'Region']].value_counts(dropna=True).reset_index(name='Count'))
print()

print("Value counts with NaN (dropna=False):")
print(df_with_nan[['Product_Category', 'Region']].value_counts(dropna=False).reset_index(name='Count'))

Output:

Value counts with NaN (dropna=True, default):
  Product_Category Region  Count
0      Electronics  North      2
1            Books  South      1

Value counts with NaN (dropna=False):
  Product_Category Region  Count
0      Electronics  North      2
1            Books  South      1
2              NaN  North      1
3              NaN   West      1

Choosing the Right Method

  • df.groupby(cols_list).size():
    • Pros: Very flexible, part of the powerful groupby framework. Allows for further chained aggregations if needed. as_index=False can directly return a DataFrame.
    • Cons: Might require an explicit reset_index() and rename() if as_index=True (default).
  • df[cols_list].value_counts():
    • Pros: Very concise and often more direct for simple unique combination counting. Has convenient ascending and dropna parameters. Often names the count column by default (e.g., 'count' or 'proportion').
    • Cons: Primarily for counting; less direct if you immediately need other aggregations on the groups.

For simply counting unique combinations of two or more columns, df[cols_list].value_counts().reset_index(name='YourCountName') is often the most succinct and readable solution. groupby().size() is more general if you plan to do more with the groups.
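Both methods extend unchanged to three or more columns; as a sketch, counting (Product_Category, Region, Action) triples from the original example data:

```python
import pandas as pd

df = pd.DataFrame({
    'Product_Category': ['Electronics', 'Books', 'Electronics', 'Home',
                         'Books', 'Apparel', 'Home', 'Electronics'],
    'Region': ['North', 'South', 'North', 'West', 'North', 'North', 'West', 'East'],
    'Action': ['View', 'Purchase', 'View', 'Purchase', 'View', 'Purchase', 'View', 'View'],
})

# Counting triples works exactly like pairs: just list more columns
triple_counts = (
    df[['Product_Category', 'Region', 'Action']]
    .value_counts()
    .reset_index(name='Count')
)
print(triple_counts)
```

Here only (Electronics, North, View) occurs twice; every other triple is unique.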

Conclusion

Counting the unique combinations of values across multiple columns in a Pandas DataFrame is essential for understanding data distributions and relationships.

  • The DataFrame.groupby(['ColA', 'ColB']).size() method, followed by reset_index() and rename(), provides a flexible and robust way to achieve this. Using as_index=False in groupby can simplify this.
  • The df[['ColA', 'ColB']].value_counts() method offers a more concise and often more direct approach specifically tailored for counting unique row occurrences, easily convertible to a DataFrame using reset_index(name='count_col_name').

Both methods are effective. Choose the one that you find clearer and that best fits into your broader data analysis workflow.

Remember to consider how NaN values should be handled by using the dropna parameter in value_counts() or by preprocessing NaNs before groupby().
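Note that groupby() also accepts its own dropna parameter (available since pandas 1.1), so NaN keys can be kept without preprocessing. A minimal sketch:

```python
import pandas as pd
import numpy as np

df_with_nan = pd.DataFrame({
    'Product_Category': ['Electronics', 'Books', np.nan, 'Electronics', np.nan],
    'Region': ['North', 'South', 'North', 'North', 'West'],
})

# groupby() drops NaN keys by default; dropna=False keeps them in the result
nan_counts = (
    df_with_nan.groupby(['Product_Category', 'Region'], dropna=False)
    .size()
    .reset_index(name='Count')
)
print(nan_counts)
```

This mirrors value_counts(dropna=False), producing one row per combination including those with NaN in a grouping column.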