Skip to main content

Python Pandas: How to GroupBy Columns with NaN (Missing) Values

When performing groupby() operations in Pandas, the default behavior is to exclude rows where the grouping key(s) contain NaN (Not a Number) or missing values. However, there are scenarios where you might want to include these NaN values as a distinct group in your analysis or replace them with a placeholder before grouping.

This guide explains how Pandas groupby() handles NaNs by default and demonstrates methods to include NaNs as a group or to preprocess them using fillna() or astype(str).

Understanding groupby() Behavior with NaN Values

By default, when you use df.groupby('column_with_nan'), Pandas implicitly sets an internal parameter dropna=True. This means that any rows where 'column_with_nan' has a NaN value will be dropped from consideration before the groups are formed and aggregations are calculated. The NaN values effectively disappear from the grouping keys.

Example DataFrame with NaNs

We'll use the following DataFrame which contains NaN values in the 'Category' column, a common column to group by.

import pandas as pd
import numpy as np # For np.nan

data = {
'ProductID': [101, 102, 103, 104, 105, 106, 107, 108],
'Category': ['Electronics', 'Books', np.nan, 'Electronics', 'Home Goods', np.nan, 'Books', 'Home Goods'],
'Sales': [200, 25, 50, 300, 150, 40, 30, 120],
'Quantity': [2, 1, 3, 3, 5, 2, 1, 4]
}
df_original = pd.DataFrame(data)
print("Original DataFrame:")
print(df_original)

Output:

Original DataFrame:
ProductID Category Sales Quantity
0 101 Electronics 200 2
1 102 Books 25 1
2 103 NaN 50 3
3 104 Electronics 300 3
4 105 Home Goods 150 5
5 106 NaN 40 2
6 107 Books 30 1
7 108 Home Goods 120 4

The DataFrame.groupby() method has a dropna parameter. Setting dropna=False instructs Pandas to treat NaN values in the grouping key(s) as a valid group label.

import pandas as pd
import numpy as np

df = pd.DataFrame({
'ProductID': [101, 102, 103, 104, 105, 106, 107, 108],
'Category': ['Electronics', 'Books', np.nan, 'Electronics', 'Home Goods', np.nan, 'Books', 'Home Goods'],
'Sales': [200, 25, 50, 300, 150, 40, 30, 120],
'Quantity': [2, 1, 3, 3, 5, 2, 1, 4]
})

# Default behavior: dropna=True (NaNs in 'Category' are excluded from grouping keys)
print("GroupBy 'Category' (default, dropna=True) - sum of Sales:")
grouped_default = df.groupby('Category')['Sales'].sum()
print(grouped_default)
print()

# ✅ Include NaN as a group by setting dropna=False
print("GroupBy 'Category' (dropna=False) - sum of Sales:")
grouped_with_nan = df.groupby('Category', dropna=False)['Sales'].sum()
print(grouped_with_nan)

Output:

GroupBy 'Category' (default, dropna=True) - sum of Sales:
Category
Books 55
Electronics 500
Home Goods 270
Name: Sales, dtype: int64

GroupBy 'Category' (dropna=False) - sum of Sales:
Category
Books 55
Electronics 500
Home Goods 270
NaN 90
Name: Sales, dtype: int64
note

This is the most direct and idiomatic way to include NaNs as a separate category in your groupby operations. The NaN group will be labeled as NaN in the resulting grouped object's index.

Method 2: Replacing NaNs Before Grouping using fillna()

If you prefer to have a specific placeholder label for the NaN group instead of NaN itself, you can use DataFrame.fillna() to replace NaN values in the grouping column(s) before calling groupby().

import pandas as pd
import numpy as np

df = pd.DataFrame({
'ProductID': [101, 102, 103, 104, 105, 106, 107, 108],
'Category': ['Electronics', 'Books', np.nan, 'Electronics', 'Home Goods', np.nan, 'Books', 'Home Goods'],
'Sales': [200, 25, 50, 300, 150, 40, 30, 120],
'Quantity': [2, 1, 3, 3, 5, 2, 1, 4]
})

# Create a copy to modify for this example
df_filled = df.copy()

# ✅ Replace NaN in 'Category' with a placeholder string
placeholder = 'Uncategorized'
df_filled['Category'] = df_filled['Category'].fillna(placeholder)
# You can also do this for the whole DataFrame: df_filled = df.fillna({'Category': placeholder})

print(f"DataFrame after filling NaN in 'Category' with '{placeholder}':")
print(df_filled)
print()

# Now groupby will use the placeholder as a group label
grouped_filled = df_filled.groupby('Category')['Sales'].mean() # Example: get mean sales
print(f"GroupBy 'Category' after fillna('{placeholder}') - mean Sales:")
print(grouped_filled)

Output:

DataFrame after filling NaN in 'Category' with 'Uncategorized':
ProductID Category Sales Quantity
0 101 Electronics 200 2
1 102 Books 25 1
2 103 Uncategorized 50 3
3 104 Electronics 300 3
4 105 Home Goods 150 5
5 106 Uncategorized 40 2
6 107 Books 30 1
7 108 Home Goods 120 4

GroupBy 'Category' after fillna('Uncategorized') - mean Sales:
Category
Books 27.5
Electronics 250.0
Home Goods 135.0
Uncategorized 45.0
Name: Sales, dtype: float64
  • Choose a placeholder string that makes sense for your data (e.g., "Unknown", "Missing", "N/A").
  • This method modifies the data before grouping. If you need to preserve the original NaNs elsewhere, work on a copy of the DataFrame or the specific column.

Method 3: Converting Grouping Column to String using astype(str)

Another way to ensure NaN values are treated as a distinct group is to convert the entire grouping column to the string data type using astype(str). NaN values will be converted to the literal string "nan".

import pandas as pd
import numpy as np

df = pd.DataFrame({
'ProductID': [101, 102, 103, 104, 105, 106, 107, 108],
'Category': ['Electronics', 'Books', np.nan, 'Electronics', 'Home Goods', np.nan, 'Books', 'Home Goods'],
'Sales': [200, 25, 50, 300, 150, 40, 30, 120],
'Quantity': [2, 1, 3, 3, 5, 2, 1, 4]
})

df_str_type = df.copy()

# ✅ Convert the 'Category' column to string type
df_str_type['Category'] = df_str_type['Category'].astype(str)
print("'Category' column after astype(str):")
print(df_str_type['Category'])
print()

# Groupby will now treat "nan" as a regular string category
grouped_str_type = df_str_type.groupby('Category')['Quantity'].sum() # Example: sum quantity
print("GroupBy 'Category' after astype(str) - sum Quantity:")
print(grouped_str_type)

Output:

'Category' column after astype(str):
0 Electronics
1 Books
2 nan
3 Electronics
4 Home Goods
5 nan
6 Books
7 Home Goods
Name: Category, dtype: object

GroupBy 'Category' after astype(str) - sum Quantity:
Category
Books 2
Electronics 5
Home Goods 9
nan 5
Name: Quantity, dtype: int64
note

This method also changes the data. The original NaN values are now string literals "nan".

Choosing the Right Method

  • groupby('col', dropna=False): Most recommended if you want to include NaNs as a distinct group in your aggregation without altering the original data and are happy with the NaN label in the result. It's clean and direct.
  • fillna('placeholder').groupby('col'): Use this if you need a specific, human-readable label (other than NaN) for the group of missing values in your groupby output. This modifies the data before grouping.
  • astype(str).groupby('col'): Use this if you want to treat all values in the grouping column, including NaNs (which become "nan" strings), uniformly as strings during grouping. This also modifies the data.

Conclusion

Pandas groupby() by default excludes NaN values from the grouping keys. To include them:

  1. Set the dropna=False parameter in your groupby() call: df.groupby('column_with_nan', dropna=False). This is the most direct way to treat NaNs as a separate group.
  2. Alternatively, preprocess the NaN values in your grouping column before calling groupby():
    • Use df['col'].fillna('SomePlaceholder') to replace NaNs with a specific string label.
    • Use df['col'].astype(str) to convert all values (including NaNs, which become the string "nan") to strings.

Choosing the best approach depends on whether you want to modify your original data and how you want the NaN group to be represented in the output of your aggregation. The dropna=False option is often the cleanest for including NaNs without data modification.