Python Pandas: How to GroupBy Columns with NaN (Missing) Values
When performing groupby()
operations in Pandas, the default behavior is to exclude rows where the grouping key(s) contain NaN
(Not a Number) or missing values. However, there are scenarios where you might want to include these NaN
values as a distinct group in your analysis or replace them with a placeholder before grouping.
This guide explains how Pandas groupby()
handles NaN
s by default and demonstrates methods to include NaN
s as a group or to preprocess them using fillna()
or astype(str)
.
Understanding groupby()
Behavior with NaN
Values
By default, when you use df.groupby('column_with_nan')
, Pandas implicitly sets an internal parameter dropna=True
. This means that any rows where 'column_with_nan' has a NaN
value will be dropped from consideration before the groups are formed and aggregations are calculated. The NaN
values effectively disappear from the grouping keys.
Example DataFrame with NaN
s
We'll use the following DataFrame which contains NaN
values in the 'Category' column, a common column to group by.
import pandas as pd
import numpy as np # For np.nan
data = {
'ProductID': [101, 102, 103, 104, 105, 106, 107, 108],
'Category': ['Electronics', 'Books', np.nan, 'Electronics', 'Home Goods', np.nan, 'Books', 'Home Goods'],
'Sales': [200, 25, 50, 300, 150, 40, 30, 120],
'Quantity': [2, 1, 3, 3, 5, 2, 1, 4]
}
df_original = pd.DataFrame(data)
print("Original DataFrame:")
print(df_original)
Output:
Original DataFrame:
ProductID Category Sales Quantity
0 101 Electronics 200 2
1 102 Books 25 1
2 103 NaN 50 3
3 104 Electronics 300 3
4 105 Home Goods 150 5
5 106 NaN 40 2
6 107 Books 30 1
7 108 Home Goods 120 4
Method 1: Including NaN
s as a Group using dropna=False
(Recommended)
The DataFrame.groupby()
method has a dropna
parameter. Setting dropna=False
instructs Pandas to treat NaN
values in the grouping key(s) as a valid group label.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ProductID': [101, 102, 103, 104, 105, 106, 107, 108],
'Category': ['Electronics', 'Books', np.nan, 'Electronics', 'Home Goods', np.nan, 'Books', 'Home Goods'],
'Sales': [200, 25, 50, 300, 150, 40, 30, 120],
'Quantity': [2, 1, 3, 3, 5, 2, 1, 4]
})
# Default behavior: dropna=True (NaNs in 'Category' are excluded from grouping keys)
print("GroupBy 'Category' (default, dropna=True) - sum of Sales:")
grouped_default = df.groupby('Category')['Sales'].sum()
print(grouped_default)
print()
# ✅ Include NaN as a group by setting dropna=False
print("GroupBy 'Category' (dropna=False) - sum of Sales:")
grouped_with_nan = df.groupby('Category', dropna=False)['Sales'].sum()
print(grouped_with_nan)
Output:
GroupBy 'Category' (default, dropna=True) - sum of Sales:
Category
Books 55
Electronics 500
Home Goods 270
Name: Sales, dtype: int64
GroupBy 'Category' (dropna=False) - sum of Sales:
Category
Books 55
Electronics 500
Home Goods 270
NaN 90
Name: Sales, dtype: int64
This is the most direct and idiomatic way to include NaN
s as a separate category in your groupby
operations. The NaN
group will be labeled as NaN
in the resulting grouped object's index.
Method 2: Replacing NaN
s Before Grouping using fillna()
If you prefer to have a specific placeholder label for the NaN
group instead of NaN
itself, you can use DataFrame.fillna()
to replace NaN
values in the grouping column(s) before calling groupby()
.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ProductID': [101, 102, 103, 104, 105, 106, 107, 108],
'Category': ['Electronics', 'Books', np.nan, 'Electronics', 'Home Goods', np.nan, 'Books', 'Home Goods'],
'Sales': [200, 25, 50, 300, 150, 40, 30, 120],
'Quantity': [2, 1, 3, 3, 5, 2, 1, 4]
})
# Create a copy to modify for this example
df_filled = df.copy()
# ✅ Replace NaN in 'Category' with a placeholder string
placeholder = 'Uncategorized'
df_filled['Category'] = df_filled['Category'].fillna(placeholder)
# You can also do this for the whole DataFrame: df_filled = df.fillna({'Category': placeholder})
print(f"DataFrame after filling NaN in 'Category' with '{placeholder}':")
print(df_filled)
print()
# Now groupby will use the placeholder as a group label
grouped_filled = df_filled.groupby('Category')['Sales'].mean() # Example: get mean sales
print(f"GroupBy 'Category' after fillna('{placeholder}') - mean Sales:")
print(grouped_filled)
Output:
DataFrame after filling NaN in 'Category' with 'Uncategorized':
ProductID Category Sales Quantity
0 101 Electronics 200 2
1 102 Books 25 1
2 103 Uncategorized 50 3
3 104 Electronics 300 3
4 105 Home Goods 150 5
5 106 Uncategorized 40 2
6 107 Books 30 1
7 108 Home Goods 120 4
GroupBy 'Category' after fillna('Uncategorized') - mean Sales:
Category
Books 27.5
Electronics 250.0
Home Goods 135.0
Uncategorized 45.0
Name: Sales, dtype: float64
- Choose a
placeholder
string that makes sense for your data (e.g., "Unknown", "Missing", "N/A"). - This method modifies the data before grouping. If you need to preserve the original
NaN
s elsewhere, work on a copy of the DataFrame or the specific column.
Method 3: Converting Grouping Column to String using astype(str)
Another way to ensure NaN
values are treated as a distinct group is to convert the entire grouping column to the string data type using astype(str)
. NaN
values will be converted to the literal string "nan"
.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ProductID': [101, 102, 103, 104, 105, 106, 107, 108],
'Category': ['Electronics', 'Books', np.nan, 'Electronics', 'Home Goods', np.nan, 'Books', 'Home Goods'],
'Sales': [200, 25, 50, 300, 150, 40, 30, 120],
'Quantity': [2, 1, 3, 3, 5, 2, 1, 4]
})
df_str_type = df.copy()
# ✅ Convert the 'Category' column to string type
df_str_type['Category'] = df_str_type['Category'].astype(str)
print("'Category' column after astype(str):")
print(df_str_type['Category'])
print()
# Groupby will now treat "nan" as a regular string category
grouped_str_type = df_str_type.groupby('Category')['Quantity'].sum() # Example: sum quantity
print("GroupBy 'Category' after astype(str) - sum Quantity:")
print(grouped_str_type)
Output:
'Category' column after astype(str):
0 Electronics
1 Books
2 nan
3 Electronics
4 Home Goods
5 nan
6 Books
7 Home Goods
Name: Category, dtype: object
GroupBy 'Category' after astype(str) - sum Quantity:
Category
Books 2
Electronics 5
Home Goods 9
nan 5
Name: Quantity, dtype: int64
This method also changes the data. The original NaN
values are now string literals "nan"
.
Choosing the Right Method
groupby('col', dropna=False)
: Most recommended if you want to includeNaN
s as a distinct group in your aggregation without altering the original data and are happy with theNaN
label in the result. It's clean and direct.fillna('placeholder').groupby('col')
: Use this if you need a specific, human-readable label (other thanNaN
) for the group of missing values in yourgroupby
output. This modifies the data before grouping.astype(str).groupby('col')
: Use this if you want to treat all values in the grouping column, includingNaN
s (which become"nan"
strings), uniformly as strings during grouping. This also modifies the data.
Conclusion
Pandas groupby()
by default excludes NaN
values from the grouping keys. To include them:
- Set the
dropna=False
parameter in yourgroupby()
call:df.groupby('column_with_nan', dropna=False)
. This is the most direct way to treatNaN
s as a separate group. - Alternatively, preprocess the
NaN
values in your grouping column before callinggroupby()
:- Use
df['col'].fillna('SomePlaceholder')
to replaceNaN
s with a specific string label. - Use
df['col'].astype(str)
to convert all values (includingNaN
s, which become the string"nan"
) to strings.
- Use
Choosing the best approach depends on whether you want to modify your original data and how you want the NaN
group to be represented in the output of your aggregation. The dropna=False
option is often the cleanest for including NaN
s without data modification.