Python Pandas: How to Calculate the Percentage of Missing Values per Column
Identifying missing data (often represented as NaN
or None
) is a critical first step in any data cleaning and preparation workflow. Understanding the extent of missingness, especially as a percentage per column, helps in deciding appropriate imputation strategies or whether a column/row should be dropped.
This guide will provide you with clear, step-by-step methods to calculate the percentage of missing values for each column in a Pandas DataFrame. You'll learn two primary approaches using isnull().sum()
and isnull().mean()
, and how to present these results in a new, sorted DataFrame for easy interpretation.
The Importance of Quantifying Missing Data
Before diving into analysis or model building, it's essential to know how much data is missing and where. High percentages of missing values in certain columns might:
- Indicate data collection issues.
- Reduce the statistical power of analyses.
- Bias results if not handled appropriately.
- Require specific imputation techniques or lead to column removal. Calculating percentages gives a standardized measure of missingness across columns of different lengths or types.
Let's use a sample DataFrame with missing values:
import pandas as pd
import numpy as np # For np.nan
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
product_id category price in_stock
0 101 Electronics 199.99 True
1 102 NaN 59.50 False
2 103 Books NaN True
3 104 Electronics 249.00 True
4 105 NaN 19.99 NaN
5 106 NaN NaN False
Method 1: Using isnull().sum()
for Percentage Calculation
This is a very common and intuitive approach.
Detecting Missing Values with isnull()
The DataFrame.isnull()
(or its alias DataFrame.isna()
) method returns a DataFrame of the same shape, with True
where values are missing (NaN
, None
, NaT
) and False
otherwise.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})
print("Output of df.isnull():")
print(df.isnull())
Output:
Output of df.isnull():
product_id category price in_stock
0 False False False False
1 False True False False
2 False False True False
3 False False False False
4 False True False True
5 False True True False
Calculating the Percentage
When you call .sum()
on this boolean DataFrame, True
values are treated as 1
and False
as 0
. This gives the count of missing values per column. To get the percentage:
percentage = (count_missing / total_rows) * 100
import pandas as pd
import numpy as np
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})
# Calculate count of missing values per column
missing_counts = df.isnull().sum()
print("Missing value counts per column:")
print(missing_counts)
print()
# Calculate percentage
total_rows = len(df)
percent_missing_sum_method = (missing_counts / total_rows) * 100
print("Percentage of missing values (using .sum()):")
print(percent_missing_sum_method)
Output:
Missing value counts per column:
product_id 0
category 3
price 2
in_stock 1
dtype: int64
Percentage of missing values (using .sum()):
product_id 0.000000
category 50.000000
price 33.333333
in_stock 16.666667
dtype: float64
Presenting Results in a New DataFrame
It's often useful to see these percentages in a structured DataFrame.
import pandas as pd
import numpy as np
# df and percent_missing_sum_method defined as above
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})
missing_counts = df.isnull().sum()
total_rows = len(df)
percent_missing_sum_method = (missing_counts / total_rows) * 100
missing_summary_df_sum = pd.DataFrame({
'column_name': df.columns,
'percent_missing': percent_missing_sum_method
})
# Resetting index if you want column_name as a regular column,
# otherwise the original column names will be the index.
# For this display, having original column names as index is fine.
print("Missing value percentages in a DataFrame (using .sum()):")
print(missing_summary_df_sum)
print()
# Alternative way to create the DataFrame directly from the Series:
missing_summary_df_sum_alt = pd.DataFrame(percent_missing_sum_method, columns=['percent_missing'])
missing_summary_df_sum_alt.index.name = 'column_name' # Set index name
missing_summary_df_sum_alt.reset_index(inplace=True) # Make column_name a regular column
print("Alternative DataFrame creation:")
print(missing_summary_df_sum_alt)
Output:
Missing value percentages in a DataFrame (using .sum()):
column_name percent_missing
product_id product_id 0.000000
category category 50.000000
price price 33.333333
in_stock in_stock 16.666667
Alternative DataFrame creation:
column_name percent_missing
0 product_id 0.000000
1 category 50.000000
2 price 33.333333
3 in_stock 16.666667
Sorting the Results by Percentage Missing
Use DataFrame.sort_values()
to easily identify columns with the most missing data.
import pandas as pd
import numpy as np
# --- Defined as before ---
# df and percent_missing_sum_method defined as above
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})
missing_counts = df.isnull().sum()
total_rows = len(df)
percent_missing_sum_method = (missing_counts / total_rows) * 100
missing_summary_df_sum = pd.DataFrame({
'column_name': df.columns,
'percent_missing': percent_missing_sum_method
})
missing_summary_df_sum_alt = pd.DataFrame(percent_missing_sum_method, columns=['percent_missing'])
missing_summary_df_sum_alt.index.name = 'column_name' # Set index name
missing_summary_df_sum_alt.reset_index(inplace=True) # Make column_name a regular column
print("DataFrame creation:")
print(missing_summary_df_sum_alt)
print()
# --- --- ---
# Sort by 'percent_missing' in descending order
sorted_missing_sum = missing_summary_df_sum_alt.sort_values(by='percent_missing', ascending=False)
print("Sorted missing value percentages (using .sum()):")
print(sorted_missing_sum)
Output:
DataFrame creation:
column_name percent_missing
0 product_id 0.000000
1 category 50.000000
2 price 33.333333
3 in_stock 16.666667
Sorted missing value percentages (using .sum()):
column_name percent_missing
1 category 50.000000
2 price 33.333333
3 in_stock 16.666667
0 product_id 0.000000
Method 2: Using isnull().mean()
for a More Direct Percentage
This method is more concise. When .mean()
is applied to a boolean Series (or DataFrame), True
is treated as 1
and False
as 0
. The mean of these 0
s and 1
s directly gives the proportion of True
values (i.e., the proportion of missing values).
How mean()
on a Boolean Series Calculates Proportion**
For a column, df['col'].isnull().mean()
is equivalent to df['col'].isnull().sum() / len(df['col'])
.
Calculating and Formatting the Percentage**
import pandas as pd
import numpy as np
# df defined as before
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})
# Calculate proportion of missing values and multiply by 100
percent_missing_mean_method = df.isnull().mean() * 100
print("Percentage of missing values (using .mean()):")
print(percent_missing_mean_method)
print()
# You can also chain .round() for better formatting and .mul() for multiplication
percent_missing_formatted = df.isnull().mean().round(4).mul(100)
print("Formatted percentage (using .mean(), .round(), .mul()):")
print(percent_missing_formatted)
Output:
Percentage of missing values (using .mean()):
product_id 0.000000
category 50.000000
price 33.333333
in_stock 16.666667
dtype: float64
Formatted percentage (using .mean(), .round(), .mul()):
product_id 0.00
category 50.00
price 33.33
in_stock 16.67
dtype: float64
Presenting in a DataFrame and Sorting
This follows the same logic as with the .sum()
method.
import pandas as pd
import numpy as np
# df and percent_missing_formatted defined as above
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})
# Calculate proportion of missing values and multiply by 100
percent_missing_mean_method = df.isnull().mean() * 100
percent_missing_formatted = df.isnull().mean().round(4).mul(100)
missing_summary_df_mean = pd.DataFrame(
percent_missing_formatted,
columns=['percent_missing']
).sort_values(by='percent_missing', ascending=False)
missing_summary_df_mean.index.name = 'column_name' # Optional: name the index
# To make 'column_name' a regular column instead of index:
# missing_summary_df_mean.reset_index(inplace=True)
print("Sorted missing value percentages in DataFrame (using .mean()):")
print(missing_summary_df_mean)
Output:
Sorted missing value percentages in DataFrame (using .mean()):
percent_missing
column_name
category 50.00
price 33.33
in_stock 16.67
product_id 0.00
Choosing Your Preferred Method
isnull().sum() * 100 / len(df)
(Method 1): This approach is often more explicit and easier to understand for beginners as it follows the direct formula for percentage calculation (count of missing / total count * 100).isnull().mean() * 100
(Method 2): This is more concise and idiomatic in Pandas once you understand that.mean()
on a boolean Series gives the proportion ofTrue
s. It's often slightly more performant for very large DataFrames.
Both methods achieve the same correct result. Choose the one that you find more readable and maintainable.
Conclusion
Calculating the percentage of missing values per column is a fundamental step in data quality assessment with Pandas. Both the df.isnull().sum() * 100 / len(df)
and the more direct df.isnull().mean() * 100
approaches effectively provide this information. Presenting these percentages in a sorted DataFrame helps prioritize columns for data cleaning or imputation strategies, leading to more robust and reliable downstream analyses.