Skip to main content

Python Pandas: How to Calculate the Percentage of Missing Values per Column

Identifying missing data (often represented as NaN or None) is a critical first step in any data cleaning and preparation workflow. Understanding the extent of missingness, especially as a percentage per column, helps in deciding appropriate imputation strategies or whether a column/row should be dropped.

This guide will provide you with clear, step-by-step methods to calculate the percentage of missing values for each column in a Pandas DataFrame. You'll learn two primary approaches using isnull().sum() and isnull().mean(), and how to present these results in a new, sorted DataFrame for easy interpretation.

The Importance of Quantifying Missing Data

Before diving into analysis or model building, it's essential to know how much data is missing and where. High percentages of missing values in certain columns might:

  • Indicate data collection issues.
  • Reduce the statistical power of analyses.
  • Bias results if not handled appropriately.
  • Require specific imputation techniques or lead to column removal. Calculating percentages gives a standardized measure of missingness across columns of different lengths or types.

Let's use a sample DataFrame with missing values:

import pandas as pd
import numpy as np # For np.nan

df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})

print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
product_id category price in_stock
0 101 Electronics 199.99 True
1 102 NaN 59.50 False
2 103 Books NaN True
3 104 Electronics 249.00 True
4 105 NaN 19.99 NaN
5 106 NaN NaN False

Method 1: Using isnull().sum() for Percentage Calculation

This is a very common and intuitive approach.

Detecting Missing Values with isnull()

The DataFrame.isnull() (or its alias DataFrame.isna()) method returns a DataFrame of the same shape, with True where values are missing (NaN, None, NaT) and False otherwise.

import pandas as pd
import numpy as np

df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})

print("Output of df.isnull():")
print(df.isnull())

Output:

Output of df.isnull():
product_id category price in_stock
0 False False False False
1 False True False False
2 False False True False
3 False False False False
4 False True False True
5 False True True False

Calculating the Percentage

When you call .sum() on this boolean DataFrame, True values are treated as 1 and False as 0. This gives the count of missing values per column. To get the percentage: percentage = (count_missing / total_rows) * 100

import pandas as pd
import numpy as np

df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})

# Calculate count of missing values per column
missing_counts = df.isnull().sum()
print("Missing value counts per column:")
print(missing_counts)
print()

# Calculate percentage
total_rows = len(df)
percent_missing_sum_method = (missing_counts / total_rows) * 100
print("Percentage of missing values (using .sum()):")
print(percent_missing_sum_method)

Output:

Missing value counts per column:
product_id 0
category 3
price 2
in_stock 1
dtype: int64

Percentage of missing values (using .sum()):
product_id 0.000000
category 50.000000
price 33.333333
in_stock 16.666667
dtype: float64

Presenting Results in a New DataFrame

It's often useful to see these percentages in a structured DataFrame.

import pandas as pd
import numpy as np

# df and percent_missing_sum_method defined as above
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})

missing_counts = df.isnull().sum()
total_rows = len(df)
percent_missing_sum_method = (missing_counts / total_rows) * 100


missing_summary_df_sum = pd.DataFrame({
'column_name': df.columns,
'percent_missing': percent_missing_sum_method
})
# Resetting index if you want column_name as a regular column,
# otherwise the original column names will be the index.
# For this display, having original column names as index is fine.

print("Missing value percentages in a DataFrame (using .sum()):")
print(missing_summary_df_sum)
print()

# Alternative way to create the DataFrame directly from the Series:
missing_summary_df_sum_alt = pd.DataFrame(percent_missing_sum_method, columns=['percent_missing'])
missing_summary_df_sum_alt.index.name = 'column_name' # Set index name
missing_summary_df_sum_alt.reset_index(inplace=True) # Make column_name a regular column
print("Alternative DataFrame creation:")
print(missing_summary_df_sum_alt)

Output:

Missing value percentages in a DataFrame (using .sum()):
column_name percent_missing
product_id product_id 0.000000
category category 50.000000
price price 33.333333
in_stock in_stock 16.666667

Alternative DataFrame creation:
column_name percent_missing
0 product_id 0.000000
1 category 50.000000
2 price 33.333333
3 in_stock 16.666667

Sorting the Results by Percentage Missing

Use DataFrame.sort_values() to easily identify columns with the most missing data.

import pandas as pd
import numpy as np

# --- Defined as before ---
# df and percent_missing_sum_method defined as above
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})

missing_counts = df.isnull().sum()
total_rows = len(df)
percent_missing_sum_method = (missing_counts / total_rows) * 100


missing_summary_df_sum = pd.DataFrame({
'column_name': df.columns,
'percent_missing': percent_missing_sum_method
})
missing_summary_df_sum_alt = pd.DataFrame(percent_missing_sum_method, columns=['percent_missing'])
missing_summary_df_sum_alt.index.name = 'column_name' # Set index name
missing_summary_df_sum_alt.reset_index(inplace=True) # Make column_name a regular column
print("DataFrame creation:")
print(missing_summary_df_sum_alt)
print()
# --- --- ---

# Sort by 'percent_missing' in descending order
sorted_missing_sum = missing_summary_df_sum_alt.sort_values(by='percent_missing', ascending=False)

print("Sorted missing value percentages (using .sum()):")
print(sorted_missing_sum)

Output:

DataFrame creation:
column_name percent_missing
0 product_id 0.000000
1 category 50.000000
2 price 33.333333
3 in_stock 16.666667

Sorted missing value percentages (using .sum()):
column_name percent_missing
1 category 50.000000
2 price 33.333333
3 in_stock 16.666667
0 product_id 0.000000

Method 2: Using isnull().mean() for a More Direct Percentage

This method is more concise. When .mean() is applied to a boolean Series (or DataFrame), True is treated as 1 and False as 0. The mean of these 0s and 1s directly gives the proportion of True values (i.e., the proportion of missing values).

How mean() on a Boolean Series Calculates Proportion**

For a column, df['col'].isnull().mean() is equivalent to df['col'].isnull().sum() / len(df['col']).

Calculating and Formatting the Percentage**

import pandas as pd
import numpy as np

# df defined as before
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})

# Calculate proportion of missing values and multiply by 100
percent_missing_mean_method = df.isnull().mean() * 100

print("Percentage of missing values (using .mean()):")
print(percent_missing_mean_method)
print()

# You can also chain .round() for better formatting and .mul() for multiplication
percent_missing_formatted = df.isnull().mean().round(4).mul(100)
print("Formatted percentage (using .mean(), .round(), .mul()):")
print(percent_missing_formatted)

Output:

Percentage of missing values (using .mean()):
product_id 0.000000
category 50.000000
price 33.333333
in_stock 16.666667
dtype: float64

Formatted percentage (using .mean(), .round(), .mul()):
product_id 0.00
category 50.00
price 33.33
in_stock 16.67
dtype: float64

Presenting in a DataFrame and Sorting

This follows the same logic as with the .sum() method.

import pandas as pd
import numpy as np

# df and percent_missing_formatted defined as above
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105, 106],
'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
'in_stock': [True, False, True, True, np.nan, False]
})

# Calculate proportion of missing values and multiply by 100
percent_missing_mean_method = df.isnull().mean() * 100
percent_missing_formatted = df.isnull().mean().round(4).mul(100)


missing_summary_df_mean = pd.DataFrame(
percent_missing_formatted,
columns=['percent_missing']
).sort_values(by='percent_missing', ascending=False)

missing_summary_df_mean.index.name = 'column_name' # Optional: name the index
# To make 'column_name' a regular column instead of index:
# missing_summary_df_mean.reset_index(inplace=True)

print("Sorted missing value percentages in DataFrame (using .mean()):")
print(missing_summary_df_mean)

Output:

Sorted missing value percentages in DataFrame (using .mean()):
percent_missing
column_name
category 50.00
price 33.33
in_stock 16.67
product_id 0.00

Choosing Your Preferred Method

  • isnull().sum() * 100 / len(df) (Method 1): This approach is often more explicit and easier to understand for beginners as it follows the direct formula for percentage calculation (count of missing / total count * 100).
  • isnull().mean() * 100 (Method 2): This is more concise and idiomatic in Pandas once you understand that .mean() on a boolean Series gives the proportion of Trues. It's often slightly more performant for very large DataFrames.

Both methods achieve the same correct result. Choose the one that you find more readable and maintainable.

Conclusion

Calculating the percentage of missing values per column is a fundamental step in data quality assessment with Pandas. Both the df.isnull().sum() * 100 / len(df) and the more direct df.isnull().mean() * 100 approaches effectively provide this information. Presenting these percentages in a sorted DataFrame helps prioritize columns for data cleaning or imputation strategies, leading to more robust and reliable downstream analyses.