Python Pandas: How to Calculate the Percentage of Missing Values per Column

Identifying missing data (often represented as NaN or None) is a critical first step in any data cleaning and preparation workflow. Understanding the extent of missingness, especially as a percentage per column, helps in deciding appropriate imputation strategies or whether a column/row should be dropped.

This guide will provide you with clear, step-by-step methods to calculate the percentage of missing values for each column in a Pandas DataFrame. You'll learn two primary approaches using isnull().sum() and isnull().mean(), and how to present these results in a new, sorted DataFrame for easy interpretation.

The Importance of Quantifying Missing Data

Before diving into analysis or model building, it's essential to know how much data is missing and where. High percentages of missing values in certain columns might:

Indicate data collection issues.
Reduce the statistical power of analyses.
Bias results if not handled appropriately.
Require specific imputation techniques or lead to column removal. Calculating percentages gives a standardized measure of missingness across columns of different lengths or types.

Let's use a sample DataFrame with missing values:

import pandas as pd
import numpy as np  # For np.nan

df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105, 106],
    'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
    'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
    'in_stock': [True, False, True, True, np.nan, False]
})

print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
   product_id     category   price in_stock
       101  Electronics  199.99     True
       102          NaN   59.50    False
       103        Books     NaN     True
       104  Electronics  249.00     True
       105          NaN   19.99      NaN
       106          NaN     NaN    False

Method 1: Using `isnull().sum()` for Percentage Calculation

This is a very common and intuitive approach.

Detecting Missing Values with `isnull()`

The DataFrame.isnull() (or its alias DataFrame.isna()) method returns a DataFrame of the same shape, with True where values are missing (NaN, None, NaT) and False otherwise.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105, 106],
    'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
    'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
    'in_stock': [True, False, True, True, np.nan, False]
})

print("Output of df.isnull():")
print(df.isnull())

Output:

Output of df.isnull():
   product_id  category  price  in_stock
     False     False  False     False
     False      True  False     False
     False     False   True     False
     False     False  False     False
     False      True  False      True
     False      True   True     False

Calculating the Percentage

When you call .sum() on this boolean DataFrame, True values are treated as 1 and False as 0. This gives the count of missing values per column. To get the percentage: percentage = (count_missing / total_rows) * 100

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105, 106],
    'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
    'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
    'in_stock': [True, False, True, True, np.nan, False]
})

# Calculate count of missing values per column
missing_counts = df.isnull().sum()
print("Missing value counts per column:")
print(missing_counts)
print()

# Calculate percentage
total_rows = len(df)
percent_missing_sum_method = (missing_counts / total_rows) * 100
print("Percentage of missing values (using .sum()):")
print(percent_missing_sum_method)

Output:

Missing value counts per column:
product_id    0
category      3
price         2
in_stock      1
dtype: int64

Percentage of missing values (using .sum()):
product_id     0.000000
category      50.000000
price         33.333333
in_stock      16.666667
dtype: float64

Presenting Results in a New DataFrame

It's often useful to see these percentages in a structured DataFrame.

import pandas as pd
import numpy as np

# df and percent_missing_sum_method defined as above
df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105, 106],
    'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
    'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
    'in_stock': [True, False, True, True, np.nan, False]
})

missing_counts = df.isnull().sum()
total_rows = len(df)
percent_missing_sum_method = (missing_counts / total_rows) * 100


missing_summary_df_sum = pd.DataFrame({
    'column_name': df.columns,
    'percent_missing': percent_missing_sum_method
})
# Resetting index if you want column_name as a regular column,
# otherwise the original column names will be the index.
# For this display, having original column names as index is fine.

print("Missing value percentages in a DataFrame (using .sum()):")
print(missing_summary_df_sum)
print()

# Alternative way to create the DataFrame directly from the Series:
missing_summary_df_sum_alt = pd.DataFrame(percent_missing_sum_method, columns=['percent_missing'])
missing_summary_df_sum_alt.index.name = 'column_name'   # Set index name
missing_summary_df_sum_alt.reset_index(inplace=True)    # Make column_name a regular column
print("Alternative DataFrame creation:")
print(missing_summary_df_sum_alt)

Output:

Missing value percentages in a DataFrame (using .sum()):
           column_name  percent_missing
product_id  product_id         0.000000
category      category        50.000000
price            price        33.333333
in_stock      in_stock        16.666667

Alternative DataFrame creation:
  column_name  percent_missing
0  product_id         0.000000
1    category        50.000000
2       price        33.333333
3    in_stock        16.666667

Sorting the Results by Percentage Missing

Use DataFrame.sort_values() to easily identify columns with the most missing data.

import pandas as pd
import numpy as np

# --- Defined as before ---
# df and percent_missing_sum_method defined as above
df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105, 106],
    'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
    'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
    'in_stock': [True, False, True, True, np.nan, False]
})

missing_counts = df.isnull().sum()
total_rows = len(df)
percent_missing_sum_method = (missing_counts / total_rows) * 100


missing_summary_df_sum = pd.DataFrame({
    'column_name': df.columns,
    'percent_missing': percent_missing_sum_method
})
missing_summary_df_sum_alt = pd.DataFrame(percent_missing_sum_method, columns=['percent_missing'])
missing_summary_df_sum_alt.index.name = 'column_name'   # Set index name
missing_summary_df_sum_alt.reset_index(inplace=True)    # Make column_name a regular column
print("DataFrame creation:")
print(missing_summary_df_sum_alt)
print()
# --- --- ---

# Sort by 'percent_missing' in descending order
sorted_missing_sum = missing_summary_df_sum_alt.sort_values(by='percent_missing', ascending=False)

print("Sorted missing value percentages (using .sum()):")
print(sorted_missing_sum)

Output:

DataFrame creation:
  column_name  percent_missing
0  product_id         0.000000
1    category        50.000000
2       price        33.333333
3    in_stock        16.666667

Sorted missing value percentages (using .sum()):
  column_name  percent_missing
1    category        50.000000
2       price        33.333333
3    in_stock        16.666667
0  product_id         0.000000

Method 2: Using `isnull().mean()` for a More Direct Percentage

This method is more concise. When .mean() is applied to a boolean Series (or DataFrame), True is treated as 1 and False as 0. The mean of these 0s and 1s directly gives the proportion of True values (i.e., the proportion of missing values).

How `mean()` on a Boolean Series Calculates Proportion**

For a column, df['col'].isnull().mean() is equivalent to df['col'].isnull().sum() / len(df['col']).

Calculating and Formatting the Percentage**

import pandas as pd
import numpy as np

# df defined as before
df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105, 106],
    'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
    'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
    'in_stock': [True, False, True, True, np.nan, False]
})

# Calculate proportion of missing values and multiply by 100
percent_missing_mean_method = df.isnull().mean() * 100

print("Percentage of missing values (using .mean()):")
print(percent_missing_mean_method)
print()

# You can also chain .round() for better formatting and .mul() for multiplication
percent_missing_formatted = df.isnull().mean().round(4).mul(100)
print("Formatted percentage (using .mean(), .round(), .mul()):")
print(percent_missing_formatted)

Output:

Percentage of missing values (using .mean()):
product_id     0.000000
category      50.000000
price         33.333333
in_stock      16.666667
dtype: float64

Formatted percentage (using .mean(), .round(), .mul()):
product_id     0.00
category      50.00
price         33.33
in_stock      16.67
dtype: float64

Presenting in a DataFrame and Sorting

This follows the same logic as with the .sum() method.

import pandas as pd
import numpy as np

# df and percent_missing_formatted defined as above
df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105, 106],
    'category': ['Electronics', np.nan, 'Books', 'Electronics', np.nan, np.nan],
    'price': [199.99, 59.50, np.nan, 249.00, 19.99, np.nan],
    'in_stock': [True, False, True, True, np.nan, False]
})

# Calculate proportion of missing values and multiply by 100
percent_missing_mean_method = df.isnull().mean() * 100
percent_missing_formatted = df.isnull().mean().round(4).mul(100)


missing_summary_df_mean = pd.DataFrame(
    percent_missing_formatted,
    columns=['percent_missing']
).sort_values(by='percent_missing', ascending=False)

missing_summary_df_mean.index.name = 'column_name' # Optional: name the index
# To make 'column_name' a regular column instead of index:
# missing_summary_df_mean.reset_index(inplace=True)

print("Sorted missing value percentages in DataFrame (using .mean()):")
print(missing_summary_df_mean)

Output:

Sorted missing value percentages in DataFrame (using .mean()):
             percent_missing
column_name                 
category               50.00
price                  33.33
in_stock               16.67
product_id              0.00

Choosing Your Preferred Method

isnull().sum() * 100 / len(df) (Method 1): This approach is often more explicit and easier to understand for beginners as it follows the direct formula for percentage calculation (count of missing / total count * 100).
isnull().mean() * 100 (Method 2): This is more concise and idiomatic in Pandas once you understand that .mean() on a boolean Series gives the proportion of Trues. It's often slightly more performant for very large DataFrames.

Both methods achieve the same correct result. Choose the one that you find more readable and maintainable.

Conclusion

Calculating the percentage of missing values per column is a fundamental step in data quality assessment with Pandas. Both the df.isnull().sum() * 100 / len(df) and the more direct df.isnull().mean() * 100 approaches effectively provide this information. Presenting these percentages in a sorted DataFrame helps prioritize columns for data cleaning or imputation strategies, leading to more robust and reliable downstream analyses.

The Importance of Quantifying Missing Data​

Method 1: Using isnull().sum() for Percentage Calculation​

Detecting Missing Values with isnull()​

Calculating the Percentage​

Presenting Results in a New DataFrame​

Sorting the Results by Percentage Missing​

Method 2: Using isnull().mean() for a More Direct Percentage​

How mean() on a Boolean Series Calculates Proportion**​

Calculating and Formatting the Percentage**​

Presenting in a DataFrame and Sorting​

Choosing Your Preferred Method​

Conclusion​

Table of Contents

The Importance of Quantifying Missing Data

Method 1: Using `isnull().sum()` for Percentage Calculation

Detecting Missing Values with `isnull()`

Calculating the Percentage

Presenting Results in a New DataFrame

Sorting the Results by Percentage Missing

Method 2: Using `isnull().mean()` for a More Direct Percentage

How `mean()` on a Boolean Series Calculates Proportion**

Calculating and Formatting the Percentage**

Presenting in a DataFrame and Sorting

Choosing Your Preferred Method

Conclusion