Python Pandas: How to Get DataFrame Memory Usage
Understanding the memory footprint of your Pandas DataFrames is crucial, especially when working with large datasets, as it can impact performance and resource allocation. Pandas provides several methods to inspect the memory usage of a DataFrame, both for individual columns and for the entire object.
This guide explains how to use DataFrame.memory_usage(), sys.getsizeof(), and DataFrame.info() to accurately determine the memory size of your DataFrames.
Why Check DataFrame Memory Usage?
- Performance Optimization: Large DataFrames can slow down operations. Knowing memory usage helps identify bottlenecks.
- Resource Management: Essential when working in memory-constrained environments (e.g., cloud functions, smaller machines).
- Data Type Selection: Understanding memory impact can guide choices for more memory-efficient data types (e.g., int32 vs. int64, category vs. object).
- Debugging: Unexpectedly high memory usage can indicate issues like data duplication or inefficient data structures.
Example DataFrame
import pandas as pd
import numpy as np # For np.nan
import sys # For sys.getsizeof
data = {
'EmployeeID': [101, 102, 103, 104, 105],
'FullName': ['Alice Wonderland', 'Robert "Bob" Johnson', 'Charles Xavier', 'Diana Prince', 'Edward Nygma'],
'Department': ['HR', 'Engineering', 'Management', 'HR', 'Research'],
'StartDate': pd.to_datetime(['2020-01-15', '2019-03-01', '2021-06-10', np.nan, '2020-08-20']),
'Salary': [60000, 85000, 120000, 62000, 75000],
'IsFullTime': [True, True, False, True, True]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print()
print("Original dtypes:")
print(df.dtypes)
Output:
Original DataFrame:
EmployeeID FullName Department StartDate Salary IsFullTime
0 101 Alice Wonderland HR 2020-01-15 60000 True
1 102 Robert "Bob" Johnson Engineering 2019-03-01 85000 True
2 103 Charles Xavier Management 2021-06-10 120000 False
3 104 Diana Prince HR NaT 62000 True
4 105 Edward Nygma Research 2020-08-20 75000 True
Original dtypes:
EmployeeID int64
FullName object
Department object
StartDate datetime64[ns]
Salary int64
IsFullTime bool
dtype: object
Method 1: DataFrame.memory_usage() (Detailed and Recommended)
The DataFrame.memory_usage(index=True, deep=False) method returns a Pandas Series whose index holds the column names (and, optionally, the DataFrame's own index) and whose values are the memory usage of each component in bytes.
Getting Memory Usage per Column
import pandas as pd
df_example = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'FullName': ['Alice Wonderland', 'Robert "Bob" Johnson', 'Charles Xavier', 'Diana Prince', 'Edward Nygma'],
'Department': ['HR', 'Engineering', 'Management', 'HR', 'Research'],
'Salary': [60000, 85000, 120000, 62000, 75000],
})
# Get memory usage for each column, including the index by default
memory_per_column_with_index = df_example.memory_usage() # index=True by default
print("Memory usage per column (including index, deep=False by default):")
print(memory_per_column_with_index)
Output:
Memory usage per column (including index, deep=False by default):
Index 72
EmployeeID 40
FullName 20
Department 20
Salary 40
dtype: int64
By default (deep=False), columns with dtype='object' (such as strings) only report the memory taken by the pointers to the string objects, not the actual memory consumed by the string data itself.
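To see why the shallow figure misleads for strings, compare two object columns whose strings differ wildly in length; this is a minimal sketch, and the exact byte counts will vary by platform and Python version.
import pandas as pd
short_strings = pd.Series(['a', 'b', 'c'])  # 1-character strings
long_strings = pd.Series(['x' * 1000] * 3)  # 1000-character strings
# Shallow mode counts only the array of pointers, so both report the same size
print(short_strings.memory_usage(index=False))  # e.g. 24
print(long_strings.memory_usage(index=False))   # e.g. 24
# Deep mode follows the pointers to the string objects themselves
print(short_strings.memory_usage(index=False, deep=True))  # e.g. ~174
print(long_strings.memory_usage(index=False, deep=True))   # e.g. ~3171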
Including/Excluding Index Memory
The index parameter controls whether the memory usage of the DataFrame's index is included.
import pandas as pd
df_example = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'FullName': ['Alice Wonderland', 'Robert "Bob" Johnson', 'Charles Xavier', 'Diana Prince', 'Edward Nygma'],
'Department': ['HR', 'Engineering', 'Management', 'HR', 'Research'],
'Salary': [60000, 85000, 120000, 62000, 75000],
})
# Exclude index memory
memory_per_column_no_index = df_example.memory_usage(index=False)
print("Memory usage per column (excluding index):")
print(memory_per_column_no_index)
Output:
Memory usage per column (excluding index):
EmployeeID 40
FullName 20
Department 20
Salary 40
dtype: int64
Getting Total DataFrame Memory (.sum())
To get the total memory usage of the DataFrame, call .sum() on the Series returned by memory_usage().
import pandas as pd
df_example = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'FullName': ['Alice Wonderland', 'Robert "Bob" Johnson', 'Charles Xavier', 'Diana Prince', 'Edward Nygma'],
'Department': ['HR', 'Engineering', 'Management', 'HR', 'Research'],
'Salary': [60000, 85000, 120000, 62000, 75000],
})
# Total memory including index, but shallow for object columns
total_memory_shallow = df_example.memory_usage(index=True).sum()
print(f"Total memory (shallow, including index): {total_memory_shallow} bytes")
Output:
Total memory (shallow, including index): 192 bytes
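Because memory_usage() reports raw bytes, large totals are easier to read after converting to binary units. A small helper like the hypothetical format_bytes below (not part of pandas) does the conversion:
def format_bytes(num_bytes):
    # Step through binary units until the value drops below 1024
    for unit in ['bytes', 'KB', 'MB', 'GB']:
        if num_bytes < 1024:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.1f} TB"
print(format_bytes(192))        # 192.0 bytes
print(format_bytes(5_250_000))  # 5.0 MB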
Accurate Memory for Object Dtypes (deep=True)
To get a more accurate memory count that includes the actual memory used by the Python objects within object-dtype columns (like strings), set deep=True.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'FullName': ['Alice Wonderland', 'Robert "Bob" Johnson', 'Charles Xavier', 'Diana Prince', 'Edward Nygma'],
'Department': ['HR', 'Engineering', 'Management', 'HR', 'Research'],
'StartDate': pd.to_datetime(['2020-01-15', '2019-03-01', '2021-06-10', np.nan, '2020-08-20']),
'Salary': [60000, 85000, 120000, 62000, 75000],
'IsFullTime': [True, True, False, True, True]
})
# Memory usage per column with deep inspection
memory_deep_per_column = df.memory_usage(deep=True)
print("Memory usage per column (deep=True):")
print(memory_deep_per_column)
print()
# Total memory with deep inspection
total_memory_deep = df.memory_usage(index=True, deep=True).sum()
print(f"Total memory (deep=True, including index): {total_memory_deep} bytes")
Output:
Memory usage per column (deep=True):
Index 72
EmployeeID 40
FullName 199
Department 158
StartDate 40
Salary 40
IsFullTime 5
dtype: int64
Total memory (deep=True, including index): 554 bytes
Using deep=True provides a much more realistic estimate of memory usage when your DataFrame contains string columns or other Python objects.
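One convenient way to compare the two modes column by column is to put both Series side by side in a small summary frame (a sketch using the df defined above):
comparison = pd.DataFrame({
    'shallow': df.memory_usage(deep=False),
    'deep': df.memory_usage(deep=True),
})
comparison['difference'] = comparison['deep'] - comparison['shallow']
print(comparison)
# Only the object columns (FullName, Department) show a large difference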
Method 2: DataFrame.info() (Quick Summary)
The DataFrame.info() method prints a concise summary of the DataFrame, including the data types, non-null counts, and memory usage.
Basic Memory Information
By default, info() provides a shallow memory estimate; a + after the figure (as in the 237.0+ bytes below) signals that object columns were not deeply measured, so actual usage may be higher.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'FullName': ['Alice Wonderland', 'Robert "Bob" Johnson', 'Charles Xavier', 'Diana Prince', 'Edward Nygma'],
'Department': ['HR', 'Engineering', 'Management', 'HR', 'Research'],
'StartDate': pd.to_datetime(['2020-01-15', '2019-03-01', '2021-06-10', np.nan, '2020-08-20']),
'Salary': [60000, 85000, 120000, 62000, 75000],
'IsFullTime': [True, True, False, True, True]
})
print("DataFrame.info() (default memory_usage):")
df.info()
Output:
DataFrame.info() (default memory_usage):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 EmployeeID 5 non-null int64
1 FullName 5 non-null object
2 Department 5 non-null object
3 StartDate 4 non-null datetime64[ns]
4 Salary 5 non-null int64
5 IsFullTime 5 non-null bool
dtypes: bool(1), datetime64[ns](1), int64(2), object(2)
memory usage: 237.0+ bytes
Deep Memory Calculation with info()
You can get the deep memory calculation by passing memory_usage='deep' to info().
import pandas as pd
import numpy as np
df = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'FullName': ['Alice Wonderland', 'Robert "Bob" Johnson', 'Charles Xavier', 'Diana Prince', 'Edward Nygma'],
'Department': ['HR', 'Engineering', 'Management', 'HR', 'Research'],
'StartDate': pd.to_datetime(['2020-01-15', '2019-03-01', '2021-06-10', np.nan, '2020-08-20']),
'Salary': [60000, 85000, 120000, 62000, 75000],
'IsFullTime': [True, True, False, True, True]
})
print("DataFrame.info(memory_usage='deep'):")
df.info(memory_usage='deep')
Output:
DataFrame.info(memory_usage='deep'):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 EmployeeID 5 non-null int64
1 FullName 5 non-null object
2 Department 5 non-null object
3 StartDate 4 non-null datetime64[ns]
4 Salary 5 non-null int64
5 IsFullTime 5 non-null bool
dtypes: bool(1), datetime64[ns](1), int64(2), object(2)
memory usage: 554.0 bytes
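info() writes to stdout by default. To capture the summary as a string instead (for logging, for example), pass a text buffer via the buf parameter:
import io
buffer = io.StringIO()
df.info(memory_usage='deep', buf=buffer)  # summary goes into the buffer, not stdout
summary_text = buffer.getvalue()
print(summary_text)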
Method 3: sys.getsizeof() (Whole-Object Size, Less Detail)
Python's built-in sys.getsizeof() reports the size an object claims via its __sizeof__ method, plus a small fixed garbage-collector overhead. Pandas implements __sizeof__ for DataFrames in terms of memory_usage(deep=True), so for a DataFrame the result is typically just a few bytes larger than df.memory_usage(deep=True).sum(). It is still generally less informative than Pandas' own methods, because it returns a single number with no per-column breakdown.
import sys
import pandas as pd
import numpy as np  # needed for np.nan below
df = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'FullName': ['Alice Wonderland', 'Robert "Bob" Johnson', 'Charles Xavier', 'Diana Prince', 'Edward Nygma'],
'Department': ['HR', 'Engineering', 'Management', 'HR', 'Research'],
'StartDate': pd.to_datetime(['2020-01-15', '2019-03-01', '2021-06-10', np.nan, '2020-08-20']),
'Salary': [60000, 85000, 120000, 62000, 75000],
'IsFullTime': [True, True, False, True, True]
})
size_sys = sys.getsizeof(df)
size_pandas_shallow = df.memory_usage(index=True, deep=False).sum()
size_pandas_deep = df.memory_usage(index=True, deep=True).sum()
print(f"sys.getsizeof(df): {size_sys} bytes")
print(f"df.memory_usage(deep=False).sum(): {size_pandas_shallow} bytes")
print(f"df.memory_usage(deep=True).sum(): {size_pandas_deep} bytes")
Output:
sys.getsizeof(df): 570 bytes
df.memory_usage(deep=False).sum(): 237 bytes
df.memory_usage(deep=True).sum(): 554 bytes
sys.getsizeof() is generally not the preferred method for detailed DataFrame memory analysis. Note that its result is slightly larger than the pandas deep=True total, not the deep=False one: it includes the interpreter's per-object overhead on top of the data arrays and their contents.
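To confirm the relationship, subtract the deep pandas total from the sys.getsizeof() result; the leftover is the small fixed interpreter overhead (16 bytes in the run above).
overhead = sys.getsizeof(df) - df.memory_usage(index=True, deep=True).sum()
print(f"Interpreter/GC overhead: {overhead} bytes")  # 570 - 554 = 16 in the run above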
Interpreting Memory Usage Values
- Values are in bytes. Divide by 1024 for kilobytes (KB), by 1024² for megabytes (MB), and so on (the format_bytes helper sketched earlier automates this).
- For object dtype columns, deep=False is a significant underestimate; deep=True is more accurate.
- Memory usage depends heavily on data types. Optimizing dtypes (e.g., category for repetitive strings, or int32 instead of int64 when values fit) can drastically reduce memory, as the sketch below illustrates.
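As a rough illustration of the last point, converting a repetitive string column to category usually shrinks it dramatically; this sketch uses made-up data, so your exact savings will differ.
import pandas as pd
# A long column with only four distinct string values
departments = pd.Series(['HR', 'Engineering', 'HR', 'Research'] * 250_000)
as_object = departments.memory_usage(deep=True)
as_category = departments.astype('category').memory_usage(deep=True)
print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
print(f"savings:  {1 - as_category / as_object:.0%}")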
Conclusion
To effectively determine the memory size of a Pandas DataFrame:
- Use DataFrame.memory_usage(deep=True).sum() for the most accurate total memory footprint, especially with string/object columns.
- Use DataFrame.memory_usage(deep=True) (without .sum()) to see the detailed memory usage per column (and index).
- Use DataFrame.info(memory_usage='deep') for a quick, human-readable summary that includes the deep memory calculation.
sys.getsizeof() returns roughly the same deep total but offers no per-column detail, so it is generally less useful for memory analysis.
By monitoring memory usage, you can make informed decisions about data storage, processing strategies, and data type optimization in your Pandas workflows.