Python Pandas: How to Split a DataFrame into Chunks
When working with large Pandas DataFrames, it's often necessary to split them into smaller, more manageable chunks. This can be for batch processing, distributing work, or simply for easier inspection. Pandas, often in conjunction with NumPy, provides several effective ways to divide a DataFrame into multiple smaller DataFrames.
This guide explains how to split a Pandas DataFrame into a specific number of chunks or into chunks of a specific number of rows, using methods like numpy.array_split and DataFrame slicing.
Why Split a DataFrame?
- Memory Management: Processing very large DataFrames can consume significant memory. Splitting allows you to process data in smaller, memory-friendly pieces.
- Batch Processing: Many operations (e.g., writing to a database, making API calls) are more efficient or required to be done in batches.
- Parallel Processing: You can distribute chunks to different processes or threads for parallel computation (though libraries like Dask are often better suited for large-scale parallelism).
- Sampling/Subsetting: Creating smaller representative samples or subsets for testing or focused analysis.
- Iteration: When you need to perform an operation on sequential blocks of rows.
Example DataFrame: We'll use the following DataFrame for our examples:
import pandas as pd
import numpy as np # For array_split
data = {
    'id': range(1, 11),  # 10 rows
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.random.randint(10, 100, 10) * 1.99
}
df = pd.DataFrame(data)
print("Original DataFrame (first 5 rows):")
print(df.head())
Output (example, prices will vary):
Original DataFrame (first 5 rows):
id product_name category price
0 1 Product A Elec 47.76
1 2 Product B Book 89.55
2 3 Product C Home 183.08
3 4 Product D Elec 71.64
4 5 Product E Book 61.69
Method 1: Splitting into N Equal(ish) Chunks using numpy.array_split() (Recommended)
The numpy.array_split(ary, indices_or_sections) function is a versatile way to split a NumPy array (and thus a Pandas DataFrame, which is built on NumPy arrays) into a specific number of nearly equal sub-arrays (or sub-DataFrames).
Installation
Ensure you have Pandas and NumPy installed:
pip install pandas numpy
# Or
pip3 install pandas numpy
How It Works
- ary: The array or DataFrame to be split.
- indices_or_sections:
  - If an integer N, the array will be divided into N sub-arrays. If the array doesn't divide evenly, the first few sub-arrays will be slightly larger.
  - If a 1-D array of sorted integers, these integers indicate the points at which the array is split.
np.array_split() returns a list of DataFrames (when a DataFrame is passed in).
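For example, passing a 1-D array of split points (a minimal sketch; the points 3 and 7 are arbitrary) splits the DataFrame before those row positions:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': range(1, 11), 'value': range(10, 110, 10)})  # 10 rows

# Split before positions 3 and 7 -> rows 0-2, rows 3-6, rows 7-9
parts = np.array_split(df, [3, 7])
print([len(part) for part in parts])  # [3, 4, 3]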
Example Usage
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': range(1, 11),
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.round(np.random.rand(10) * 100, 2)
})
# Split the DataFrame into 3 chunks
num_chunks = 3
list_of_dfs = np.array_split(df, num_chunks)
print(f"Splitting into {num_chunks} chunks using np.array_split():")
for i, chunk_df in enumerate(list_of_dfs):
    print(f"\n--- Chunk {i+1} (shape: {chunk_df.shape}) ---")
    print(chunk_df)
Output:
Splitting into 3 chunks using np.array_split():
--- Chunk 1 (shape: (4, 4)) ---
id product_name category price
0 1 Product A Elec 27.08
1 2 Product B Book 5.67
2 3 Product C Home 29.66
3 4 Product D Elec 90.35
--- Chunk 2 (shape: (3, 4)) ---
id product_name category price
4 5 Product E Book 75.51
5 6 Product F Home 10.64
6 7 Product G Elec 80.14
--- Chunk 3 (shape: (3, 4)) ---
id product_name category price
7 8 Product H Book 55.35
8 9 Product I Home 10.87
9 10 Product J Elec 60.72
If the number of rows (10) is not perfectly divisible by num_chunks (3), np.array_split distributes the rows as evenly as possible: the first len(df) % num_chunks chunks will have one extra element.
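A quick check of the chunk lengths (a small sketch using a 10-row DataFrame) confirms this behaviour:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': range(1, 11)})  # 10 rows

# 10 rows / 3 chunks -> 10 % 3 == 1, so the first chunk gets one extra row
print([len(chunk) for chunk in np.array_split(df, 3)])  # [4, 3, 3]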
Accessing Individual Chunks
Since np.array_split returns a list of DataFrames, you can access them by index:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': range(1, 11),
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.round(np.random.rand(10) * 100, 2)
})
# Split the DataFrame into 3 chunks
num_chunks = 3
list_of_dfs = np.array_split(df, num_chunks)
# accessing individual chunks
first_chunk = list_of_dfs[0]
second_chunk = list_of_dfs[1]
print("--- First Chunk ---")
print(first_chunk)
Output:
--- First Chunk ---
id product_name category price
0 1 Product A Elec 27.08
1 2 Product B Book 5.67
2 3 Product C Home 29.66
3 4 Product D Elec 90.35
Method 2: Splitting Every N Rows (Creating Chunks of a Specific Size)
This approach splits the DataFrame into chunks where each chunk has a maximum of N rows (the last chunk might have fewer).
Using a for Loop and Slicing
You can iterate through the DataFrame's length with a step size and use standard DataFrame slicing.
import pandas as pd
import numpy as np
import math # For math.ceil
df = pd.DataFrame({
    'id': range(1, 11),
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.round(np.random.rand(10) * 100, 2)
})
def split_df_every_n_rows_loop(dataframe, chunk_size):
    """Splits a DataFrame into chunks of `chunk_size` rows using a loop."""
    list_of_chunks = []
    # Ceiling division so a partial final chunk still gets its own iteration
    total_chunks = math.ceil(len(dataframe) / chunk_size)
    for i in range(total_chunks):
        start_index = i * chunk_size
        end_index = start_index + chunk_size
        # Slicing handles the last chunk correctly (doesn't go out of bounds)
        list_of_chunks.append(dataframe[start_index:end_index])
    return list_of_chunks
chunk_size = 3 # Split into chunks of 3 rows
chunks_by_size_loop = split_df_every_n_rows_loop(df, chunk_size)
print(f"\nSplitting into chunks of size {chunk_size} (loop):")
for i, chunk_df in enumerate(chunks_by_size_loop):
    print(f"\n--- Chunk {i+1} (shape: {chunk_df.shape}) ---")
    print(chunk_df)
Output:
Splitting into chunks of size 3 (loop):
--- Chunk 1 (shape: (3, 4)) ---
id product_name category price
0 1 Product A Elec 92.57
1 2 Product B Book 71.40
2 3 Product C Home 76.34
--- Chunk 2 (shape: (3, 4)) ---
id product_name category price
3 4 Product D Elec 85.83
4 5 Product E Book 10.12
5 6 Product F Home 98.13
--- Chunk 3 (shape: (3, 4)) ---
id product_name category price
6 7 Product G Elec 13.58
7 8 Product H Book 82.77
8 9 Product I Home 83.03
--- Chunk 4 (shape: (1, 4)) ---
id product_name category price
9 10 Product J Elec 12.24
dataframe[start_index:end_index]: Standard Python/Pandas slicing. It gracefully handles cases where end_index exceeds the DataFrame length.
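For instance, slicing past the last row just returns whatever rows remain (a minimal illustration):
import pandas as pd

df = pd.DataFrame({'id': range(1, 11)})  # 10 rows

# end_index (15) is beyond the last row; slicing stops at the end instead of raising an error
print(df[9:15])
#    id
# 9  10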
Using List Comprehension and Slicing (Concise)
A list comprehension can make the previous approach more compact.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': range(1, 11),
    'product_name': [f'Product {chr(65+i)}' for i in range(10)],
    'category': ['Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec', 'Book', 'Home', 'Elec'],
    'price': np.round(np.random.rand(10) * 100, 2)
})
def split_df_every_n_rows_comp(dataframe, chunk_size):
    """Splits a DataFrame into chunks of `chunk_size` rows using list comprehension."""
    return [dataframe[i:i + chunk_size] for i in range(0, len(dataframe), chunk_size)]
chunk_size = 4 # Split into chunks of 4 rows
chunks_by_size_comp = split_df_every_n_rows_comp(df, chunk_size)
print(f"\nSplitting into chunks of size {chunk_size} (list comprehension):")
for i, chunk_df in enumerate(chunks_by_size_comp):
    print(f"\n--- Chunk {i+1} (shape: {chunk_df.shape}) ---")
    print(chunk_df)
Example output structure for chunk_size = 4 on 10 rows:
Splitting into chunks of size 4 (list comprehension):
--- Chunk 1 (shape: (4, 4)) ---
id product_name category price
0 1 Product A Elec 68.42
1 2 Product B Book 96.00
2 3 Product C Home 56.66
3 4 Product D Elec 48.59
--- Chunk 2 (shape: (4, 4)) ---
id product_name category price
4 5 Product E Book 8.06
5 6 Product F Home 98.00
6 7 Product G Elec 49.23
7 8 Product H Book 94.92
--- Chunk 3 (shape: (2, 4)) ---
id product_name category price
8 9 Product I Home 29.33
9 10 Product J Elec 95.62
range(0, len(dataframe), chunk_size): Generates the start index of each chunk.
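For example, with 10 rows and chunk_size = 4, the generated start indices are 0, 4, and 8:
# Start indices for a 10-row DataFrame split into chunks of 4 rows
print(list(range(0, 10, 4)))  # [0, 4, 8] -> rows 0-3, 4-7, 8-9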
Note on DataFrame.iloc for Slicing
When slicing rows this way (df[start:end]), Pandas implicitly uses position-based slicing similar to df.iloc[start:end], so you don't strictly need .iloc unless you want to be explicit about positional indexing or are combining row and column positional selection. For plain row slicing by position, df[start:end] is common and effective.
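With the default RangeIndex used in these examples, both forms select the same rows (a small sketch to illustrate the equivalence):
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': range(1, 11), 'price': np.round(np.random.rand(10) * 100, 2)})

# Plain slicing and .iloc slicing are both positional here, so the results match
print(df[3:6].equals(df.iloc[3:6]))  # True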
Choosing the Right Method
- To split into a specific number (N) of roughly equal chunks: Use numpy.array_split(df, N). This is generally the easiest and most robust way for this scenario, and it handles uneven divisions well.
- To split into chunks of a specific maximum row size (chunk_size): Use the list comprehension approach ([df[i:i + chunk_size] for i in range(0, len(df), chunk_size)]) or the equivalent for loop with slicing. This gives you control over the maximum size of each chunk.
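If you find yourself needing both behaviours, you could wrap them in a small helper. The function below (split_dataframe is a hypothetical name, not part of the Pandas API) is only a sketch of that idea:
import numpy as np
import pandas as pd

def split_dataframe(dataframe, num_chunks=None, chunk_size=None):
    """Hypothetical helper: split into `num_chunks` roughly equal parts,
    or into chunks of at most `chunk_size` rows."""
    if (num_chunks is None) == (chunk_size is None):
        raise ValueError("Pass exactly one of num_chunks or chunk_size")
    if num_chunks is not None:
        return np.array_split(dataframe, num_chunks)
    return [dataframe[i:i + chunk_size] for i in range(0, len(dataframe), chunk_size)]

df = pd.DataFrame({'id': range(1, 11)})
print([len(c) for c in split_dataframe(df, num_chunks=3)])  # [4, 3, 3]
print([len(c) for c in split_dataframe(df, chunk_size=4)])  # [4, 4, 2]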
Conclusion
Splitting a Pandas DataFrame into smaller chunks is a practical technique for managing large datasets or performing batch operations.
- numpy.array_split(df, N) is excellent for dividing a DataFrame into N approximately equal parts.
- Slicing within a loop or list comprehension (e.g., df[i:i + chunk_size]) is ideal when you need chunks of a specific maximum row count.
Both methods return a list of DataFrames, which you can then iterate over or access individually to perform your desired operations on each chunk.
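As a closing illustration, here is a sketch of the batch-processing pattern mentioned earlier; the chunk count and the chunk_0.csv, chunk_1.csv, ... filenames are arbitrary choices for this example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(1, 11), 'price': np.round(np.random.rand(10) * 100, 2)})

# Process each chunk independently, e.g. write it out to its own CSV file
for i, chunk_df in enumerate(np.array_split(df, 3)):
    chunk_df.to_csv(f'chunk_{i}.csv', index=False)
    print(f'Wrote chunk {i} with {len(chunk_df)} rows')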