Python Pandas: Calculating Mean (Average) Across Multiple DataFrames
In data analysis, you often work with multiple datasets that represent similar entities or observations over different periods or conditions. A common task is to consolidate these datasets and compute aggregate statistics, such as the mean (average), across all of them. For instance, you might want to find the average sales figures from several monthly reports or the mean sensor readings from different experimental runs.
This guide will comprehensively demonstrate how to calculate the mean across multiple Pandas DataFrames, covering both an overall mean for each column and a more granular row-wise mean based on matching indices. We'll explore techniques using pd.concat()
, groupby()
, and the stack()
/unstack()
methods to give you a complete toolkit.
The Goal: Averaging Data from Separate DataFrames
Imagine you have several Pandas DataFrames, each containing similar data (i.e., same columns and data types, representing comparable measurements). You might want to compute:
- Overall Mean: The average value for each column, considering all data points from all DataFrames combined.
- Row-wise Mean: The average value for each row, assuming the rows in different DataFrames correspond to each other (e.g., based on a common index).
Let's define two sample DataFrames for our examples:
import pandas as pd
df1 = pd.DataFrame({
'sensor_A': [10, 12, 15, 11, 13],
'sensor_B': [22, 20, 25, 23, 21]
})
df2 = pd.DataFrame({
'sensor_A': [11, 14, 13, 10, 15],
'sensor_B': [20, 21, 22, 24, 20]
})
df_short = pd.DataFrame({ # For illustrating different row counts
'sensor_A': [16, 18],
'sensor_B': [19, 22]
})
print("df1:\n", df1)
print()
print("df2:\n", df2)
Output:
df1:
sensor_A sensor_B
0 10 22
1 12 20
2 15 25
3 11 23
4 13 21
df2:
sensor_A sensor_B
0 11 20
1 14 21
2 13 22
3 10 24
4 15 20
Method 1: Overall Column Mean Across All DataFrames
This approach calculates the mean for each column by first combining all rows from all DataFrames into a single, larger DataFrame.
Concatenate the DataFrames
Use pandas.concat()
to stack the DataFrames vertically (along axis=0
, which is the default).
import pandas as pd
df1 = pd.DataFrame({
'sensor_A': [10, 12, 15, 11, 13],
'sensor_B': [22, 20, 25, 23, 21]
})
df2 = pd.DataFrame({
'sensor_A': [11, 14, 13, 10, 15],
'sensor_B': [20, 21, 22, 24, 20]
})
# Concatenate df1 and df2
df_combined = pd.concat([df1, df2]) # ignore_index=True can be added for a fresh index
print("Combined DataFrame (df1 and df2):")
print(df_combined)
Output:
Combined DataFrame (df1 and df2):
sensor_A sensor_B
0 10 22
1 12 20
2 15 25
3 11 23
4 13 21
0 11 20
1 14 21
2 13 22
3 10 24
4 15 20
Calculate the Mean
Once combined, call the .mean()
method on the resulting DataFrame. This will compute the mean for each numeric column.
import pandas as pd
df1 = pd.DataFrame({
'sensor_A': [10, 12, 15, 11, 13],
'sensor_B': [22, 20, 25, 23, 21]
})
df2 = pd.DataFrame({
'sensor_A': [11, 14, 13, 10, 15],
'sensor_B': [20, 21, 22, 24, 20]
})
# Concatenate df1 and df2
df_combined = pd.concat([df1, df2]) # ignore_index=True can be added for a fresh index
# Calculate the mean for each column in the combined DataFrame
overall_column_means = df_combined.mean()
print("Overall Column Means:")
print(overall_column_means)
Output:
Overall Column Means:
sensor_A 12.4
sensor_B 21.8
dtype: float64
This gives you the average of sensor_A
across all 10 values (5 from df1
, 5 from df2
), and similarly for sensor_B
.
Method 2: Row-wise Mean Across DataFrames (by Index)
This method calculates the mean for each row position (index) across the DataFrames. It assumes that rows with the same index in different DataFrames are comparable.
Using pd.concat()
and groupby(df.index)
or groupby(level=0)
First, concatenate the DataFrames. Then, group the combined DataFrame by its index and calculate the mean for each group (each unique index value).
import pandas as pd
df1 = pd.DataFrame({
'sensor_A': [10, 12, 15, 11, 13],
'sensor_B': [22, 20, 25, 23, 21]
})
df2 = pd.DataFrame({
'sensor_A': [11, 14, 13, 10, 15],
'sensor_B': [20, 21, 22, 24, 20]
})
df_combined_for_rowwise = pd.concat([df1, df2])
# Group by the index of the combined DataFrame
# For a simple RangeIndex, df_combined_for_rowwise.index is sufficient.
# If you had a MultiIndex, you might use level=0.
by_row_index = df_combined_for_rowwise.groupby(df_combined_for_rowwise.index)
# Calculate the mean for each group (each original row index)
rowwise_means = by_row_index.mean()
print("Row-wise Means (df1 and df2):")
print(rowwise_means)
Output:
Row-wise Means (df1 and df2):
sensor_A sensor_B
0 10.5 21.0
1 13.0 20.5
2 14.0 23.5
3 10.5 23.5
4 14.0 20.5
Handling DataFrames with Different Numbers of Rows
The groupby(df.index).mean()
approach gracefully handles cases where DataFrames have different numbers of rows. For indices present in only one DataFrame, the "mean" will simply be the value from that single DataFrame.
import pandas as pd
df1 = pd.DataFrame({
'sensor_A': [10, 12, 15, 11, 13],
'sensor_B': [22, 20, 25, 23, 21]
})
df2 = pd.DataFrame({
'sensor_A': [11, 14, 13, 10, 15],
'sensor_B': [20, 21, 22, 24, 20]
})
df_short = pd.DataFrame({ # For illustrating different row counts
'sensor_A': [16, 18],
'sensor_B': [19, 22]
})
df_combined_diff_len = pd.concat([df1, df_short])
print("Combined DataFrame (df1 and df_short):")
print(df_combined_diff_len)
print()
rowwise_means_diff_len = df_combined_diff_len.groupby(df_combined_diff_len.index).mean()
print("Row-wise Means (df1 and df_short):")
print(rowwise_means_diff_len)
Output:
Combined DataFrame (df1 and df_short):
sensor_A sensor_B
0 10 22
1 12 20
2 15 25
3 11 23
4 13 21
0 16 19
1 18 22
Row-wise Means (df1 and df_short):
sensor_A sensor_B
0 13.0 20.5
1 15.0 21.0
2 15.0 25.0
3 11.0 23.0
4 13.0 21.0
Method 3: Row-wise Mean Using stack()
and unstack()
(for two DataFrames with same shape)
If you have exactly two DataFrames with the same shape (same number of rows and columns, and matching indices), you can perform element-wise operations. For averaging, a more generalizable approach for multiple DataFrames (though shown here for two) involves stack()
and mean(level=...)
or by directly averaging.
A simpler direct average for two DataFrames with the same shape:
import pandas as pd
df1 = pd.DataFrame({
'sensor_A': [10, 12, 15, 11, 13],
'sensor_B': [22, 20, 25, 23, 21]
})
df2 = pd.DataFrame({
'sensor_A': [11, 14, 13, 10, 15],
'sensor_B': [20, 21, 22, 24, 20]
})
# Ensure indices align if they are meaningful and not just default RangeIndex
# For this example, we assume default RangeIndex or already aligned indices.
if df1.shape == df2.shape and df1.index.equals(df2.index):
rowwise_mean_direct = (df1 + df2) / 2
print("Row-wise Mean (df1 and df2, direct average):")
print(rowwise_mean_direct)
else:
print("DataFrames do not have the same shape or index for direct averaging.")
Output:
Row-wise Mean (df1 and df2, direct average):
sensor_A sensor_B
0 10.5 21.0
1 13.0 20.5
2 14.0 23.5
3 10.5 23.5
4 14.0 20.5
For a more general approach (e.g., more than two DataFrames) that is conceptually similar to the stack()
/unstack()
, but using pd.concat
with keys and mean(level=...)
:
import pandas as pd
df1 = pd.DataFrame({
'sensor_A': [10, 12, 15, 11, 13],
'sensor_B': [22, 20, 25, 23, 21]
})
df2 = pd.DataFrame({
'sensor_A': [11, 14, 13, 10, 15],
'sensor_B': [20, 21, 22, 24, 20]
})
# Concatenate with keys to identify original DataFrames
all_dfs = [df1, df2] # Add more DFs here if needed
panel_df = pd.concat(all_dfs, keys=['df1', 'df2']) # keys distinguish original DFs
print("Panel DataFrame (MultiIndex):")
print(panel_df)
print()
# Calculate mean across the original DataFrames (level 1 of the index is the original row index)
rowwise_mean_panel = panel_df.groupby(level=1).mean()
print("Row-wise Mean from Panel DataFrame:")
print(rowwise_mean_panel)
Output:
Panel DataFrame (MultiIndex):
sensor_A sensor_B
df1 0 10 22
1 12 20
2 15 25
3 11 23
4 13 21
df2 0 11 20
1 14 21
2 13 22
3 10 24
4 15 20
Row-wise Mean from Panel DataFrame:
sensor_A sensor_B
0 10.5 21.0
1 13.0 20.5
2 14.0 23.5
3 10.5 23.5
4 14.0 20.5
This pd.concat
with keys
approach is robust for multiple DataFrames of the same column structure, even if row counts differ (though groupby(level=1)
assumes original indices are comparable or you want to average by original position).
The (df1.stack() + df2.stack()) / 2
followed by .unstack()
is a clever way for exactly two DataFrames with aligned indices, but the pd.concat
with keys
or groupby(df.index)
methods are generally more flexible for multiple DataFrames or varying row counts.
Extension: Calculating Median Across Multiple DataFrames
The same principles apply if you want to calculate the median (or any other aggregate function like sum
, min
, max
, std
). Simply replace .mean()
with .median()
.
import pandas as pd
df1 = pd.DataFrame({
'sensor_A': [10, 12, 15, 11, 13],
'sensor_B': [22, 20, 25, 23, 21]
})
df2 = pd.DataFrame({
'sensor_A': [11, 14, 13, 10, 15],
'sensor_B': [20, 21, 22, 24, 20]
})
df_combined = pd.concat([df1, df2])
# Overall Column Medians
overall_column_medians = df_combined.median()
print("Overall Column Medians:")
print(overall_column_medians)
print()
# Row-wise Medians (using the result from Method 2)
by_row_index_median = df_combined.groupby(df_combined.index).median()
print("Row-wise Medians:")
print(by_row_index_median)
Output:
Overall Column Medians:
sensor_A 12.5
sensor_B 21.5
dtype: float64
Row-wise Medians:
sensor_A sensor_B
0 10.5 21.0
1 13.0 20.5
2 14.0 23.5
3 10.5 23.5
4 14.0 20.5
Choosing the Right Method
- Overall Column Mean/Median (Method 1): Use when you want a single average/median value for each column, treating all data from all DataFrames as one large dataset.
- Row-wise Mean/Median by Index (Method 2 & 4 with
groupby
): Use when rows at the same index position across DataFrames are comparable, and you want an average/median for each such corresponding set of rows. This is very flexible. - Direct Element-wise Average (Simplified part of Method 4): Suitable only for exactly two DataFrames of identical shape and aligned indices.
Conclusion
Pandas offers flexible and powerful ways to calculate means (and other statistics) across multiple DataFrames. By first concatenating the DataFrames using pd.concat()
, you can then apply .mean()
(or .median()
, etc.) to either the entire combined dataset for overall column statistics or use groupby(df.index)
(or groupby(level=...)
with keys) for more granular row-wise calculations based on matching indices. Understanding these techniques allows you to effectively aggregate and summarize information from distributed datasets.