Skip to main content

Python Pandas: How to Keep Index When Merging DataFrames

When merging two Pandas DataFrames using pd.merge() or DataFrame.merge(), the resulting DataFrame typically gets a new default integer index (0, 1, 2,...). However, there are common scenarios where you want to preserve the index of one of the original DataFrames (usually the "left" DataFrame) in the merged result.

This guide explains several common techniques to keep or restore the original index after merging Pandas DataFrames, primarily focusing on reset_index(), set_index(), and the DataFrame.join() method.

Understanding Default Index Behavior in Merges

By default, pd.merge(left_df, right_df, ...) performs a database-style join based on common columns (or specified on columns). The index of the input DataFrames is generally not used for alignment in the merge operation itself (unless left_on=left_df.index or left_index=True is used), and the resulting DataFrame gets a new default RangeIndex.

If df1 has a meaningful index you wish to preserve, this default behavior discards it.

Example DataFrames

import pandas as pd

# DataFrame 1 with a custom string index
data1 = {
'FiscalYear': [2020, 2021, 2022, 2023],
'Revenue': [1.5, 1.8, 2.1, 2.5], # In millions
'Region': ['North', 'South', 'North', 'East']
}

df1 = pd.DataFrame(data1, index=['FY20', 'FY21', 'FY22', 'FY23'])
df1.index.name = 'ReportPeriod'
print("Original df1 (with custom index 'ReportPeriod'):")
print(df1)
print()

# DataFrame 2 with a default integer index
data2 = {
'FiscalYear': [2020, 2021, 2022, 2023],
'Employees': [100, 110, 125, 130],
'MarketingSpend': [0.2, 0.25, 0.3, 0.35] # In millions
}

df2 = pd.DataFrame(data2)
print("Original df2 (with default index):")
print(df2)

Output:

Original df1 (with custom index 'ReportPeriod'):
FiscalYear Revenue Region
ReportPeriod
FY20 2020 1.5 North
FY21 2021 1.8 South
FY22 2022 2.1 North
FY23 2023 2.5 East

Original df2 (with default index):
FiscalYear Employees MarketingSpend
0 2020 100 0.20
1 2021 110 0.25
2 2022 125 0.30
3 2023 130 0.35

We want to merge these on 'FiscalYear' and keep df1's 'ReportPeriod' index.

This is a robust, multi-step approach:

  1. reset_index() on df1: Convert df1's current index ('ReportPeriod') into a regular column.
  2. merge(): Perform the merge using the common 'FiscalYear' column and the newly created index column from df1.
  3. set_index() on the merged result: Set the column that originated from df1's index back as the index of the merged DataFrame.
import pandas as pd

df1_example = pd.DataFrame({
'FiscalYear': [2020, 2021, 2022, 2023], 'Revenue': [1.5, 1.8, 2.1, 2.5]
}, index=pd.Index(['FY20', 'FY21', 'FY22', 'FY23'], name='ReportPeriod'))
df2_example = pd.DataFrame({
'FiscalYear': [2020, 2021, 2022, 2023], 'Employees': [100, 110, 125, 130]
})

# Step 1: Reset index of df1, making 'ReportPeriod' a column
df1_reset = df1_example.reset_index()
print("df1 after reset_index():")
print(df1_reset)
print()

# Step 2: Merge df1_reset with df2 on 'FiscalYear'
merged_df = pd.merge(df1_reset, df2_example, on='FiscalYear', how='left')
print("Merged DataFrame (before setting index back):")
print(merged_df)
print()

# Step 3: Set 'ReportPeriod' column back as the index
# ✅ Final result with original index preserved
df_final_preserved_index = merged_df.set_index('ReportPeriod')

print("Merged DataFrame with df1's index preserved:")
print(df_final_preserved_index)

# This can be chained:
# df_final_chained = df1.reset_index().merge(df2, on='FiscalYear', how='left').set_index('ReportPeriod')

Output:

df1 after reset_index():
ReportPeriod FiscalYear Revenue
0 FY20 2020 1.5
1 FY21 2021 1.8
2 FY22 2022 2.1
3 FY23 2023 2.5

Merged DataFrame (before setting index back):
ReportPeriod FiscalYear Revenue Employees
0 FY20 2020 1.5 100
1 FY21 2021 1.8 110
2 FY22 2022 2.1 125
3 FY23 2023 2.5 130

Merged DataFrame with df1's index preserved:
FiscalYear Revenue Employees
ReportPeriod
FY20 2020 1.5 100
FY21 2021 1.8 110
FY22 2022 2.1 125
FY23 2023 2.5 130
note

This method explicitly manages the index column during the merge process.

Method 2: Creating an "Index Copy" Column Before merge()

Before merging, create a new column in df1 that is a copy of its index. Then, after merging, set this copied column as the index.

import pandas as pd

df1_example = pd.DataFrame({
'FiscalYear': [2020, 2021, 2022, 2023], 'Revenue': [1.5, 1.8, 2.1, 2.5]
}, index=pd.Index(['FY20', 'FY21', 'FY22', 'FY23'], name='ReportPeriod'))
df2_example = pd.DataFrame({
'FiscalYear': [2020, 2021, 2022, 2023], 'Employees': [100, 110, 125, 130]
})

df1_with_index_copy = df1_example.copy() # Work on a copy

# Step 1: Create a new column from df1's index
df1_with_index_copy['OriginalIndex'] = df1_with_index_copy.index
print("df1 with 'OriginalIndex' column:")
print(df1_with_index_copy)
print()

# Step 2: Merge (df1 now has its original index as a regular column)
# Note: We merge df1_with_index_copy, which no longer has 'ReportPeriod' as index.
# If 'FiscalYear' isn't a column in df1_with_index_copy (because it *was* the index),
# you'd need to reset_index() first or merge on index.
# Assuming FiscalYear is a column in df1:
merged_df_index_copy = pd.merge(df1_with_index_copy.reset_index(), df2_example, on='FiscalYear', how='left')
# Here, reset_index() is used to ensure 'FiscalYear' from df1_example (if it was part of index before copy) is available for merge.
# And we also get the 'ReportPeriod' as a column.

# Step 3: Set the copied index column as the new index
# ✅ If original index name was 'ReportPeriod', it's available after reset_index()
df_final_index_copy = merged_df_index_copy.set_index('ReportPeriod')

print("Merged DataFrame using index copy method:")
# We might want to drop the temporary 0-based index if reset_index() created it without drop=True
# Or select columns to exclude any extra index column if OriginalIndex was different from ReportPeriod
print(df_final_index_copy[['FiscalYear', 'Revenue', 'Employees']]) # Show relevant columns

Output:

df1 with 'OriginalIndex' column:
FiscalYear Revenue OriginalIndex
ReportPeriod
FY20 2020 1.5 FY20
FY21 2021 1.8 FY21
FY22 2022 2.1 FY22
FY23 2023 2.5 FY23

Merged DataFrame using index copy method:
FiscalYear Revenue Employees
ReportPeriod
FY20 2020 1.5 100
FY21 2021 1.8 110
FY22 2022 2.1 125
FY23 2023 2.5 130
note

This method requires careful handling of which columns are used for merging if the original index was also a merge key. Method 1 is generally cleaner.

Method 3: Using DataFrame.join() (Joins on Index by Default or Specified Column)

The DataFrame.join() method is specifically designed to merge DataFrames based on their indexes, or on an index of one DataFrame and a column of another. This often preserves the calling DataFrame's index naturally.

import pandas as pd

df1_example = pd.DataFrame({
'FiscalYear': [2020, 2021, 2022, 2023], 'Revenue': [1.5, 1.8, 2.1, 2.5]
}, index=pd.Index(['FY20', 'FY21', 'FY22', 'FY23'], name='ReportPeriod'))
df2_example = pd.DataFrame({
'FiscalYear': [2020, 2021, 2022, 2023], 'Employees': [100, 110, 125, 130]
})

# To join df1 (caller) with df2:
# - df1 will join using its index ('ReportPeriod').
# - We need df2 to be indexed by the column we want to match with a column in df1.
# Let's assume we want to join on 'FiscalYear'.
# So, set 'FiscalYear' as index for df2 temporarily for the join.
# And specify `on='FiscalYear'` for df1 to use its 'FiscalYear' column.

# ✅ Perform the join
# df1 uses its 'FiscalYear' column, df2 uses its (newly set) 'FiscalYear' index.
# The index of df1 (the caller) is preserved.
df_joined = df1_example.join(df2_example.set_index('FiscalYear'), on='FiscalYear', how='left')

print("Merged DataFrame using .join():")
print(df_joined)
print()

print("Index of joined DataFrame:")
print(df_joined.index)

Output:

Merged DataFrame using .join():
FiscalYear Revenue Employees
ReportPeriod
FY20 2020 1.5 100
FY21 2021 1.8 110
FY22 2022 2.1 125
FY23 2023 2.5 130

Index of joined DataFrame:
Index(['FY20', 'FY21', 'FY22', 'FY23'], dtype='object', name='ReportPeriod')
  • df1.join(other_df, on='col_in_df1', lsuffix='_left', rsuffix='_right'):
    • other_df is joined to df1.
    • If on is specified, df1 uses that column for joining. other_df uses its index.
    • If on is not specified, both DataFrames join on their indexes.
  • In this case, df1_example.join(df2_example.set_index('FiscalYear'), on='FiscalYear') is powerful: df1_example keeps its 'ReportPeriod' index and uses its 'FiscalYear' column to match against the index of the modified df2_example (which we temporarily set to be its 'FiscalYear' column).

Important Considerations for Index Preservation

  • Uniqueness of Merge Keys: If the columns you merge on result in duplicate index values in the merged DataFrame (e.g., a one-to-many merge where df1's index value appears multiple times after merge), the set_index() step in Method 1 might raise an error if the resulting index is not unique. You might need to handle duplicates first (e.g., drop_duplicates()) or decide how to manage a non-unique index.
  • Type of Join (how parameter): The type of join (left, right, inner, outer) in pd.merge() or df.join() will affect which rows (and thus which original index values) are present in the final result.

Conclusion

When merging Pandas DataFrames and you need to preserve the index of one of them (typically the left DataFrame):

  1. The reset_index() -> merge() -> set_index() pattern (Method 1) is a clear and robust way to handle this explicitly when using pd.merge().
  2. The DataFrame.join(other.set_index('key_col'), on='key_col') method (Method 3) is often more concise and natural if your merge logic aligns well with index-based joining, as it preserves the calling DataFrame's index by default.
  3. Creating a temporary "index copy" column (Method 2) is also possible but can be less elegant.

Choose the method that best clarifies your intent and handles your specific merge keys and desired index structure. join() is often preferred when the primary join condition involves the index of at least one DataFrame.