Python Pandas: How to Merge Only Specific DataFrame Columns
When merging two Pandas DataFrames, you often don't need to bring in all columns from both DataFrames into the final result. Pandas merge()
functionality allows you to selectively choose which columns from each DataFrame participate in the merge and which ones are carried over into the resulting merged DataFrame.
This guide explains how to merge only specific columns from one or both DataFrames, ensuring your merged DataFrame contains precisely the information you need.
The Goal: Selective Column Merging
Given two Pandas DataFrames, df1
(left) and df2
(right), that share one or more common "key" columns for joining, we want to create a new merged DataFrame that includes:
- All columns from
df1
. - Only a specific subset of columns from
df2
. Or, more generally, a specific subset of columns from bothdf1
anddf2
.
Example DataFrames
import pandas as pd
# DataFrame 1 (e.g., Sales Data)
data1 = {
'OrderID': [101, 102, 103, 104, 105],
'ProductID': ['P20', 'P21', 'P20', 'P22', 'P21'],
'Quantity': [2, 1, 5, 3, 2],
'SaleDate': pd.to_datetime(['2025-01-15', '2025-01-16', '2025-01-17', '2025-01-18', '2025-01-19'])
}
df1 = pd.DataFrame(data1)
print("Original df1 (Sales Data):")
print(df1)
print()
# DataFrame 2 (e.g., Product Details)
data2 = {
'ProductID': ['P20', 'P21', 'P22', 'P23', 'P24'],
'ProductName': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'Category': ['Electronics', 'Accessory', 'Accessory', 'Electronics', 'Accessory'],
'Supplier': ['SupplierX', 'SupplierY', 'SupplierX', 'SupplierZ', 'SupplierY'],
'Cost': [800, 15, 45, 200, 30]
}
df2 = pd.DataFrame(data2)
print("Original df2 (Product Details):")
print(df2)
Output:
Original df1 (Sales Data):
OrderID ProductID Quantity SaleDate
0 101 P20 2 2025-01-15
1 102 P21 1 2025-01-16
2 103 P20 5 2025-01-17
3 104 P22 3 2025-01-18
4 105 P21 2 2025-01-19
Original df2 (Product Details):
ProductID ProductName Category Supplier Cost
0 P20 Laptop Electronics SupplierX 800
1 P21 Mouse Accessory SupplierY 15
2 P22 Keyboard Accessory SupplierX 45
3 P23 Monitor Electronics SupplierZ 200
4 P24 Webcam Accessory SupplierY 30
We want to merge these on 'ProductID', bringing all columns from df1
but only 'ProductName' and 'Category' from df2
.
Method 1: Subsetting the "Right" DataFrame Before Merging (Recommended)
The most straightforward way is to select the desired columns from the right DataFrame (df2
) before passing it to the merge()
method.
Selecting Specific Columns from the Right DataFrame
Create a subset of df2
containing only the merge key(s) and the other columns you want to include from df2
.
import pandas as pd
df1_example = pd.DataFrame({
'OrderID': [101, 102], 'ProductID': ['P20', 'P21'], 'Quantity': [2, 1]
})
df2_example = pd.DataFrame({
'ProductID': ['P20', 'P21', 'P22'],
'ProductName': ['Laptop', 'Mouse', 'Keyboard'],
'Category': ['Electronics', 'Accessory', 'Accessory'],
'Supplier': ['X', 'Y', 'X']
})
# Define the merge key and the specific columns to bring from df2
merge_key = 'ProductID'
columns_from_df2 = ['ProductName', 'Category'] # Columns we want from df2
# ✅ Create a subset of df2: includes the merge key AND the desired columns
df2_subset = df2_example[[merge_key] + columns_from_df2]
print("Subset of df2 to be merged:")
print(df2_subset)
print()
# Merge df1 with this subset of df2
merged_df_specific_from_right = pd.merge(
df1_example,
df2_subset,
on=merge_key,
how='left' # or 'inner', etc.
)
print("Merged DataFrame with specific columns from df2:")
print(merged_df_specific_from_right)
Output:
Subset of df2 to be merged:
ProductID ProductName Category
0 P20 Laptop Electronics
1 P21 Mouse Accessory
2 P22 Keyboard Accessory
Merged DataFrame with specific columns from df2:
OrderID ProductID Quantity ProductName Category
0 101 P20 2 Laptop Electronics
1 102 P21 1 Mouse Accessory
df2_example[[merge_key] + columns_from_df2]
: This creates a new DataFrame fromdf2_example
containing only the 'ProductID' (the merge key) and the 'ProductName' and 'Category' columns.- The merge operation then proceeds, and only these selected columns from
df2_subset
are added todf1_example
.
Selecting Specific Columns from Both DataFrames
If you also want to use only a subset of columns from the left DataFrame (df1
), apply the same subsetting logic to it.
import pandas as pd
df1_example = pd.DataFrame({
'OrderID': [101, 102], 'ProductID': ['P20', 'P21'], 'Quantity': [2, 1], 'Region': ['N', 'S']
})
df2_example = pd.DataFrame({
'ProductID': ['P20', 'P21'], 'ProductName': ['Laptop', 'Mouse'], 'Supplier': ['X', 'Y']
})
merge_key = 'ProductID'
cols_from_df1_to_merge = ['OrderID', 'ProductID', 'Quantity'] # Keep these from df1
cols_from_df2_to_merge = ['ProductName'] # Keep this from df2 (plus merge key)
# Create subsets of both DataFrames
df1_subset_for_merge = df1_example[cols_from_df1_to_merge]
df2_subset_for_merge = df2_example[[merge_key] + cols_from_df2_to_merge]
# Merge the subsets
merged_df_specific_both = pd.merge(
df1_subset_for_merge,
df2_subset_for_merge,
on=merge_key,
how='left'
)
print("Merged DataFrame with specific columns from both df1 and df2:")
print(merged_df_specific_both)
Output:
Merged DataFrame with specific columns from both df1 and df2:
OrderID ProductID Quantity ProductName
0 101 P20 2 Laptop
1 102 P21 1 Mouse
Method 2: Merging All then Dropping Unwanted Columns
An alternative is to merge all (or more than needed) columns initially and then use DataFrame.drop()
to remove the columns you don't want in the final result.
import pandas as pd
df1_example = pd.DataFrame({'OrderID': [101], 'ProductID': ['P20']})
df2_example = pd.DataFrame({
'ProductID': ['P20'], 'ProductName': ['Laptop'],
'Category': ['Electronics'], 'Supplier': ['X'], 'Cost': [800]
})
merge_key = 'ProductID'
# Step 1: Merge, potentially bringing in all columns from df2 initially
# For this example, let's assume df2 has many columns.
# df_merged_all = pd.merge(df1, df2, on=merge_key, how='left')
# In this case, df2_example is small, so let's merge with it directly
df_merged_temp = pd.merge(df1_example, df2_example, on=merge_key, how='left')
print("Temporarily merged DataFrame (all df2 cols):")
print(df_merged_temp)
print()
# Step 2: Define columns to drop from the merged result
# These would be columns from df2 that we didn't want (e.g., 'Supplier', 'Cost')
columns_to_remove_after_merge = ['Supplier', 'Cost']
# ✅ Drop the unwanted columns
df_final_dropped = df_merged_temp.drop(columns=columns_to_remove_after_merge)
print("Merged DataFrame after dropping unwanted columns:")
print(df_final_dropped)
Output:
Temporarily merged DataFrame (all df2 cols):
OrderID ProductID ProductName Category Supplier Cost
0 101 P20 Laptop Electronics X 800
Merged DataFrame after dropping unwanted columns:
OrderID ProductID ProductName Category
0 101 P20 Laptop Electronics
This method can be useful if the list of columns to drop is shorter or easier to define than the list of columns to keep, or if some columns from df2
are needed for intermediate steps before being dropped. However, subsetting before the merge (Method 1) is often more memory and computationally efficient if many columns from df2
are not needed at all.
Using DataFrame.loc
for Subsetting Before Merging (Alternative to Direct Slicing)
Instead of df2_subset = df2[[merge_key] + columns_from_df2]
, you can use .loc
to select columns. The result is the same.
import pandas as pd
df1_example = pd.DataFrame({'OrderID': [101], 'ProductID': ['P20']})
df2_example = pd.DataFrame({
'ProductID': ['P20'], 'ProductName': ['Laptop'], 'Category': ['Electronics']
})
merge_key = 'ProductID'
columns_from_df2 = ['ProductName', 'Category']
# ✅ Select columns from df2 using .loc
# The first ':' means all rows.
df2_subset_loc = df2_example.loc[:, [merge_key] + columns_from_df2]
merged_df_loc = pd.merge(df1_example, df2_subset_loc, on=merge_key, how='left')
print("Merged using .loc for subsetting df2:")
print(merged_df_loc)
Output:
Merged using .loc for subsetting df2:
OrderID ProductID ProductName Category
0 101 P20 Laptop Electronics
This is just a different syntax for selecting the subset of df2
prior to merging.
Key merge()
Parameters (on
, how
)
Remember to correctly specify:
on
: The column name(s) to join on (must be present in both DataFrames). Useleft_on
andright_on
if key column names differ.how
: Type of merge to be performed:'left'
: Keeps all keys fromdf1
.'right'
: Keeps all keys fromdf2
.'outer'
: Keeps all keys from both DataFrames.'inner'
(default): Keeps only keys common to both DataFrames.
Conclusion
To merge only specific columns from one Pandas DataFrame (df2
) into another (df1
):
- The recommended and most efficient approach is to create a subset of
df2
containing only the merge key(s) and the desired columns before callingpd.merge()
:df2_selected_cols = df2[['merge_key_col'] + ['col_to_keep1', 'col_to_keep2']]
merged = pd.merge(df1, df2_selected_cols, on='merge_key_col', how='left') - You can apply the same subsetting logic to
df1
if you only want specific columns from it as well. - Alternatively, merge more columns than needed and then use
merged_df.drop(columns=['unwanted_col1', ...])
to remove the excess columns from the result. This is generally less efficient if many columns are to be dropped.
By selectively choosing columns before or after the merge, you can construct a final DataFrame that precisely contains the data required for your analysis.