Python Pandas: How to Find Common Rows (Intersection) Between Two DataFrames
In data analysis and manipulation, a common task is to identify records or rows that are present in two different datasets. This "intersection" of DataFrames helps in finding overlapping information, validating data consistency, or preparing data for combined analysis. Pandas, with its powerful merge()
function, provides an efficient and SQL-like way to perform this operation.
This guide will walk you through the process of finding common rows between two Pandas DataFrames, focusing on the pd.merge()
method with how='inner'
. You'll learn how to specify single or multiple columns for matching and understand the key parameters that control the merge behavior.
The Goal: Identifying Overlapping Records Across DataFrames
Imagine you have two DataFrames, perhaps from different sources or time periods, and you need to find out which rows (based on certain identifying columns) exist in both of them. This is akin to finding the intersection of two sets of data.
Let's define two sample DataFrames for our examples:
import pandas as pd
df1 = pd.DataFrame({
'student_id': ['S101', 'S102', 'S103', 'S104'],
'course_name': ['Math', 'Science', 'History', 'Math'],
'grade': [85, 90, 78, 92]
})
df2 = pd.DataFrame({
'student_id': ['S102', 'S104', 'S105', 'S101'],
'course_name': ['Science', 'Art', 'Physics', 'Math'],
'attendance': [0.95, 0.88, 0.92, 0.98]
})
print("DataFrame 1 (df1):")
print(df1)
print()
print("DataFrame 2 (df2):")
print(df2)
Output:
Common rows based on 'student_id':
student_id course_name_x grade course_name_y attendance
0 S101 Math 85 Math 0.98
1 S102 Science 90 Science 0.95
2 S104 Math 92 Art 0.88
The Pandas Solution: pd.merge()
with how='inner'
The primary way to find the intersection of rows in Pandas is by using the pd.merge()
function with the how='inner'
argument. An inner merge returns only the rows where the key(s) specified in the on
parameter exist in both DataFrames.
Finding Common Rows Based on a Single Key Column
Let's find rows that are common based on the student_id
column.
import pandas as pd
df1 = pd.DataFrame({
'student_id': ['S101', 'S102', 'S103', 'S104'],
'course_name': ['Math', 'Science', 'History', 'Math'],
'grade': [85, 90, 78, 92]
})
df2 = pd.DataFrame({
'student_id': ['S102', 'S104', 'S105', 'S101'],
'course_name': ['Science', 'Art', 'Physics', 'Math'],
'attendance': [0.95, 0.88, 0.92, 0.98]
})
# Find common rows based on the 'student_id' column
common_rows_by_id = pd.merge(df1, df2, how='inner', on=['student_id'])
# You can also use on='student_id' if it's a single string
print("Common rows based on 'student_id':")
print(common_rows_by_id)
Output:
Common rows based on 'student_id':
student_id course_name_x grade course_name_y attendance
0 S101 Math 85 Math 0.98
1 S102 Science 90 Science 0.95
2 S104 Math 92 Art 0.88
The result includes rows where student_id
values (S101
, S102
, S104
) were present in both df1
and df2
.
Understanding Merge Suffixes (e.g., _x
, _y
)
In the output above, notice course_name_x
and course_name_y
. When pd.merge()
joins DataFrames, if there are columns with the same name in both DataFrames that are not part of the join keys (specified in on
), Pandas automatically appends suffixes (_x
for the left DataFrame, _y
for the right DataFrame) to differentiate them. You can customize these suffixes using the suffixes=('_left_suffix', '_right_suffix')
argument in pd.merge()
.
Key pd.merge()
Parameters: how
, on
, left_on
, right_on
left
,right
: The two DataFrames to merge.how='inner'
: This is crucial for finding the intersection. It uses only the keys that exist in both DataFrames. Other options include'left'
,'right'
, and'outer'
.on
: A column name (string) or a list of column names to join on. These columns must exist in both DataFrames.left_on
,right_on
: Use these if the key column(s) have different names in the left and right DataFrames. For example,pd.merge(df1, df2, how='inner', left_on='student_id', right_on='pupil_id')
.suffixes=('_x', '_y')
: A tuple of strings to append to overlapping column names that are not join keys.
Finding Common Rows Based on Multiple Key Columns
To consider rows common only if values in multiple specified columns match, provide a list of these column names to the on
parameter.
import pandas as pd
df1 = pd.DataFrame({
'student_id': ['S101', 'S102', 'S103', 'S104'],
'course_name': ['Math', 'Science', 'History', 'Math'],
'grade': [85, 90, 78, 92]
})
df2 = pd.DataFrame({
'student_id': ['S102', 'S104', 'S105', 'S101'],
'course_name': ['Science', 'Art', 'Physics', 'Math'],
'attendance': [0.95, 0.88, 0.92, 0.98]
})
# Find common rows where both 'student_id' AND 'course_name' match
common_rows_by_id_and_course = pd.merge(
df1,
df2,
how='inner',
on=['student_id', 'course_name'] # Match on both columns
)
print("Common rows based on 'student_id' AND 'course_name':")
print(common_rows_by_id_and_course)
Output:
Common rows based on 'student_id' AND 'course_name':
student_id course_name grade attendance
0 S101 Math 85 0.98
1 S102 Science 90 0.95
In this case, student_id='S104'
is no longer in the result because while S104
exists in both, the course_name
for S104
is 'Math' in df1
and 'Art' in df2
, so they don't match on this second key. The columns grade
and attendance
do not have suffixes because course_name
is now part of the join key and the remaining columns (grade
from df1
, attendance
from df2
) are unique to their original DataFrames.
Implicit Intersection: Merging on All Common Column Names
If you call pd.merge(df1, df2, how='inner')
without specifying the on
, left_on
, or right_on
parameters, Pandas will implicitly use all column names that are common to both DataFrames as the join keys.
import pandas as pd
df_a = pd.DataFrame({
'key1': ['K0', 'K1', 'K2'],
'key2': ['A', 'B', 'C'],
'value_a': [1, 2, 3]
})
df_b = pd.DataFrame({
'key1': ['K0', 'K1', 'K3'],
'key2': ['A', 'X', 'C'], # Note: 'B' in df_a vs 'X' in df_b for K1
'value_b': [4, 5, 6]
})
# Merge on all common columns ('key1', 'key2') because 'on' is not specified
implicit_common = pd.merge(df_a, df_b, how='inner')
print("Common rows based on all common columns ('key1', 'key2'):")
print(implicit_common)
Output:
Common rows based on all common columns ('key1', 'key2'):
key1 key2 value_a value_b
0 K0 A 1 4
Only the row where key1='K0'
AND key2='A'
matched in both DataFrames is returned.
Optional: Handling NaN
Values After Merging
An inner
merge itself focuses on matching keys and doesn't inherently introduce NaN
values in the key columns as part of the intersection logic (rows without matching keys are dropped). However, if the non-key columns from the original DataFrames contained NaN
values, these NaN
s will be carried over into the merged result for the rows that do match.
If you want to remove rows from the merged result that have NaN
in any of their columns (including non-key columns that came from the original DataFrames):
import pandas as pd
import numpy as np # For np.nan
df_x = pd.DataFrame({
'id': [1, 2, 3],
'data_x': ['X1', 'X2', np.nan], # df_x has a NaN
'val': [10,20,30]
})
df_y = pd.DataFrame({
'id': [1, 2, 4],
'data_y': ['Y1', np.nan, 'Y3'], # df_y has a NaN
'val': [10,20,40] # 'val' column also common
})
# Intersection based on 'id' and 'val'
merged_result = pd.merge(df_x, df_y, how='inner', on=['id','val'])
print("Merged result (may contain NaNs from original data):")
print(merged_result)
print()
# Optionally drop rows with any NaN values from the merged result
merged_result_no_na = merged_result.dropna()
# For in-place modification: merged_result.dropna(inplace=True)
print("Merged result after dropna():")
print(merged_result_no_na)
Output:
Merged result (may contain NaNs from original data):
id data_x val data_y
0 1 X1 10 Y1
1 2 X2 20 NaN
Merged result after dropna():
id data_x val data_y
0 1 X1 10 Y1
Using dropna()
here is a post-processing step, applied after the intersection based on keys has been determined.
Conclusion
Finding the common rows or intersection between two Pandas DataFrames is a fundamental operation, effectively achieved using pd.merge(df1, df2, how='inner', on=key_columns)
.
- By specifying the appropriate key column(s) in the
on
parameter, you can precisely define what constitutes a "common row." - Understanding how
pd.merge
handles overlapping non-key columns (with suffixes) and its default behavior whenon
is omitted will allow you to confidently identify shared records across your datasets.