Skip to main content

Python Pandas: How to Find Common Rows (Intersection) Between Two DataFrames

In data analysis and manipulation, a common task is to identify records or rows that are present in two different datasets. This "intersection" of DataFrames helps in finding overlapping information, validating data consistency, or preparing data for combined analysis. Pandas, with its powerful merge() function, provides an efficient and SQL-like way to perform this operation.

This guide will walk you through the process of finding common rows between two Pandas DataFrames, focusing on the pd.merge() method with how='inner'. You'll learn how to specify single or multiple columns for matching and understand the key parameters that control the merge behavior.

The Goal: Identifying Overlapping Records Across DataFrames

Imagine you have two DataFrames, perhaps from different sources or time periods, and you need to find out which rows (based on certain identifying columns) exist in both of them. This is akin to finding the intersection of two sets of data.

Let's define two sample DataFrames for our examples:

import pandas as pd

df1 = pd.DataFrame({
'student_id': ['S101', 'S102', 'S103', 'S104'],
'course_name': ['Math', 'Science', 'History', 'Math'],
'grade': [85, 90, 78, 92]
})

df2 = pd.DataFrame({
'student_id': ['S102', 'S104', 'S105', 'S101'],
'course_name': ['Science', 'Art', 'Physics', 'Math'],
'attendance': [0.95, 0.88, 0.92, 0.98]
})

print("DataFrame 1 (df1):")
print(df1)
print()

print("DataFrame 2 (df2):")
print(df2)

Output:

Common rows based on 'student_id':
student_id course_name_x grade course_name_y attendance
0 S101 Math 85 Math 0.98
1 S102 Science 90 Science 0.95
2 S104 Math 92 Art 0.88

The Pandas Solution: pd.merge() with how='inner'

The primary way to find the intersection of rows in Pandas is by using the pd.merge() function with the how='inner' argument. An inner merge returns only the rows where the key(s) specified in the on parameter exist in both DataFrames.

Finding Common Rows Based on a Single Key Column

Let's find rows that are common based on the student_id column.

import pandas as pd

df1 = pd.DataFrame({
'student_id': ['S101', 'S102', 'S103', 'S104'],
'course_name': ['Math', 'Science', 'History', 'Math'],
'grade': [85, 90, 78, 92]
})

df2 = pd.DataFrame({
'student_id': ['S102', 'S104', 'S105', 'S101'],
'course_name': ['Science', 'Art', 'Physics', 'Math'],
'attendance': [0.95, 0.88, 0.92, 0.98]
})

# Find common rows based on the 'student_id' column
common_rows_by_id = pd.merge(df1, df2, how='inner', on=['student_id'])
# You can also use on='student_id' if it's a single string

print("Common rows based on 'student_id':")
print(common_rows_by_id)

Output:

Common rows based on 'student_id':
student_id course_name_x grade course_name_y attendance
0 S101 Math 85 Math 0.98
1 S102 Science 90 Science 0.95
2 S104 Math 92 Art 0.88

The result includes rows where student_id values (S101, S102, S104) were present in both df1 and df2.

Understanding Merge Suffixes (e.g., _x, _y)

In the output above, notice course_name_x and course_name_y. When pd.merge() joins DataFrames, if there are columns with the same name in both DataFrames that are not part of the join keys (specified in on), Pandas automatically appends suffixes (_x for the left DataFrame, _y for the right DataFrame) to differentiate them. You can customize these suffixes using the suffixes=('_left_suffix', '_right_suffix') argument in pd.merge().

Key pd.merge() Parameters: how, on, left_on, right_on

  • left, right: The two DataFrames to merge.
  • how='inner': This is crucial for finding the intersection. It uses only the keys that exist in both DataFrames. Other options include 'left', 'right', and 'outer'.
  • on: A column name (string) or a list of column names to join on. These columns must exist in both DataFrames.
  • left_on, right_on: Use these if the key column(s) have different names in the left and right DataFrames. For example, pd.merge(df1, df2, how='inner', left_on='student_id', right_on='pupil_id').
  • suffixes=('_x', '_y'): A tuple of strings to append to overlapping column names that are not join keys.

Finding Common Rows Based on Multiple Key Columns

To consider rows common only if values in multiple specified columns match, provide a list of these column names to the on parameter.

import pandas as pd

df1 = pd.DataFrame({
'student_id': ['S101', 'S102', 'S103', 'S104'],
'course_name': ['Math', 'Science', 'History', 'Math'],
'grade': [85, 90, 78, 92]
})

df2 = pd.DataFrame({
'student_id': ['S102', 'S104', 'S105', 'S101'],
'course_name': ['Science', 'Art', 'Physics', 'Math'],
'attendance': [0.95, 0.88, 0.92, 0.98]
})

# Find common rows where both 'student_id' AND 'course_name' match
common_rows_by_id_and_course = pd.merge(
df1,
df2,
how='inner',
on=['student_id', 'course_name'] # Match on both columns
)

print("Common rows based on 'student_id' AND 'course_name':")
print(common_rows_by_id_and_course)

Output:

Common rows based on 'student_id' AND 'course_name':
student_id course_name grade attendance
0 S101 Math 85 0.98
1 S102 Science 90 0.95

In this case, student_id='S104' is no longer in the result because while S104 exists in both, the course_name for S104 is 'Math' in df1 and 'Art' in df2, so they don't match on this second key. The columns grade and attendance do not have suffixes because course_name is now part of the join key and the remaining columns (grade from df1, attendance from df2) are unique to their original DataFrames.

Implicit Intersection: Merging on All Common Column Names

If you call pd.merge(df1, df2, how='inner') without specifying the on, left_on, or right_on parameters, Pandas will implicitly use all column names that are common to both DataFrames as the join keys.

import pandas as pd

df_a = pd.DataFrame({
'key1': ['K0', 'K1', 'K2'],
'key2': ['A', 'B', 'C'],
'value_a': [1, 2, 3]
})
df_b = pd.DataFrame({
'key1': ['K0', 'K1', 'K3'],
'key2': ['A', 'X', 'C'], # Note: 'B' in df_a vs 'X' in df_b for K1
'value_b': [4, 5, 6]
})

# Merge on all common columns ('key1', 'key2') because 'on' is not specified
implicit_common = pd.merge(df_a, df_b, how='inner')

print("Common rows based on all common columns ('key1', 'key2'):")
print(implicit_common)

Output:

Common rows based on all common columns ('key1', 'key2'):
key1 key2 value_a value_b
0 K0 A 1 4

Only the row where key1='K0' AND key2='A' matched in both DataFrames is returned.

Optional: Handling NaN Values After Merging

An inner merge itself focuses on matching keys and doesn't inherently introduce NaN values in the key columns as part of the intersection logic (rows without matching keys are dropped). However, if the non-key columns from the original DataFrames contained NaN values, these NaNs will be carried over into the merged result for the rows that do match.

If you want to remove rows from the merged result that have NaN in any of their columns (including non-key columns that came from the original DataFrames):

import pandas as pd
import numpy as np # For np.nan

df_x = pd.DataFrame({
'id': [1, 2, 3],
'data_x': ['X1', 'X2', np.nan], # df_x has a NaN
'val': [10,20,30]
})
df_y = pd.DataFrame({
'id': [1, 2, 4],
'data_y': ['Y1', np.nan, 'Y3'], # df_y has a NaN
'val': [10,20,40] # 'val' column also common
})

# Intersection based on 'id' and 'val'
merged_result = pd.merge(df_x, df_y, how='inner', on=['id','val'])
print("Merged result (may contain NaNs from original data):")
print(merged_result)
print()

# Optionally drop rows with any NaN values from the merged result
merged_result_no_na = merged_result.dropna()
# For in-place modification: merged_result.dropna(inplace=True)

print("Merged result after dropna():")
print(merged_result_no_na)

Output:

Merged result (may contain NaNs from original data):
id data_x val data_y
0 1 X1 10 Y1
1 2 X2 20 NaN

Merged result after dropna():
id data_x val data_y
0 1 X1 10 Y1

Using dropna() here is a post-processing step, applied after the intersection based on keys has been determined.

Conclusion

Finding the common rows or intersection between two Pandas DataFrames is a fundamental operation, effectively achieved using pd.merge(df1, df2, how='inner', on=key_columns).

  • By specifying the appropriate key column(s) in the on parameter, you can precisely define what constitutes a "common row."
  • Understanding how pd.merge handles overlapping non-key columns (with suffixes) and its default behavior when on is omitted will allow you to confidently identify shared records across your datasets.