Python Pandas: How to Find Rows in One DataFrame Not Present in Another
Identifying rows that exist in one Pandas DataFrame but are absent in another (based on specific columns or all columns) is a common task in data reconciliation, comparison, and set operations. This is often referred to as an "anti-join" or finding the set difference between DataFrames.
This guide explains how to effectively get rows from one DataFrame that are not present in a second DataFrame, primarily using DataFrame.merge()
with an indicator and boolean indexing with isin()
.
The Goal: Identifying Unique Rows in One DataFrame
Given two Pandas DataFrames, df1
and df2
, we want to find all the rows that are present in df1
but do not have a matching counterpart in df2
. The "match" is typically determined by the values in one or more common key columns, or sometimes by comparing entire rows.
Example DataFrames
import pandas as pd
data1 = {
'ID': [101, 102, 103, 104, 105],
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'Category': ['Electronics', 'Accessory', 'Accessory', 'Electronics', 'Accessory']
}
df1 = pd.DataFrame(data1)
data2 = {
'ID': [101, 103, 106, 107], # 102, 104, 105 from df1 are missing here
'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger'],
'Category': ['Electronics', 'Accessory', 'Electronics', 'Accessory']
}
df2 = pd.DataFrame(data2)
print("DataFrame df1:")
print(df1)
print()
print("DataFrame df2:")
print(df2)
Output:
DataFrame df1:
ID Product Category
0 101 Laptop Electronics
1 102 Mouse Accessory
2 103 Keyboard Accessory
3 104 Monitor Electronics
4 105 Webcam Accessory
DataFrame df2:
ID Product Category
0 101 Laptop Electronics
1 103 Keyboard Accessory
2 106 Tablet Electronics
3 107 Charger Accessory
We want to find rows in df1
whose 'ID' and 'Product' (or just 'ID') are not in df2
.
Method 1: Using DataFrame.merge()
with how='left'
and indicator=True
(Recommended)
This method performs a left join and uses an indicator column to tell us the source of each row in the merged result.
Performing the Left Merge with Indicator
Merge df1
(left) with df2
(right) on the key column(s).
how='left'
: Keeps all rows fromdf1
and the matching rows fromdf2
. Rows indf1
with no match indf2
will haveNaN
fordf2
's columns.on=['key_col1', 'key_col2']
: Specifies the column(s) to join on.indicator=True
: Adds a special column named_merge
to the output. This column indicates the source of each row:'left_only'
: The row's merge key was only indf1
.'right_only'
: The row's merge key was only indf2
(only possible withhow='outer'
).'both'
: The row's merge key was in both DataFrames.
import pandas as pd
df1_example = pd.DataFrame({
'ID': [101, 102, 103, 104, 105], 'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
'ID': [101, 103, 106, 107], 'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})
# Specify columns to merge on
merge_columns = ['ID', 'Product'] # Or just ['ID'] if that's the key
# Perform a left merge with indicator
# Using df2.drop_duplicates() on merge_columns is good practice if df2 might have duplicates on keys
merged_df_indicator = df1_example.merge(
df2_example.drop_duplicates(subset=merge_columns), # Ensure df2 keys are unique for merge
on=merge_columns,
how='left',
indicator=True # Adds the '_merge' column
)
print("Merged DataFrame with indicator:")
print(merged_df_indicator)
Output:
Merged DataFrame with indicator:
ID Product _merge
0 101 Laptop both
1 102 Mouse left_only
2 103 Keyboard both
3 104 Monitor left_only
4 105 Webcam left_only
Filtering for left_only
Now, select the rows from the merged DataFrame where the _merge
column is 'left_only'
. These are the rows from df1
that are not in df2
.
import pandas as pd
df1_example = pd.DataFrame({
'ID': [101, 102, 103, 104, 105], 'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
'ID': [101, 103, 106, 107], 'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})
# Specify columns to merge on
merge_columns = ['ID', 'Product'] # Or just ['ID'] if that's the key
# Perform a left merge with indicator
# Using df2.drop_duplicates() on merge_columns is good practice if df2 might have duplicates on keys
merged_df_indicator = df1_example.merge(
df2_example.drop_duplicates(subset=merge_columns), # Ensure df2 keys are unique for merge
on=merge_columns,
how='left',
indicator=True # Adds the '_merge' column
)
# ✅ Filter for rows that are only in the left DataFrame (df1)
rows_in_df1_not_in_df2 = merged_df_indicator[merged_df_indicator['_merge'] == 'left_only']
print("Rows from df1 that are NOT in df2:")
print(rows_in_df1_not_in_df2)
Output:
Rows from df1 that are NOT in df2:
ID Product _merge
1 102 Mouse left_only
3 104 Monitor left_only
4 105 Webcam left_only
Customizing the Indicator Column Name
You can provide a string to the indicator
parameter to customize the name of the merge source column.
merged_df_custom_indicator = df1_example.merge(
df2_example.drop_duplicates(subset=merge_columns),
on=merge_columns,
how='left',
indicator='Source_DataFrame' # Custom name
)
# Then filter: rows_not_in = merged_df_custom_indicator[merged_df_custom_indicator['Source_DataFrame'] == 'left_only']
Dropping the Indicator Column
If you only want the original columns from df1
in your final result, drop the _merge
(or custom indicator) column after filtering.
import pandas as pd
df1_example = pd.DataFrame({
'ID': [101, 102, 103, 104, 105], 'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
'ID': [101, 103, 106, 107], 'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})
# Specify columns to merge on
merge_columns = ['ID', 'Product'] # Or just ['ID'] if that's the key
# Perform a left merge with indicator
# Using df2.drop_duplicates() on merge_columns is good practice if df2 might have duplicates on keys
merged_df_indicator = df1_example.merge(
df2_example.drop_duplicates(subset=merge_columns), # Ensure df2 keys are unique for merge
on=merge_columns,
how='left',
indicator=True # Adds the '_merge' column
)
# Filter for rows that are only in the left DataFrame (df1)
rows_in_df1_not_in_df2 = merged_df_indicator[merged_df_indicator['_merge'] == 'left_only']
# --- all above already in previous example ---
# Drop the indicator column and any columns that came from df2 (which would be all NaN for left_only rows)
# A simpler way is to select only the original df1 columns from the result.
original_df1_columns = df1_example.columns.tolist()
result_df1_only = rows_in_df1_not_in_df2[original_df1_columns]
# If you merged on all columns of df1 and df2 had extra columns that got added as NaN:
# result_df1_only = rows_in_df1_not_in_df2.drop(columns=['_merge']) # and other df2 columns
print("Final result (original df1 columns, rows not in df2):")
print(result_df1_only)
Output:
Final result (original df1 columns, rows not in df2):
ID Product
1 102 Mouse
3 104 Monitor
4 105 Webcam
Method 2: Using Boolean Indexing with Inverted isin()
(Row-wise Comparison)
This method is suitable when you want to compare based on all columns or a specific set of columns, treating rows as entities.
Performing an Inner Merge to Find Common Rows
First, find the rows that are common to both DataFrames based on the key columns.
import pandas as pd
df1_example = pd.DataFrame({
'ID': [101, 102, 103, 104, 105], 'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
'ID': [101, 103, 106, 107], 'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})
merge_columns = ['ID', 'Product']
# Get rows that are common to both (intersection)
common_rows = df1_example.merge(df2_example, on=merge_columns, how='inner')
print("Common rows (intersection):")
print(common_rows)
Output:
Common rows (intersection):
ID Product
0 101 Laptop
1 103 Keyboard
Using isin()
and ~
to Filter df1
Now, check which rows in df1
(based on key columns) are not in the common_rows
.
This approach requires comparing row by row if you want to check based on multiple columns simultaneously as an "entire row match". It's often easier to use merge
as in Method 1.
A more direct isin
approach for multiple columns can be achieved by creating tuples of the rows for comparison, or by checking each key column individually and combining the boolean conditions.
Simpler case: if matching is only on 'ID':
import pandas as pd
df1_example = pd.DataFrame({
'ID': [101, 102, 103, 104, 105], 'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
'ID': [101, 103, 106, 107], 'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})
mask_not_in_df2_ids = ~df1_example['ID'].isin(df2_example['ID'])
result_isin_id = df1_example[mask_not_in_df2_ids]
print("Rows from df1 where ID not in df2's IDs (using isin on one col):")
print(result_isin_id)
Output:
Rows from df1 where ID not in df2's IDs (using isin on one col):
ID Product
1 102 Mouse
3 104 Monitor
4 105 Webcam
For matching on multiple columns with isin
:
import pandas as pd
df1_example = pd.DataFrame({
'ID': [101, 102, 103, 104, 105], 'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
'ID': [101, 103, 106, 107], 'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})
merge_columns = ['ID', 'Product']
# For a multi-column isin check, create a MultiIndex from df2 for efficient lookup,
# or compare based on tuples if order matters and all columns are used for the key.
# For matching on multiple columns ['ID', 'Product'] with isin:
# Create an index from the merge_columns of df2 for efficient lookup
df2_multi_index = df2_example.set_index(merge_columns).index
df1_tuples_to_check = [tuple(x) for x in df1_example[merge_columns].to_numpy()]
mask_not_in_df2_multi = [item not in df2_multi_index for item in df1_tuples_to_check]
result_isin_multi = df1_example[mask_not_in_df2_multi]
print("Rows from df1 where (ID, Product) tuple not in df2 (using multi-column isin logic):")
print(result_isin_multi)
Output:
Rows from df1 where (ID, Product) tuple not in df2 (using multi-column isin logic):
ID Product
1 102 Mouse
3 104 Monitor
4 105 Webcam
This isin
approach for multiple columns can become complex. The merge
with indicator=True
is usually much cleaner.
Simpler isin()
on Tuples of Rows (for exact all-column match)
If you want to find rows in df1
that are not exact matches of any row in df2
(considering all columns):
import pandas as pd
df1_exact = pd.DataFrame({ 'A': [1,2,3], 'B': ['x','y','z'] })
df2_exact = pd.DataFrame({ 'A': [1,2,4], 'B': ['x','y','w'] })
# Convert rows to tuples for isin check
# This assumes df1 and df2 have the same column order for this to make sense
df1_tuples = [tuple(x) for x in df1_exact.to_numpy()]
df2_tuples = [tuple(x) for x in df2_exact.to_numpy()]
# Find rows in df1_tuples that are not in df2_tuples
mask_not_in_df2_exact = ~pd.Series(df1_tuples).isin(df2_tuples)
result_exact_match = df1_exact[mask_not_in_df2_exact.values] # .values to align mask if index differs
print("Rows from df1_exact not exactly in df2_exact:")
print(result_exact_match)
Output:
Rows from df1_exact not exactly in df2_exact:
A B
2 3 z
This checks for entire row equality. If columns differ, use merge
.
Important Considerations (Matching Columns, Duplicates)
- Key Columns: Clearly define which column(s) determine a "match" between
df1
anddf2
. These are used in theon
parameter ofmerge
. - Duplicates in
df2
: Ifdf2
has duplicate rows based on the merge keys, a left merge might result in duplicated rows fromdf1
. Usingdf2.drop_duplicates(subset=merge_columns)
before merging can prevent this if you only care about existence indf2
. - Column Names: Ensure the key columns have the same names in both DataFrames. If not, use
left_on
andright_on
parameters inmerge
.
Conclusion
To find rows in a Pandas DataFrame (df1
) that are not present in another DataFrame (df2
) based on specific key columns:
- The
df1.merge(df2, on=key_columns, how='left', indicator=True)
method is highly recommended. After merging, filter the result for rows where the indicator column (e.g.,_merge
) is'left_only'
. This clearly shows whichdf1
rows had no match indf2
. - Using
isin()
can work, especially for single key columns (~df1['key'].isin(df2['key'])
). For multiple key columns, constructing tuples or a MultiIndex for theisin
check can be more complex than usingmerge
.
The merge
approach with indicator=True
provides a robust and readable solution for this common "anti-join" or set difference operation between DataFrames.