
Python Pandas: How to Find Rows in One DataFrame Not Present in Another

Identifying rows that exist in one Pandas DataFrame but are absent in another (based on specific columns or all columns) is a common task in data reconciliation, comparison, and set operations. This is often referred to as an "anti-join" or finding the set difference between DataFrames.

This guide explains how to get the rows of one DataFrame that are not present in a second DataFrame, primarily using DataFrame.merge() with indicator=True, and alternatively boolean indexing with isin().

The Goal: Identifying Unique Rows in One DataFrame

Given two Pandas DataFrames, df1 and df2, we want to find all the rows that are present in df1 but do not have a matching counterpart in df2. The "match" is typically determined by the values in one or more common key columns, or sometimes by comparing entire rows.

Example DataFrames

import pandas as pd

data1 = {
    'ID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Category': ['Electronics', 'Accessory', 'Accessory', 'Electronics', 'Accessory']
}
df1 = pd.DataFrame(data1)

data2 = {
    'ID': [101, 103, 106, 107],  # 102, 104, 105 from df1 are missing here
    'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger'],
    'Category': ['Electronics', 'Accessory', 'Electronics', 'Accessory']
}
df2 = pd.DataFrame(data2)

print("DataFrame df1:")
print(df1)
print()

print("DataFrame df2:")
print(df2)

Output:

DataFrame df1:
    ID   Product     Category
0  101    Laptop  Electronics
1  102     Mouse    Accessory
2  103  Keyboard    Accessory
3  104   Monitor  Electronics
4  105    Webcam    Accessory

DataFrame df2:
    ID   Product     Category
0  101    Laptop  Electronics
1  103  Keyboard    Accessory
2  106    Tablet  Electronics
3  107   Charger    Accessory

We want to find the rows in df1 whose ('ID', 'Product') combination (or just 'ID', depending on the chosen key) does not appear in df2.

Method 1: Using merge() with indicator=True (Recommended)

This method performs a left join and uses an indicator column that tells us the source of each row in the merged result.

Performing the Left Merge with Indicator

Merge df1 (left) with df2 (right) on the key column(s).

  • how='left': Keeps all rows from df1 and the matching rows from df2. Rows in df1 with no match in df2 will have NaN for df2's columns.
  • on=['key_col1', 'key_col2']: Specifies the column(s) to join on.
  • indicator=True: Adds a special column named _merge to the output. This column indicates the source of each row:
    • 'left_only': The row's merge key was only in df1.
    • 'right_only': The row's merge key was only in df2 (this value only appears with how='right' or how='outer').
    • 'both': The row's merge key was in both DataFrames.
import pandas as pd

df1_example = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
    'ID': [101, 103, 106, 107],
    'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})

# Specify columns to merge on
merge_columns = ['ID', 'Product']  # Or just ['ID'] if that's the key

# Perform a left merge with indicator
# Using df2.drop_duplicates() on merge_columns is good practice if df2 might have duplicates on keys
merged_df_indicator = df1_example.merge(
    df2_example.drop_duplicates(subset=merge_columns),  # Ensure df2 keys are unique for merge
    on=merge_columns,
    how='left',
    indicator=True  # Adds the '_merge' column
)

print("Merged DataFrame with indicator:")
print(merged_df_indicator)

Output:

Merged DataFrame with indicator:
    ID   Product     _merge
0  101    Laptop       both
1  102     Mouse  left_only
2  103  Keyboard       both
3  104   Monitor  left_only
4  105    Webcam  left_only

Filtering for left_only

Now, select the rows from the merged DataFrame where the _merge column is 'left_only'. These are the rows from df1 that are not in df2.

import pandas as pd

df1_example = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
    'ID': [101, 103, 106, 107],
    'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})

# Specify columns to merge on
merge_columns = ['ID', 'Product']  # Or just ['ID'] if that's the key

# Perform a left merge with indicator
# Using df2.drop_duplicates() on merge_columns is good practice if df2 might have duplicates on keys
merged_df_indicator = df1_example.merge(
    df2_example.drop_duplicates(subset=merge_columns),  # Ensure df2 keys are unique for merge
    on=merge_columns,
    how='left',
    indicator=True  # Adds the '_merge' column
)

# ✅ Filter for rows that are only in the left DataFrame (df1)
rows_in_df1_not_in_df2 = merged_df_indicator[merged_df_indicator['_merge'] == 'left_only']

print("Rows from df1 that are NOT in df2:")
print(rows_in_df1_not_in_df2)

Output:

Rows from df1 that are NOT in df2:
    ID  Product     _merge
1  102    Mouse  left_only
3  104  Monitor  left_only
4  105   Webcam  left_only

Customizing the Indicator Column Name

You can provide a string to the indicator parameter to customize the name of the merge source column.

merged_df_custom_indicator = df1_example.merge(
    df2_example.drop_duplicates(subset=merge_columns),
    on=merge_columns,
    how='left',
    indicator='Source_DataFrame'  # Custom name
)
# Then filter: rows_not_in = merged_df_custom_indicator[merged_df_custom_indicator['Source_DataFrame'] == 'left_only']

Dropping the Indicator Column

If you only want the original columns from df1 in your final result, drop the _merge (or custom indicator) column after filtering.

import pandas as pd

df1_example = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
    'ID': [101, 103, 106, 107],
    'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})

# Specify columns to merge on
merge_columns = ['ID', 'Product']  # Or just ['ID'] if that's the key

# Perform a left merge with indicator
# Using df2.drop_duplicates() on merge_columns is good practice if df2 might have duplicates on keys
merged_df_indicator = df1_example.merge(
    df2_example.drop_duplicates(subset=merge_columns),  # Ensure df2 keys are unique for merge
    on=merge_columns,
    how='left',
    indicator=True  # Adds the '_merge' column
)

# Filter for rows that are only in the left DataFrame (df1)
rows_in_df1_not_in_df2 = merged_df_indicator[merged_df_indicator['_merge'] == 'left_only']

# --- all above already in previous example ---

# Drop the indicator column and any columns that came from df2 (which would be all NaN for left_only rows)
# A simpler way is to select only the original df1 columns from the result.
original_df1_columns = df1_example.columns.tolist()
result_df1_only = rows_in_df1_not_in_df2[original_df1_columns]

# If you merged on all columns of df1 and df2 had extra columns that got added as NaN:
# result_df1_only = rows_in_df1_not_in_df2.drop(columns=['_merge']) # and other df2 columns

print("Final result (original df1 columns, rows not in df2):")
print(result_df1_only)

Output:

Final result (original df1 columns, rows not in df2):
    ID  Product
1  102    Mouse
3  104  Monitor
4  105   Webcam

Method 2: Using Boolean Indexing with Inverted isin() (Row-wise Comparison)

This method is suitable when you want to compare based on all columns or a specific set of columns, treating rows as entities.

Performing an Inner Merge to Find Common Rows

First, find the rows that are common to both DataFrames based on the key columns.

import pandas as pd

df1_example = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
    'ID': [101, 103, 106, 107],
    'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})
merge_columns = ['ID', 'Product']

# Get rows that are common to both (intersection)
common_rows = df1_example.merge(df2_example, on=merge_columns, how='inner')
print("Common rows (intersection):")
print(common_rows)

Output:

Common rows (intersection):
    ID   Product
0  101    Laptop
1  103  Keyboard

Using isin() and ~ to Filter df1

Now, filter df1 to the rows whose key values are not present in df2 (the complement of the intersection above). For a single key column, isin() works directly. For multiple key columns treated together as one row-level match, the rows need to be compared as tuples (or via a MultiIndex); checking each key column individually and combining the boolean masks is not equivalent, because it tests the values independently rather than as a pair. In practice, the merge approach from Method 1 is often easier.

Simpler case: if matching is only on 'ID':

import pandas as pd

df1_example = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
    'ID': [101, 103, 106, 107],
    'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})

mask_not_in_df2_ids = ~df1_example['ID'].isin(df2_example['ID'])
result_isin_id = df1_example[mask_not_in_df2_ids]
print("Rows from df1 where ID not in df2's IDs (using isin on one col):")
print(result_isin_id)

Output:

Rows from df1 where ID not in df2's IDs (using isin on one col):
    ID  Product
1  102    Mouse
3  104  Monitor
4  105   Webcam

For matching on multiple columns with isin:

import pandas as pd

df1_example = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
    'ID': [101, 103, 106, 107],
    'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})
merge_columns = ['ID', 'Product']

# For a multi-column isin check, create a MultiIndex from df2 for efficient lookup,
# or compare based on tuples if order matters and all columns are used for the key.

# For matching on multiple columns ['ID', 'Product'] with isin:
# Create an index from the merge_columns of df2 for efficient lookup
df2_multi_index = df2_example.set_index(merge_columns).index
df1_tuples_to_check = [tuple(x) for x in df1_example[merge_columns].to_numpy()]

mask_not_in_df2_multi = [item not in df2_multi_index for item in df1_tuples_to_check]
result_isin_multi = df1_example[mask_not_in_df2_multi]

print("Rows from df1 where (ID, Product) tuple not in df2 (using multi-column isin logic):")
print(result_isin_multi)

Output:

Rows from df1 where (ID, Product) tuple not in df2 (using multi-column isin logic):
    ID  Product
1  102    Mouse
3  104  Monitor
4  105   Webcam
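
If you do want a multi-column isin() check without the Python-level loop, the same comparison can be written in a more vectorized way with pd.MultiIndex.from_frame (available in pandas 0.24+) and Index.isin. A minimal sketch reusing the example DataFrames above:

import pandas as pd

df1_example = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
df2_example = pd.DataFrame({
    'ID': [101, 103, 106, 107],
    'Product': ['Laptop', 'Keyboard', 'Tablet', 'Charger']
})
merge_columns = ['ID', 'Product']

# Build MultiIndexes over the key columns of both DataFrames
df1_keys = pd.MultiIndex.from_frame(df1_example[merge_columns])
df2_keys = pd.MultiIndex.from_frame(df2_example[merge_columns])

# Boolean mask: True where the (ID, Product) pair has no match in df2
mask_not_in_df2 = ~df1_keys.isin(df2_keys)
print(df1_example[mask_not_in_df2])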

Note: This isin approach for multiple columns can become complex. The merge with indicator=True is usually much cleaner.

Simpler isin() on Tuples of Rows (for exact all-column match)

If you want to find rows in df1 that are not exact matches of any row in df2 (considering all columns):

import pandas as pd

df1_exact = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
df2_exact = pd.DataFrame({'A': [1, 2, 4], 'B': ['x', 'y', 'w']})

# Convert rows to tuples for isin check
# This assumes df1 and df2 have the same column order for this to make sense
df1_tuples = [tuple(x) for x in df1_exact.to_numpy()]
df2_tuples = [tuple(x) for x in df2_exact.to_numpy()]

# Find rows in df1_tuples that are not in df2_tuples
mask_not_in_df2_exact = ~pd.Series(df1_tuples).isin(df2_tuples)
result_exact_match = df1_exact[mask_not_in_df2_exact.values] # .values to align mask if index differs

print("Rows from df1_exact not exactly in df2_exact:")
print(result_exact_match)

Output:

Rows from df1_exact not exactly in df2_exact:
   A  B
2  3  z

This checks for entire-row equality. If the columns differ, or the tuple handling feels awkward, use merge instead; an all-column merge version is sketched below.
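
The same all-column anti-join can also be expressed with merge by passing every column to on. A minimal sketch reusing df1_exact and df2_exact from above:

import pandas as pd

df1_exact = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
df2_exact = pd.DataFrame({'A': [1, 2, 4], 'B': ['x', 'y', 'w']})

# Merge on every shared column and keep only the rows unique to df1_exact
merged_all_cols = df1_exact.merge(
    df2_exact.drop_duplicates(),
    on=list(df1_exact.columns),
    how='left',
    indicator=True
)
result_all_cols = merged_all_cols[merged_all_cols['_merge'] == 'left_only'].drop(columns='_merge')
print(result_all_cols)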

Important Considerations (Matching Columns, Duplicates)

  • Key Columns: Clearly define which column(s) determine a "match" between df1 and df2. These are used in the on parameter of merge.
  • Duplicates in df2: If df2 has duplicate rows based on the merge keys, a left merge might result in duplicated rows from df1. Using df2.drop_duplicates(subset=merge_columns) before merging can prevent this if you only care about existence in df2.
  • Column Names: Ensure the key columns have the same names in both DataFrames. If they do not, use the left_on and right_on parameters of merge (see the sketch after this list).
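
If the key columns are named differently in the two DataFrames, the same left-merge-with-indicator pattern still applies with left_on and right_on. A minimal sketch with hypothetical frames whose key columns are item_id and product_id:

import pandas as pd

# Hypothetical frames with differently named key columns
df_left = pd.DataFrame({'item_id': [1, 2, 3], 'name': ['Laptop', 'Mouse', 'Webcam']})
df_right = pd.DataFrame({'product_id': [1, 3]})

merged_diff_names = df_left.merge(
    df_right,
    left_on='item_id',      # key column in df_left
    right_on='product_id',  # key column in df_right
    how='left',
    indicator=True
)

# Keep only df_left's unmatched rows, and only df_left's original columns
only_in_left = merged_diff_names[merged_diff_names['_merge'] == 'left_only'][df_left.columns]
print(only_in_left)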

Conclusion

To find rows in a Pandas DataFrame (df1) that are not present in another DataFrame (df2) based on specific key columns:

  1. The df1.merge(df2, on=key_columns, how='left', indicator=True) method is highly recommended. After merging, filter the result for rows where the indicator column (e.g., _merge) is 'left_only'. This clearly shows which df1 rows had no match in df2 (a reusable helper based on this pattern is sketched after this list).
  2. Using isin() can work, especially for single key columns (~df1['key'].isin(df2['key'])). For multiple key columns, constructing tuples or a MultiIndex for the isin check can be more complex than using merge.
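
For quick reuse, the pattern from point 1 can be wrapped in a small helper. This is only a sketch (anti_join is not a built-in pandas function) and assumes on is a list of key column names present in both DataFrames:

import pandas as pd

def anti_join(left: pd.DataFrame, right: pd.DataFrame, on: list) -> pd.DataFrame:
    """Return the rows of `left` whose key values in `on` have no match in `right`."""
    merged = left.merge(
        right[on].drop_duplicates(),  # only the key columns of `right` are needed
        on=on,
        how='left',
        indicator=True
    )
    return merged.loc[merged['_merge'] == 'left_only', left.columns]

# Example: anti_join(df1, df2, on=['ID', 'Product'])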

The merge approach with indicator=True provides a robust and readable solution for this common "anti-join" or set difference operation between DataFrames.