Python Pandas: Select DataFrame Rows Where Two Columns Are Equal (or Not Equal)
Comparing values across different columns within the same row is a common data validation and filtering task in Pandas. You might need to identify rows where two specific columns have identical values, or conversely, where they differ.
This guide demonstrates how to select or filter Pandas DataFrame rows based on the equality (or inequality) of values in two columns, primarily using boolean indexing and the DataFrame.query()
method.
The Goal: Comparing Values Across Two Columns Row-wise
Given a Pandas DataFrame, we want to compare the values in two specified columns for each row. Based on this comparison (equal or not equal), we want to select the subset of rows that meets our criteria.
Example DataFrame
import pandas as pd
import numpy as np # For NaN example
data = {
'ID': [101, 102, 103, 104, 105, 106],
'SKU_A': ['X123', 'Y456', 'Z789', 'X123', 'A001', 'B002'],
'SKU_B': ['X123', 'Y450', 'Z789', 'X124', 'A001', np.nan], # SKU_B has different values and a NaN
'Quantity_A': [10, 5, 20, 10, 15, 8],
'Quantity_B': [10, 7, 15, 10, 15, 8]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
ID SKU_A SKU_B Quantity_A Quantity_B
0 101 X123 X123 10 10
1 102 Y456 Y450 5 7
2 103 Z789 Z789 20 15
3 104 X123 X124 10 10
4 105 A001 A001 15 15
5 106 B002 NaN 8 8
Select Rows Where Two Columns ARE Equal
We might want to find rows where SKU_A
equals SKU_B
, or where Quantity_A
equals Quantity_B
.
Using Boolean Indexing with ==
(Recommended)
This is the most direct and idiomatic Pandas way. Compare the two columns element-wise to create a boolean Series, then use this Series to filter the DataFrame.
import pandas as pd
import numpy as np
df_example = pd.DataFrame({
'ID': [101, 102, 103, 104, 105, 106],
'SKU_A': ['X123', 'Y456', 'Z789', 'X123', 'A001', 'B002'],
'SKU_B': ['X123', 'Y450', 'Z789', 'X124', 'A001', np.nan],
})
# Compare 'SKU_A' and 'SKU_B' for equality
sku_equality_mask = (df_example['SKU_A'] == df_example['SKU_B'])
print("Boolean mask for SKU_A == SKU_B:")
print(sku_equality_mask)
print()
# ✅ Use the mask to select rows where the condition is True
df_skus_equal = df_example[sku_equality_mask]
# Or directly: df_skus_equal = df_example[df_example['SKU_A'] == df_example['SKU_B']]
print("Rows where SKU_A == SKU_B:")
print(df_skus_equal)
Output:
Boolean mask for SKU_A == SKU_B:
0 True
1 False
2 True
3 False
4 True
5 False
dtype: bool
Rows where SKU_A == SKU_B:
ID SKU_A SKU_B
0 101 X123 X123
2 103 Z789 Z789
4 105 A001 A001
df['Column1'] == df['Column2']
performs an element-wise comparison and returns a boolean Series.df[boolean_series]
uses this boolean Series to select rows where the value isTrue
.
Using DataFrame.query('ColA == ColB')
The DataFrame.query()
method allows you to specify the condition as a string.
import pandas as pd
import numpy as np
df_example = pd.DataFrame({
'ID': [101, 102, 103, 104, 105, 106],
'SKU_A': ['X123', 'Y456', 'Z789', 'X123', 'A001', 'B002'],
'SKU_B': ['X123', 'Y450', 'Z789', 'X124', 'A001', np.nan],
})
# ✅ Use query() to select rows where SKU_A equals SKU_B
df_skus_equal_query = df_example.query('SKU_A == SKU_B')
print("Rows where SKU_A == SKU_B (using query()):")
print(df_skus_equal_query)
Output:
Rows where SKU_A == SKU_B (using query()):
ID SKU_A SKU_B
0 101 X123 X123
2 103 Z789 Z789
4 105 A001 A001
- If your column names have spaces, enclose them in backticks:
query("`Column A` == `Column B`")
.
Select Rows Where Two Columns Are NOT Equal
Using Boolean Indexing with !=
(Recommended)
Use the "not equal" operator (!=
).
import pandas as pd
import numpy as np
df_example = pd.DataFrame({
'ID': [101, 102, 103, 104, 105, 106],
'SKU_A': ['X123', 'Y456', 'Z789', 'X123', 'A001', 'B002'],
'SKU_B': ['X123', 'Y450', 'Z789', 'X124', 'A001', np.nan],
})
# ✅ Select rows where SKU_A is NOT equal to SKU_B
df_skus_not_equal = df_example[df_example['SKU_A'] != df_example['SKU_B']]
print("Rows where SKU_A != SKU_B:")
print(df_skus_not_equal)
Output:
Rows where SKU_A != SKU_B:
ID SKU_A SKU_B
1 102 Y456 Y450
3 104 X123 X124
5 106 B002 NaN
Using Boolean Indexing with ~
(Tilde for Negation)
You can negate the result of an equality check using the tilde (~
) operator. Remember to wrap the equality condition in parentheses.
import pandas as pd
import numpy as np
df_example = pd.DataFrame({
'ID': [101, 102, 103, 104, 105, 106],
'SKU_A': ['X123', 'Y456', 'Z789', 'X123', 'A001', 'B002'],
'SKU_B': ['X123', 'Y450', 'Z789', 'X124', 'A001', np.nan],
})
# ✅ Negate the equality condition
df_skus_not_equal_tilde = df_example[~(df_example['SKU_A'] == df_example['SKU_B'])]
print("Rows where SKU_A != SKU_B (using tilde ~):")
print(df_skus_not_equal_tilde)
Output: (Same as using !=)
Rows where SKU_A != SKU_B (using tilde ~):
ID SKU_A SKU_B
1 102 Y456 Y450
3 104 X123 X124
5 106 B002 NaN
~(condition)
inverts the boolean Series produced bycondition
.
Using DataFrame.query('ColA != ColB')
import pandas as pd
import numpy as np
df_example = pd.DataFrame({
'ID': [101, 102, 103, 104, 105, 106],
'SKU_A': ['X123', 'Y456', 'Z789', 'X123', 'A001', 'B002'],
'SKU_B': ['X123', 'Y450', 'Z789', 'X124', 'A001', np.nan],
})
# ✅ Use query() with the inequality operator
df_skus_not_equal_query = df_example.query('SKU_A != SKU_B')
print("Rows where SKU_A != SKU_B (using query()):")
print(df_skus_not_equal_query)
Output: (Same as using !=)
Rows where SKU_A != SKU_B (using query()):
ID SKU_A SKU_B
1 102 Y456 Y450
3 104 X123 X124
5 106 B002 NaN
Handling NaN
Values in Comparisons
An important aspect of comparisons in Pandas (and NumPy) is how NaN
(Not a Number) values are treated:
NaN == NaN
isFalse
.NaN != NaN
isTrue
.- Any comparison of a non-NaN value with
NaN
(e.g.,'X123' == np.nan
or'X123' != np.nan
) results inFalse
for==
andTrue
for!=
.
Consider our original DataFrame where row 5 has SKU_B
as NaN
:
# For row 5: df['SKU_A'] is 'B002', df['SKU_B'] is np.nan
df['SKU_A'] == df['SKU_B'] # -> 'B002' == np.nan -> False
df['SKU_A'] != df['SKU_B'] # -> 'B002' != np.nan -> True
If you need to treat rows where both columns are NaN
as "equal" for the purpose of your filter, or handle NaN
s differently, you would need more specific logic:
# Example: Consider rows equal if both are NaN OR both are non-NaN and equal
condition_both_nan = df['SKU_A'].isna() & df['SKU_B'].isna()
condition_both_not_nan_and_equal = df['SKU_A'].notna() & df['SKU_B'].notna() & (df['SKU_A'] == df['SKU_B'])
df_equal_nan_aware = df[condition_both_nan | condition_both_not_nan_and_equal]
This is more advanced and depends on the specific definition of "equal" in the presence of missing values. For standard ==
and !=
, NaN
s behave as described above.
Conclusion
Comparing values between two columns in a Pandas DataFrame and selecting rows based on that comparison is straightforward:
- For selecting rows where values in two columns are equal:
- Use boolean indexing:
df[df['ColumnA'] == df['ColumnB']]
(Recommended). - Use
DataFrame.query()
:df.query('ColumnA == ColumnB')
.
- Use boolean indexing:
- For selecting rows where values in two columns are NOT equal:
- Use boolean indexing:
df[df['ColumnA'] != df['ColumnB']]
(Recommended). - Or negate equality:
df[~(df['ColumnA'] == df['ColumnB'])]
. - Use
DataFrame.query()
:df.query('ColumnA != ColumnB')
.
- Use boolean indexing:
Remember how NaN
values are handled in comparisons (NaN
is not equal to anything, including itself) and adjust your logic if you need specific NaN
handling. These methods provide powerful ways to filter and subset your data based on inter-column relationships.