Python Pandas: How to Get Distinct (Unique) Rows Across Multiple Columns

Identifying and selecting distinct or unique rows based on the combined values of multiple columns is a common data cleaning and analysis task in Pandas. This is equivalent to SQL's SELECT DISTINCT col1, col2, ... functionality. Pandas provides the DataFrame.drop_duplicates() method as the primary way to achieve this.

This guide explains how to select distinct rows across multiple (or all) DataFrame columns and how to get a single array of all unique values present across selected columns.

The Goal: Identifying Unique Row Combinations

Given a Pandas DataFrame, we want to obtain a new DataFrame that contains only the unique rows. A row is considered unique if its combination of values across specified columns (or all columns) has not appeared before. This effectively removes duplicate entire rows or rows that are duplicates based on a subset of columns.

Example DataFrame

import pandas as pd

data = {
    'CustomerID': [101, 102, 101, 103, 102, 101, 104],
    'Product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse', 'Laptop', 'Monitor'],
    'Region': ['North', 'South', 'North', 'West', 'South', 'North', 'West'],
    'Quantity': [1, 2, 1, 1, 2, 1, 3]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
   CustomerID   Product Region  Quantity
0         101    Laptop  North         1
1         102     Mouse  South         2
2         101    Laptop  North         1
3         103  Keyboard   West         1
4         102     Mouse  South         2
5         101    Laptop  North         1
6         104   Monitor   West         3

The DataFrame.drop_duplicates(subset=None, keep='first', inplace=False) method is the primary tool for this.

Considering All Columns for Uniqueness

By default (if subset is not specified), drop_duplicates() considers all columns to identify duplicate rows. It keeps the first occurrence (keep='first') of each unique row by default.

import pandas as pd

df_example = pd.DataFrame({
    'CustomerID': [101, 102, 101, 103, 102],
    'Product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse'],
    'Region': ['North', 'South', 'North', 'West', 'South']
})

# ✅ Drop duplicate rows, considering all columns
df_distinct_all_cols = df_example.drop_duplicates()

print("DataFrame with distinct rows (all columns considered):")
print(df_distinct_all_cols)

Output:

DataFrame with distinct rows (all columns considered):
   CustomerID   Product Region
0         101    Laptop  North
1         102     Mouse  South
3         103  Keyboard   West
Note: Rows at original index 2 (a duplicate of index 0) and index 4 (a duplicate of index 1) are removed.

Considering a SUBSET of Columns for Uniqueness

You can specify a list of column names to the subset parameter. Duplicates will be identified based only on the values in these specified columns.

import pandas as pd

df_example = pd.DataFrame({
    'CustomerID': [101, 102, 101, 103, 102, 101],
    'Product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse', 'Webcam'],  # Product for the last 101 is different
    'Region': ['North', 'South', 'North', 'West', 'South', 'North']
})

# ✅ Drop duplicates based on 'CustomerID' and 'Region' combination
# It will keep the first row for each unique (CustomerID, Region) pair.
df_distinct_subset = df_example.drop_duplicates(subset=['CustomerID', 'Region'])

print("DataFrame with distinct rows based on 'CustomerID' and 'Region':")
print(df_distinct_subset)

Output:

DataFrame with distinct rows based on 'CustomerID' and 'Region':
   CustomerID   Product Region
0         101    Laptop  North
1         102     Mouse  South
3         103  Keyboard   West
  • The keep parameter can be 'first' (default), 'last' (keep the last occurrence), or False (drop all duplicates).
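The effect of each keep value can be seen on the same kind of data; this is a small sketch (the column names mirror the example above) comparing keep='last' and keep=False:

```python
import pandas as pd

df_example = pd.DataFrame({
    'CustomerID': [101, 102, 101, 103, 102, 101],
    'Product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse', 'Webcam'],
    'Region': ['North', 'South', 'North', 'West', 'South', 'North']
})

# keep='last': retain the LAST row of each (CustomerID, Region) pair
last_kept = df_example.drop_duplicates(subset=['CustomerID', 'Region'], keep='last')

# keep=False: drop EVERY row whose (CustomerID, Region) pair appears more than once
only_singletons = df_example.drop_duplicates(subset=['CustomerID', 'Region'], keep=False)

print(last_kept)         # rows at index 3, 4, 5 survive
print(only_singletons)   # only row 3 (103, 'West') is never repeated
```

With keep='last', the (101, 'North') pair is now represented by the 'Webcam' row (index 5) rather than the first 'Laptop' row.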

Modifying the DataFrame In-Place (inplace=True)

To modify the original DataFrame directly, use inplace=True.

import pandas as pd

df_example = pd.DataFrame({
    'CustomerID': [101, 102, 101, 103, 102, 101],
    'Product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse', 'Webcam'],  # Product for the last 101 is different
    'Region': ['North', 'South', 'North', 'West', 'South', 'North']
})

df_inplace_example = df_example.copy() # Work on a copy for this demo
df_inplace_example.drop_duplicates(inplace=True)
print("DataFrame after inplace drop_duplicates():")
print(df_inplace_example) # df_inplace_example itself is modified

Output:

DataFrame after inplace drop_duplicates():
   CustomerID   Product Region
0         101    Laptop  North
1         102     Mouse  South
3         103  Keyboard   West
5         101    Webcam  North

When inplace=True, the method returns None.
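Because the non-inplace call returns a new DataFrame, plain reassignment achieves the same result and keeps the call chainable; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [101, 102, 101],
    'Region': ['North', 'South', 'North']
})

# Reassigning the result is equivalent to inplace=True for this purpose,
# and allows chaining further methods such as reset_index().
df = df.drop_duplicates().reset_index(drop=True)
print(df)  # two rows, reindexed 0..1
```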

Getting ALL Unique Values Across Multiple Columns into a SINGLE Array/List

This is a different goal: you want a flat list/array of all unique values that appear in any of the specified columns, not unique row combinations.

Using pd.unique(df[cols_list].values.ravel('K'))

  1. Select the desired columns: df[['ColA', 'ColB']].
  2. Get their values as a NumPy array: .values.
  3. Flatten this 2D array into 1D: .ravel('K') (or 'F' or 'C'). 'K' flattens in memory order, which is often fastest.
  4. Pass the flattened array to pd.unique() to get unique values.

import pandas as pd

df_example = pd.DataFrame({
'ColX': ['Apple', 'Banana', 'Apple', 'Orange'],
'ColY': ['Red', 'Yellow', 'Green', 'Orange'],
'ColZ': ['FruitA', 'FruitB', 'FruitC', 'FruitD']
})
columns_to_get_uniques_from = ['ColX', 'ColY']

# Select columns, get values, ravel (flatten), then find unique
unique_values_across_cols = pd.unique(df_example[columns_to_get_uniques_from].values.ravel('K'))

print(f"Single array of unique values from columns {columns_to_get_uniques_from}:")
print(unique_values_across_cols)

Output:

Single array of unique values from columns ['ColX', 'ColY']:
['Apple' 'Banana' 'Orange' 'Red' 'Yellow' 'Green']
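A pandas-only alternative is to stack the selected columns into a single Series and take its uniques; note the traversal here is row by row (ColX then ColY within each row), so the order of first appearance can differ from the ravel('K') result above. A sketch on the same data:

```python
import pandas as pd

df_example = pd.DataFrame({
    'ColX': ['Apple', 'Banana', 'Apple', 'Orange'],
    'ColY': ['Red', 'Yellow', 'Green', 'Orange'],
})

# stack() collapses the selected columns into one Series (row-major order);
# Series.unique() then returns values in order of first appearance.
unique_stacked = df_example[['ColX', 'ColY']].stack().unique()
print(unique_stacked)
```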

Using numpy.unique(df[cols_list].values)

NumPy's np.unique() can also be applied to the flattened array of values. It returns a sorted array of unique values.

import pandas as pd
import numpy as np

df_example = pd.DataFrame({
    'ColX': ['Apple', 'Banana', 'Apple', 'Orange'],
    'ColY': ['Red', 'Yellow', 'Green', 'Orange'],
})
columns_to_get_uniques_from = ['ColX', 'ColY']

# Using np.unique (which also flattens the 2D .values array by default)
unique_values_np = np.unique(df_example[columns_to_get_uniques_from].values)

print(f"Single array of unique values (sorted) from {columns_to_get_uniques_from} (using np.unique):")
print(unique_values_np)

Output:

Single array of unique values (sorted) from ['ColX', 'ColY'] (using np.unique):
['Apple' 'Banana' 'Green' 'Orange' 'Red' 'Yellow']

Getting Unique Values for EACH Column Individually (Not Row Combinations)

If you want a dictionary where keys are column names and values are lists/arrays of unique values within that specific column, you can loop or use apply().

import pandas as pd

df_example = pd.DataFrame({
    'ColX': ['Apple', 'Banana', 'Apple', 'Orange'],
    'ColY': ['Red', 'Yellow', 'Green', 'Orange'],
})

unique_per_column = {}
for col_name in df_example.columns:
    unique_per_column[col_name] = df_example[col_name].unique()

print("Unique values for each column individually:")
print(unique_per_column)
print()

# Using apply:
unique_per_column_apply = df_example.apply(lambda col: col.unique())
print(unique_per_column_apply)

Output:

Unique values for each column individually:
{'ColX': array(['Apple', 'Banana', 'Orange'], dtype=object), 'ColY': array(['Red', 'Yellow', 'Green', 'Orange'], dtype=object)}

ColX [Apple, Banana, Orange]
ColY [Red, Yellow, Green, Orange]
dtype: object
Note: This gives per-column uniques and is different from finding unique row combinations.

Conclusion

Pandas provides powerful tools for handling distinct values and rows:

  • To get distinct row combinations based on all columns or a subset of columns, the DataFrame.drop_duplicates(subset=..., keep=...) method is the standard and recommended approach.
  • To get a single flat array/list of all unique values present across several specified columns, select those columns, get their .values, .ravel() them into a 1D array, and then use pd.unique() or np.unique().

Understanding the difference between these operations is key: drop_duplicates works on rows based on combinations of column values, while pd.unique on a raveled array works on individual values across columns.
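Related: drop_duplicates() is built on the boolean mask returned by DataFrame.duplicated(), which is useful when you want to inspect or count duplicates rather than drop them. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [101, 102, 101, 103],
    'Region': ['North', 'South', 'North', 'West']
})

# duplicated() marks every repeat (beyond the first) of a (CustomerID, Region) pair
dup_mask = df.duplicated(subset=['CustomerID', 'Region'])
print("Number of duplicate rows:", dup_mask.sum())  # 1

# Negating the mask reproduces drop_duplicates() on the same subset
distinct = df[~dup_mask]
print(distinct)
```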