How to Remove Duplicate Dictionaries from a List in Python
This guide explains how to remove duplicate dictionaries from a list in Python, based on the uniqueness of a specific key (such as an `id`). We'll explore dictionary comprehensions (most efficient), `for` loops, `enumerate`, and, briefly, Pandas (for more complex scenarios).
Removing Duplicates Based on a Key (Recommended)
Most often, you'll want to remove duplicates based on a specific key within the dictionaries (e.g., an ID or a username), on the assumption that this key's value should be unique.
Using Dictionary Comprehension
This is the most concise and efficient way to achieve this. It leverages the fact that dictionary keys must be unique:
```python
list_of_dictionaries = [
    {'id': 1, 'site': 'tutorialreference.com'},
    {'id': 2, 'site': 'google.com'},
    {'id': 1, 'site': 'tutorialreference.com'},  # Duplicate ID
]

# Use a dictionary comprehension to remove duplicates based on 'id'
unique_dicts = list({d['id']: d for d in list_of_dictionaries}.values())
print(unique_dicts)
```
Output:

```
[{'id': 1, 'site': 'tutorialreference.com'}, {'id': 2, 'site': 'google.com'}]
```
- `{d['id']: d for d in list_of_dictionaries}`: a dictionary comprehension that builds a new dictionary.
- `d['id']`: the key of the new dictionary is the `'id'` value from each dictionary in the original list.
- `d`: the value of the new dictionary is the entire dictionary `d` from the original list.
- Key uniqueness: because dictionary keys must be unique, if two dictionaries have the same `'id'`, the later one in the list overwrites the earlier one in the new dictionary. This is how we achieve deduplication.
- `.values()`: extracts the values of the new dictionary (which are the unique dictionaries).
- `list(...)`: converts the result back into a list.
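Note that because later entries overwrite earlier ones, this comprehension keeps the *last* dictionary for each duplicate `'id'`. If you'd rather keep the first occurrence, one sketch uses `dict.setdefault`, which only stores a value the first time a key appears (the third dictionary here is modified from the earlier examples so the difference is visible):

```python
list_of_dictionaries = [
    {'id': 1, 'site': 'tutorialreference.com'},
    {'id': 2, 'site': 'google.com'},
    {'id': 1, 'site': 'example.com'},  # same id, later in the list
]

# setdefault() only inserts a value if the key is not already present,
# so the first occurrence of each 'id' wins.
unique = {}
for d in list_of_dictionaries:
    unique.setdefault(d['id'], d)
unique_dicts = list(unique.values())
print(unique_dicts)
# [{'id': 1, 'site': 'tutorialreference.com'}, {'id': 2, 'site': 'google.com'}]
```

Since Python 3.7, dictionaries preserve insertion order, so the result also keeps the original first-occurrence order.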
Using a `for` Loop
Here's how to do it with a `for` loop, which is more verbose but perhaps easier for beginners to understand:
```python
list_of_dictionaries = [
    {'id': 1, 'site': 'tutorialreference.com'},
    {'id': 2, 'site': 'google.com'},
    {'id': 1, 'site': 'tutorialreference.com'},
]

new_list = []
seen_ids = set()  # Using a set for efficiency
for dictionary in list_of_dictionaries:
    if dictionary['id'] not in seen_ids:
        new_list.append(dictionary)  # Add the dictionary to the result
        seen_ids.add(dictionary['id'])  # Remember its id
print(new_list)
```
Output:

```
[{'id': 1, 'site': 'tutorialreference.com'}, {'id': 2, 'site': 'google.com'}]
```
- `seen_ids = set()`: a set keeps track of the `id` values we've already seen. Checking for membership in a set (`in seen_ids`) is very fast (O(1) on average).
- We only append the dictionaries that have a new `id`.
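The loop generalizes naturally into a small reusable helper. This is a sketch; the `unique_by_key` name is our own, not from any library:

```python
def unique_by_key(dicts, key):
    """Return the dictionaries whose value for `key` has not been seen before."""
    seen = set()
    result = []
    for d in dicts:
        if d[key] not in seen:
            seen.add(d[key])
            result.append(d)
    return result

list_of_dictionaries = [
    {'id': 1, 'site': 'tutorialreference.com'},
    {'id': 2, 'site': 'google.com'},
    {'id': 1, 'site': 'tutorialreference.com'},
]
print(unique_by_key(list_of_dictionaries, 'id'))
# [{'id': 1, 'site': 'tutorialreference.com'}, {'id': 2, 'site': 'google.com'}]
```

The same helper then works for any key, e.g. `unique_by_key(users, 'username')`.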
Using `enumerate` (Less Efficient, More Complex)
Using `enumerate` to check for duplicates is generally not recommended because it's less efficient and more complex than the other methods:
```python
list_of_dictionaries = [
    {'id': 1, 'site': 'tutorialreference.com'},
    {'id': 2, 'site': 'google.com'},
    {'id': 1, 'site': 'tutorialreference.com'},
]

new_list = [
    dictionary for index, dictionary in enumerate(list_of_dictionaries)
    if dictionary not in list_of_dictionaries[index + 1:]
]
print(new_list)
```
Output:

```
[{'id': 2, 'site': 'google.com'}, {'id': 1, 'site': 'tutorialreference.com'}]
```
- This approach compares each dictionary (all of its key-value pairs, not just `'id'`) to every subsequent dictionary in the list, making it less efficient (O(n^2) in the worst case) than the set-based or dictionary comprehension methods (which are closer to O(n)). It also keeps the *last* occurrence of each duplicate, which is why the output order differs from the earlier methods.
Removing Duplicates Based on the Entire Dictionary (Less Common)
If you want to remove dictionaries that are completely identical (all key-value pairs match), you can't put them in a set directly, because dictionaries are not hashable. However, you can use the `in` operator with a list:
```python
list_of_dictionaries = [
    {'id': 1, 'site': 'tutorialreference.com'},
    {'id': 2, 'site': 'google.com'},
    {'id': 1, 'site': 'tutorialreference.com'},
]

new_list = []
for dictionary in list_of_dictionaries:
    if dictionary not in new_list:
        new_list.append(dictionary)
print(new_list)
```
- We create a new list and iterate over the original, appending only the dictionaries that haven't been added before. Note that `dictionary not in new_list` is a linear scan, so this approach is O(n^2) overall.
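If all the dictionary values are themselves hashable (strings, numbers, tuples, etc.), a sketch that avoids the quadratic cost converts each dictionary to a `frozenset` of its items and tracks those fingerprints in a set:

```python
list_of_dictionaries = [
    {'id': 1, 'site': 'tutorialreference.com'},
    {'id': 2, 'site': 'google.com'},
    {'id': 1, 'site': 'tutorialreference.com'},
]

new_list = []
seen = set()
for dictionary in list_of_dictionaries:
    # A frozenset of the items is a hashable snapshot of all key-value pairs
    fingerprint = frozenset(dictionary.items())
    if fingerprint not in seen:
        seen.add(fingerprint)
        new_list.append(dictionary)
print(new_list)
# [{'id': 1, 'site': 'tutorialreference.com'}, {'id': 2, 'site': 'google.com'}]
```

This keeps the first occurrence, runs in roughly O(n), but raises `TypeError` if any value is unhashable (e.g., a nested list).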
Using Pandas `drop_duplicates()` (for DataFrames)
If you're working with tabular data, Pandas DataFrames offer a very convenient `drop_duplicates()` method:
```python
import pandas as pd

list_of_dictionaries = [
    {'id': 1, 'site': 'tutorialreference.com'},
    {'id': 2, 'site': 'google.com'},
    {'id': 1, 'site': 'tutorialreference.com'},
]

new_list = pd.DataFrame(list_of_dictionaries).drop_duplicates().to_dict('records')
print(new_list)
```
Output:

```
[{'id': 1, 'site': 'tutorialreference.com'}, {'id': 2, 'site': 'google.com'}]
```
- `pd.DataFrame(list_of_dictionaries)`: creates a DataFrame from your list of dictionaries.
- `.drop_duplicates()`: removes duplicate rows. By default, it considers all columns for duplication; you can restrict it to specific columns with the `subset` parameter (e.g., `drop_duplicates(subset=['id'])`).
- `.to_dict('records')`: converts the DataFrame back into a list of dictionaries. The `'records'` argument is important; it gives you the desired format.
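For example, to deduplicate on the `id` column only, pass `subset=['id']`; by default `drop_duplicates` keeps the first row for each duplicate (`keep='first'`). The third dictionary here is modified from the earlier examples so the effect is visible:

```python
import pandas as pd

list_of_dictionaries = [
    {'id': 1, 'site': 'tutorialreference.com'},
    {'id': 2, 'site': 'google.com'},
    {'id': 1, 'site': 'example.com'},  # same id, different site
]

# Deduplicate on 'id' only; the first row for each id is kept
new_list = (
    pd.DataFrame(list_of_dictionaries)
    .drop_duplicates(subset=['id'])
    .to_dict('records')
)
print(new_list)
# [{'id': 1, 'site': 'tutorialreference.com'}, {'id': 2, 'site': 'google.com'}]
```

Pass `keep='last'` instead to keep the final row for each `id`.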