How to Count Unique Words or Unique Characters in a String in Python
Counting the number of unique items (either words or individual characters) within a piece of text is a common task in text processing and data analysis. Python's built-in set data structure, which automatically stores only unique elements, provides a highly efficient way to achieve this.
This guide demonstrates how to count unique words and characters in strings and text files using sets, dict.fromkeys(), and loops.
Count Unique Words in a String
Goal: Find the number of distinct words in a given string.
Using split() and set() (Recommended)
This is the most concise and Pythonic method.
- Split the string into a list of words using str.split().
- Convert the list of words into a set to automatically remove duplicates.
- Get the length of the set using len().
text = "apple banana apple orange banana apple" # example String
print(f"Original string: '{text}'")
# 1. Split into words
words_list = text.split()
print(f"List of words: {words_list}")
# 2. Convert to set to get unique words
unique_word_set = set(words_list)
print(f"Set of unique words: {unique_word_set}")
# 3. Get the length of the set
count_unique_words = len(unique_word_set)
print(f"Number of unique words: {count_unique_words}")
# --- Condensed Version ---
text = "hello world hello python world"
unique_word_count = len(set(text.split()))
print(f"String: '{text}'")
print(f"Unique word count: {unique_word_count}")
Output:
Original string: 'apple banana apple orange banana apple'
List of words: ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
Set of unique words: {'orange', 'apple', 'banana'}
Number of unique words: 3
String: 'hello world hello python world'
Unique word count: 3
- text.split(): Splits the string by whitespace into a list of words.
- set(...): Creates a set from the list, discarding duplicates.
- len(...): Returns the number of elements in the set (which is the count of unique words).
Using a for Loop
Manually iterate, split, and keep track of words seen so far.
text = "apple banana apple orange banana apple"
words_list = text.split()
unique_words_list = [] # Keep track of unique words found
print(f"Original string: '{text}'")
for word in words_list:
    if word not in unique_words_list:  # Check if word is already seen
        unique_words_list.append(word)
count_unique_loop = len(unique_words_list)
print(f"Unique words (loop): {unique_words_list}")
print(f"Unique word count (loop): {count_unique_loop}")
Output:
Original string: 'apple banana apple orange banana apple'
Unique words (loop): ['apple', 'banana', 'orange']
Unique word count (loop): 3
This is less efficient than using a set because the word not in unique_words_list check scans the list on every iteration, so it becomes slower as the list grows.
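If you want the loop's first-appearance ordering without the slow list lookup, one common pattern is to keep a set next to the list and use the set only for the membership check. The following is a minimal sketch; the helper name unique_in_order is hypothetical and not part of the original examples.

```python
def unique_in_order(items):
    """Return the distinct items in order of first appearance."""
    seen = set()    # used only for fast membership checks
    ordered = []    # keeps the order in which items first appear
    for item in items:
        if item not in seen:
            seen.add(item)
            ordered.append(item)
    return ordered


text = "apple banana apple orange banana apple"
unique_words = unique_in_order(text.split())
print(f"Unique words (loop with set): {unique_words}")
print(f"Unique word count: {len(unique_words)}")
```

Set membership checks take roughly constant time on average, so this version stays fast as the number of distinct words grows.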
Count Unique Words in a Text File
Goal: Find the number of distinct words in an entire text file.
Example sample.txt file:
this is line one
this is line two
line three has more words
Using read(), split(), and set() (Recommended)
Read the whole file, split into words, and use a set.
import os
# Create dummy file for example
filename = "sample.txt"
with open(filename, "w", encoding="utf-8") as f:
    f.write("this is line one\nthis is line two\nline three has more words")
unique_word_count_file = 0
unique_words_in_file = set()
try:
    with open(filename, 'r', encoding='utf-8') as f:
        # 1. Read entire file content
        file_content = f.read()
        # 2. Split content into list of words
        all_words = file_content.split()
        print(f"Words read from '{filename}':\n{all_words}")
        # 3. Convert to set for unique words
        unique_words_in_file = set(all_words)
        # 4. Get the count
        unique_word_count_file = len(unique_words_in_file)
        print(f"\nUnique words in file: {unique_words_in_file}")
        print(f"Count of unique words in file: {unique_word_count_file}")
except FileNotFoundError:
    print(f"Error: File '{filename}' not found.")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up the dummy file
    if os.path.exists(filename):
        os.remove(filename)
Output:
Words read from 'sample.txt':
['this', 'is', 'line', 'one', 'this', 'is', 'line', 'two', 'line', 'three', 'has', 'more', 'words']
Unique words in file: {'more', 'one', 'three', 'two', 'has', 'this', 'is', 'words', 'line'}
Count of unique words in file: 9
- f.read(): Reads the entire file content into a single string.
- .split(): Splits that string into a list of words based on whitespace.
- set() and len() work as before.
For very large files, reading the entire content at once might consume too much memory. You might need to process the file line by line or in chunks.
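One way to process the file line by line is to iterate over the file object and feed each line's words into a single set with set.update(). The sketch below assumes a hypothetical helper named count_unique_words_in_file and writes its sample data to a temporary directory so that it can run on its own.

```python
import os
import tempfile


def count_unique_words_in_file(path, encoding="utf-8"):
    """Count distinct whitespace-separated words, reading one line at a time."""
    unique_words = set()
    with open(path, "r", encoding=encoding) as f:
        for line in f:  # only one line is held in memory at a time
            unique_words.update(line.split())
    return len(unique_words)


# Demonstration with a temporary copy of the sample file
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("this is line one\nthis is line two\nline three has more words")
    file_unique_count = count_unique_words_in_file(sample_path)

print(f"Count of unique words in file: {file_unique_count}")
```

Note that the set itself still grows with the number of distinct words, so this reduces memory use relative to the file size, not relative to the vocabulary size.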
Using a for Loop
Read the file, split, and manually track unique words.
import os
filename = "sample.txt"
with open(filename, "w", encoding="utf-8") as f:
    f.write("this is line one\nthis is line two")
unique_words_file_loop = []
try:
    with open(filename, 'r', encoding='utf-8') as f:
        file_content = f.read()
        all_words = file_content.split()
        print(f"Words from file (loop): {all_words}")
        for word in all_words:
            if word not in unique_words_file_loop:
                unique_words_file_loop.append(word)
    print(f"Unique words (loop): {unique_words_file_loop}")
    print(f"Unique word count (loop): {len(unique_words_file_loop)}")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up the dummy file
    if os.path.exists(filename):
        os.remove(filename)
Output:
Words from file (loop): ['this', 'is', 'line', 'one', 'this', 'is', 'line', 'two']
Unique words (loop): ['this', 'is', 'line', 'one', 'two']
Unique word count (loop): 5
Count Unique Characters in a String
Goal: Find the number of distinct characters (letters, numbers, symbols, whitespace) in a string.
Using set() (Recommended)
This is the most direct way: passing a string directly to set() treats each character as an element.
text = "programming"
print(f"Original string: '{text}'")
# Convert string directly to a set of characters
unique_char_set = set(text)
print(f"Set of unique characters: {unique_char_set}")
# Get the length of the set
unique_char_count = len(unique_char_set)
print(f"Number of unique characters: {unique_char_count}")
Output:
Original string: 'programming'
Set of unique characters: {'i', 'g', 'o', 'n', 'm', 'a', 'r', 'p'}
Number of unique characters: 8
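Because set() counts every character, spaces and differently cased letters are counted as distinct characters too. If only letters matter, a set comprehension can filter and normalize the characters first. This is a small sketch; the variable names are illustrative.

```python
text = "Hello World"

# Every character counts, including the space and both 'H' and 'W' in uppercase
all_chars = set(text)

# Lowercase the text and keep only alphabetic characters
letters_only = {ch for ch in text.lower() if ch.isalpha()}

print(f"All unique characters: {len(all_chars)}")
print(f"Unique letters, ignoring case: {len(letters_only)}")
```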
Using dict.fromkeys()
Create dictionary keys from the characters (duplicates are automatically removed), then count the keys.
text = "programming"
print(f"Original string: '{text}'")
# Create a dictionary where keys are unique characters
unique_char_dict = dict.fromkeys(text)
print(f"Dict from keys: {unique_char_dict}")
# Count the number of keys
unique_char_count_dict = len(unique_char_dict)
print(f"Unique char count (dict): {unique_char_count_dict}")
Output:
Original string: 'programming'
Dict from keys: {'p': None, 'r': None, 'o': None, 'g': None, 'a': None, 'm': None, 'i': None, 'n': None}
Unique char count (dict): 8
While this works, using set() is generally considered more direct for finding unique elements.
Using a for Loop
Manually iterate through characters and track unique ones seen.
text = "programming"
unique_chars_list = []
print(f"Original string: '{text}'")
for char in text:
    if char not in unique_chars_list:
        unique_chars_list.append(char)
count_unique_chars_loop = len(unique_chars_list)
print(f"Unique chars (loop): {unique_chars_list}")
print(f"Unique char count (loop): {count_unique_chars_loop}")
Output:
Original string: 'programming'
Unique chars (loop): ['p', 'r', 'o', 'g', 'a', 'm', 'i', 'n']
Unique char count (loop): 8
Again, this is less efficient than using a set.
Getting the Unique Items (Not Just the Count)
All the methods above that create a set or a list of unique items (unique_word_set, unique_words_list, unique_char_set, unique_chars_list) already give you the unique items themselves.
- Using set(iterable) is fastest for uniqueness but loses original order.
- Using dict.fromkeys(iterable).keys() (Python 3.7+) preserves insertion order.
- Using the for loop method preserves the order of first appearance.
To get a unique list while preserving order of first appearance, you can use the for loop method or dict.fromkeys(), since dictionaries keep their keys in insertion order in Python 3.7+:
text = "apple banana apple orange banana apple"
words = text.split()
# Preserving order using dict keys (Python 3.7+)
unique_ordered_words = list(dict.fromkeys(words))
print(f"Unique words preserving order: {unique_ordered_words}")
Output:
Unique words preserving order: ['apple', 'banana', 'orange']
Case Sensitivity and Punctuation Considerations
- Case Sensitivity: The methods shown are case-sensitive ('Apple' and 'apple' are different). To count unique words ignoring case, convert the string or words to lowercase first: len(set(text.lower().split())).
- Punctuation: split() separates by whitespace. Punctuation attached to words (e.g., "word,", "word.") will be treated as part of the word. To handle punctuation separately, you might need regular expressions (re.findall(r'\b\w+\b', text)) or more advanced string cleaning before splitting and counting.
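The two adjustments above can be combined. The sketch below lowercases the text and extracts words with the regular expression mentioned above; the helper name count_unique_words_normalized and the sample sentence are assumptions made for illustration.

```python
import re


def count_unique_words_normalized(text):
    """Count distinct words, ignoring case and surrounding punctuation."""
    words = re.findall(r"\b\w+\b", text.lower())
    return len(set(words))


sample = "Apple, apple. Banana! banana orange"

naive_count = len(set(sample.split()))
normalized_count = count_unique_words_normalized(sample)

print(f"Naive unique word count: {naive_count}")
print(f"Normalized unique word count: {normalized_count}")
```

Keep in mind that \w+ also splits on apostrophes and hyphens, so a word like "don't" would be counted as "don" and "t". Text with contractions may need a different pattern.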
Conclusion
Counting unique words or characters in Python is efficiently done using the set data type.
- For unique words in a string: Use len(set(my_string.split())).
- For unique words in a file: Use len(set(file_content.split())) after reading the file (f.read()).
- For unique characters in a string: Use len(set(my_string)).
Remember that set removes duplicates automatically. Use .lower() before creating the set for case-insensitive counting. If order preservation is needed, consider using dict.fromkeys() (Python 3.7+) or a manual loop approach.