How to Find Common Substrings and Characters in Python
This guide explores how to find common substrings (contiguous sequences of characters) and common characters (regardless of position) between two or more strings in Python. We'll cover:
- Finding the longest common substring using difflib.SequenceMatcher.
- Finding a common prefix using os.path.commonprefix.
- Finding all common characters using sets and intersections.
Finding the Longest Common Substring with SequenceMatcher
The difflib module's SequenceMatcher class is designed for comparing sequences (including strings) and finding similarities. Its find_longest_match() method is perfect for finding the longest common substring.
from difflib import SequenceMatcher
string1 = 'one two three four'
string2 = 'one two nine ten'
match = SequenceMatcher(None, string1, string2).find_longest_match(
    0, len(string1), 0, len(string2)
)
print(match) # Output: Match(a=0, b=0, size=8)
print(string1[match.a:match.a + match.size]) # Output: one two
print(string2[match.b:match.b + match.size]) # Output: one two
- SequenceMatcher(None, string1, string2): Creates a SequenceMatcher object to compare string1 and string2. The None argument means we're not using a custom "junk" detection function (which is fine for this task).
- .find_longest_match(0, len(string1), 0, len(string2)): This finds the longest matching substring.
  - The arguments are alo, ahi, blo, bhi. These define the ranges within the two strings to compare. We're comparing the entire strings, so we use 0 and len(string) for both.
- match object: The find_longest_match() method returns a Match object (a named tuple) with three attributes:
  - a: The starting index of the match in the first string (string1).
  - b: The starting index of the match in the second string (string2).
  - size: The length of the matching substring.
Extracting the Substring
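We use string slicing (string1[match.a:match.a + match.size]) to extract the actual matching substring from either of the original strings. In the example above the match is 8 characters long because it includes the space that follows two in both strings.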
The substring does not need to be at the beginning of the string. It works for any common substring:
from difflib import SequenceMatcher
string1 = 'four five one two three four'
string2 = 'zero eight one two nine ten'
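match = SequenceMatcher(None, string1, string2).find_longest_match(
    0, len(string1), 0, len(string2)
)
print(match) # Output: Match(a=9, b=10, size=9)
# The match includes the space before one and the space after two.
print(string1[match.a : match.a + match.size]) # Output: " one two " (quotes added to show the spaces)
print(string2[match.b : match.b + match.size]) # Output: " one two " (quotes added to show the spaces)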
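The match-then-slice pattern from the examples above can be wrapped in a small helper. This is a minimal sketch, not part of the original guide: the function name longest_common_substring is hypothetical, and passing autojunk=False is an added assumption. It switches off the SequenceMatcher heuristic that treats very frequent items as junk once a sequence has 200 or more items, which could otherwise affect matches in long strings.

```python
from difflib import SequenceMatcher


def longest_common_substring(first, second):
    """Return the longest contiguous substring shared by both strings.

    Hypothetical helper for illustration. Returns '' when nothing matches.
    """
    # autojunk=False disables the "popular item" heuristic that applies
    # to sequences of 200 or more items.
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    match = matcher.find_longest_match(0, len(first), 0, len(second))
    return first[match.a:match.a + match.size]


# repr() makes leading and trailing spaces visible.
print(repr(longest_common_substring(
    'four five one two three four',
    'zero eight one two nine ten',
)))
print(repr(longest_common_substring('abc', 'xyz')))
```

When several substrings tie for the longest length, find_longest_match() reports the one that starts earliest in the first string.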
Removing Leading/Trailing Whitespace
If you want to ignore leading/trailing whitespace in the matched substring, use .strip():
from difflib import SequenceMatcher
string1 = 'four five one two three four'
string2 = 'zero eight one two nine ten'
match = SequenceMatcher(None, string1, string2).find_longest_match(
    0, len(string1), 0, len(string2)
)
print(match) # Output: Match(a=9, b=10, size=9)
print(string1[match.a:match.a + match.size].strip()) # Output: one two
print(string2[match.b:match.b + match.size].strip()) # Output: one two
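Stripping only trims the ends of the match. A character-level match can still begin or end in the middle of a word. If whole words are what matter, one possible approach is to split both strings into word lists and compare those instead, since SequenceMatcher accepts any sequences of hashable items. The sketch below assumes whitespace-separated words. The function name longest_common_words is hypothetical.

```python
from difflib import SequenceMatcher


def longest_common_words(text1, text2):
    """Return the longest run of consecutive whole words shared by both texts.

    Hypothetical helper for illustration; words are split on whitespace.
    """
    words1 = text1.split()
    words2 = text2.split()
    matcher = SequenceMatcher(None, words1, words2, autojunk=False)
    match = matcher.find_longest_match(0, len(words1), 0, len(words2))
    # Slice the word list, then join the words back into one string.
    return ' '.join(words1[match.a:match.a + match.size])


print(longest_common_words(
    'four five one two three four',
    'zero eight one two nine ten',
))  # one two

# A character-level match would return 'team engine' here,
# but the word-level match returns only the shared whole word.
print(longest_common_words('steam engine', 'team engine'))  # engine
```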
Finding the Longest Common Prefix with os.path.commonprefix
If you specifically need the longest common substring that starts at the beginning of the strings (the common prefix), use os.path.commonprefix():
import os
string1 = 'one two three four'
string2 = 'one two nine ten'
string3 = 'one two eight'
common_substring = os.path.commonprefix([string1, string2, string3])
print(common_substring) # Output: one two
- os.path.commonprefix(): This function is designed for finding common prefixes in file paths, but it works perfectly well for any strings. It takes a list of strings as input and compares them character by character, so the result can end in the middle of a word (here it ends with the space after two).
- This approach only finds the common prefix. It won't find common substrings in the middle or at the end.
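Because os.path.commonprefix() only looks at the start of the strings, a common ending needs a small workaround. One possible sketch is to reverse every string, take the common prefix, and reverse the result. The function name common_suffix is hypothetical and the file names are made-up sample data.

```python
import os


def common_suffix(strings):
    """Return the longest suffix shared by every string in the list.

    Hypothetical helper for illustration: it reverses each string,
    reuses os.path.commonprefix(), and reverses the result back.
    """
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


print(common_suffix([
    'report_final.txt',
    'notes_final.txt',
    'draft_final.txt',
]))  # _final.txt
```

As with the prefix version, the comparison is character by character, so the result is not guaranteed to start at a word boundary.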
Finding All Common Characters (with Duplicates)
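If you want to find all characters that appear in both strings, regardless of their order or position, use sets and intersections. Sets collapse duplicates, so each common character appears only once in the result. The list comprehension and for loop approaches further down keep repeated characters from the first string.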
Using Sets (Recommended)
string1 = 'abcd'
string2 = 'abzx'
common_characters = ''.join(set(string1).intersection(string2))
print(common_characters) # Output: ab (order may vary)
print(len(common_characters)) # Output: 2
- set(string1): Converts the first string to a set. This automatically removes duplicate characters.
- .intersection(string2): Finds the characters common to the set and string2. The method accepts any iterable, so string2 does not need to be converted to a set first.
- ''.join(...): Joins the resulting set of characters back into a string. The order of characters in the output string is not guaranteed to match the original order.
- Use the len() function to get the number of common elements.
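The same idea extends to more than two strings, because set.intersection() accepts any number of iterables. The sketch below is an illustration under that assumption. The function name common_characters is hypothetical, and the result is sorted only to make the output order predictable.

```python
def common_characters(*strings):
    """Return the characters present in every given string, sorted.

    Hypothetical helper for illustration. Each character appears at
    most once in the result because sets discard duplicates.
    """
    if not strings:
        return ''
    first, *rest = strings
    # intersection() accepts several iterables at once.
    shared = set(first).intersection(*rest)
    return ''.join(sorted(shared))


print(common_characters('abcd', 'abzx', 'bax'))  # ab
```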
Using List Comprehensions
string1 = 'abcd'
string2 = 'abzx'
common_characters = ''.join([
    char for char in string1
    if char in string2
])
print(common_characters) # Output: ab
print(len(common_characters)) # Output: 2
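- The list comprehension keeps each character of string1 that also occurs in string2. Unlike the set approach, it preserves the order of string1 and keeps any character that is repeated in string1.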
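The list comprehension keeps every repeat from the first string, even when the second string contains that character only once (for 'aab' and 'ab' it yields 'aab'). If each common character should be counted only as many times as it occurs in both strings, one option is a multiset intersection with collections.Counter. This is a sketch of an alternative technique, not the method used above. The function name is hypothetical.

```python
from collections import Counter


def common_characters_with_counts(first, second):
    """Return shared characters, repeated as often as both strings allow.

    Hypothetical helper for illustration. The & operator on two Counter
    objects keeps the minimum count of each key.
    """
    shared = Counter(first) & Counter(second)
    # elements() repeats each character according to its count.
    return ''.join(sorted(shared.elements()))


print(common_characters_with_counts('aabbc', 'abbbd'))  # abb
```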
Using a for Loop
string1 = 'abcd'
string2 = 'abzx'
common_characters = []
for char in string1:
    if char in string2:
        common_characters.append(char)
print(common_characters) # Output: ['a', 'b']
print(''.join(common_characters)) # Output: ab
- The for loop iterates over each character in string1 and then checks if the character is also present in string2.
- The common characters are added to the common_characters list.
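If the result should follow the order of the first string but list each common character only once, the loop can be combined with dict.fromkeys(), which keeps the first occurrence of each key in insertion order. The following is a minimal sketch under that assumption. The function name ordered_common_characters is hypothetical.

```python
def ordered_common_characters(first, second):
    """Return common characters once each, in the order they appear in first.

    Hypothetical helper for illustration.
    """
    # A set gives fast membership checks for longer strings.
    lookup = set(second)
    # dict.fromkeys() drops repeats while preserving insertion order.
    unique_in_order = dict.fromkeys(
        char for char in first if char in lookup
    )
    return ''.join(unique_in_order)


print(ordered_common_characters('banana', 'bandana'))  # ban
```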