How to Find Common Substrings and Characters in Python

This guide explores how to find common substrings (contiguous sequences of characters) and common characters (regardless of position) between two or more strings in Python. We'll cover:

  • Finding the longest common substring using difflib.SequenceMatcher.
  • Finding a common prefix using os.path.commonprefix.
  • Finding all common characters using sets and intersections.

Finding the Longest Common Substring with SequenceMatcher

The difflib module's SequenceMatcher class compares pairs of sequences (including strings) and reports where they match. Its find_longest_match() method returns the longest contiguous block the two sequences share, which for strings is the longest common substring.

from difflib import SequenceMatcher

string1 = 'one two three four'
string2 = 'one two nine ten'

match = SequenceMatcher(None, string1, string2).find_longest_match(
    0, len(string1), 0, len(string2)
)

print(match) # Output: Match(a=0, b=0, size=8)
print(string1[match.a:match.a + match.size]) # Output: one two (followed by a trailing space)
print(string2[match.b:match.b + match.size]) # Output: one two (followed by a trailing space)

  • SequenceMatcher(None, string1, string2): Creates a SequenceMatcher object to compare string1 and string2. The None argument means we're not using a custom "junk" detection function (which is fine for this task).
  • .find_longest_match(0, len(string1), 0, len(string2)): This finds the longest matching substring.
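    • The arguments are alo, ahi, blo, bhi. They define the ranges to compare: string1[alo:ahi] and string2[blo:bhi]. We're comparing the entire strings, so we pass 0 and the length of each string. In Python 3.9 and later these arguments have defaults that cover the full sequences, so they can be omitted.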
  • match object: The find_longest_match() method returns a Match object (a named tuple) with three attributes:
    • a: The starting index of the match in the first string (string1).
    • b: The starting index of the match in the second string (string2).
    • size: The length of the matching substring.
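
The steps above can be wrapped in a small helper. This is a minimal sketch, not a definitive implementation: the function name longest_common_substring is hypothetical, and passing autojunk=False is an assumption about what you want for long inputs (by default, SequenceMatcher may treat very frequent elements as junk once a sequence reaches 200 items, which can shorten matches in long strings).

```python
from difflib import SequenceMatcher


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    Hypothetical helper for illustration; returns '' when nothing matches.
    """
    # autojunk=False turns off the "popular element" heuristic, which
    # otherwise applies to sequences of 200 or more items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


# repr() makes leading and trailing spaces visible.
print(repr(longest_common_substring('one two three four', 'one two nine ten')))
# 'one two '
print(repr(longest_common_substring('abc', 'xyz')))
# ''
```

When the strings share nothing, find_longest_match() returns a match of size 0, so the slice is simply an empty string and no special case is needed.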

Extracting the Substring

We use string slicing (string1[match.a:match.a + match.size]) to extract the actual matching substring from either of the original strings.

The substring does not need to be at the beginning of the string. It works for any common substring:

from difflib import SequenceMatcher

string1 = 'four five one two three four'
string2 = 'zero eight one two nine ten'

match = SequenceMatcher(None, string1, string2).find_longest_match(
    0, len(string1), 0, len(string2)
)

print(match) # Output: Match(a=9, b=10, size=9)

# The match includes the space before 'one' and the space after 'two'.
print(string1[match.a:match.a + match.size]) # Output: one two (with surrounding spaces)
print(string2[match.b:match.b + match.size]) # Output: one two (with surrounding spaces)

Removing Leading/Trailing Whitespace

Spaces are characters too, so the longest match can begin or end with whitespace. To drop it from the extracted substring, call .strip() on the slice:

from difflib import SequenceMatcher

string1 = 'four five one two three four'
string2 = 'zero eight one two nine ten'

match = SequenceMatcher(None, string1, string2).find_longest_match(
    0, len(string1), 0, len(string2)
)

print(match) # Output: Match(a=9, b=10, size=9)

print(string1[match.a:match.a + match.size].strip()) # Output: one two
print(string2[match.b:match.b + match.size].strip()) # Output: one two

Finding the Longest Common Prefix with os.path.commonprefix

If you specifically need the longest common substring that starts at the beginning of the strings (the common prefix), use os.path.commonprefix():

import os

string1 = 'one two three four'
string2 = 'one two nine ten'
string3 = 'one two eight'

common_substring = os.path.commonprefix([string1, string2, string3])
print(common_substring) # Output: one two (followed by a trailing space)

  • os.path.commonprefix(): This function lives in os.path, but it compares the strings character by character rather than by path component, so it works on any strings, not only file paths. It takes a list of strings as input.
  • This approach only finds the common prefix. It won't find common substrings in the middle or at the end.
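
Two consequences of the points above can be shown in a short sketch. First, because the comparison is per character, the result for real paths may cut a path component in half. Second, a common suffix can be found by reversing the strings before calling os.path.commonprefix() and reversing the result. The common_suffix helper and the example values are hypothetical, chosen only for illustration.

```python
import os

paths = ['/usr/lib', '/usr/local']
# The comparison is per character, so the result need not end on a
# path separator.
print(os.path.commonprefix(paths))  # /usr/l


def common_suffix(strings):
    """Hypothetical helper: longest suffix shared by all strings."""
    reversed_strings = [s[::-1] for s in strings]
    # The common prefix of the reversed strings is the reversed suffix.
    return os.path.commonprefix(reversed_strings)[::-1]


print(common_suffix(['report_final.txt', 'notes_final.txt']))  # _final.txt
```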

Finding All Common Characters (with Duplicates)

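If you want the characters that appear in both strings, regardless of order or position, use a set intersection. Because sets discard duplicates, each common character appears once in the result; the list comprehension and for loop approaches below keep repeated characters from the first string: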

string1 = 'abcd'
string2 = 'abzx'
common_characters = ''.join(set(string1).intersection(string2))
print(common_characters) # Output: ab (order may vary)
print(len(common_characters)) # Output: 2

  • set(string1): Converts the first string to a set, which removes duplicate characters.
  • .intersection(string2): Returns the characters present in both. The method accepts any iterable, so string2 does not need to be converted to a set first.
  • ''.join(...): Joins the resulting set of characters back into a string. The order of characters in the output string is not guaranteed to match the original order.
  • len() gives the number of distinct common characters.
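
A set intersection reports each common character once. If repeated characters should count as many times as they occur in both strings, a multiset intersection with collections.Counter is one option. This is a minimal sketch with example strings chosen for illustration; sorting the result is an assumption made here to get a deterministic order.

```python
from collections import Counter

string1 = 'aabbc'
string2 = 'abbbd'

# Counter & Counter keeps each character with the smaller of its two counts.
common = Counter(string1) & Counter(string2)

# elements() repeats each character by its count; sorted() fixes the order.
common_characters = ''.join(sorted(common.elements()))
print(common_characters)  # abb
print(len(common_characters))  # 3
```

Here 'a' occurs twice in string1 but once in string2, so it is counted once, while 'b' occurs at least twice in both and is counted twice.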

Using List Comprehensions

string1 = 'abcd'
string2 = 'abzx'

common_characters = ''.join([
    char for char in string1
    if char in string2
])
print(common_characters) # Output: ab
print(len(common_characters)) # Output: 2

  • The list comprehension keeps each character of string1 that also occurs in string2. Unlike the set approach, it preserves the order of string1 and keeps repeated characters.

Using a for Loop

string1 = 'abcd'
string2 = 'abzx'
common_characters = []

for char in string1:
    if char in string2:
        common_characters.append(char)

print(common_characters) # Output: ['a', 'b']
print(''.join(common_characters)) # Output: ab

  • The for loop iterates over each character in string1 and checks whether it is also present in string2.
  • The common characters are added to the common_characters list.
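
For longer strings, the same filter can be sketched with a set for the membership test, since `char in string2` scans the string on every iteration. The example below uses strings chosen for illustration and also shows one way to remove repeats while keeping the order of first appearance; treat it as a sketch rather than the only way to do this.

```python
string1 = 'mississippi'
string2 = 'spam'

# Build the lookup set once; set membership is O(1) on average.
lookup = set(string2)

# Keeps order and repeats from string1.
with_repeats = ''.join(char for char in string1 if char in lookup)
print(with_repeats)  # msssspp

# dict.fromkeys() keeps the first occurrence of each key, in order.
unique_in_order = ''.join(dict.fromkeys(with_repeats))
print(unique_in_order)  # msp
```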