Skip to main content

Python Pandas: How to Replace Entire String in Column if Contains Substring

When cleaning or transforming text data in Pandas, a common task is to replace the entire string value in a column if it contains a specific substring. For example, you might want to standardize job titles by replacing any title containing "Dev" or "Engineer" with a generic "Developer" or "Engineering Role".

This guide explains several effective methods to achieve this conditional whole-string replacement in Pandas, using str.contains(), str.replace() with regex, and Series.apply().

The Goal: Conditional Full String Replacement

Given a Pandas DataFrame and a specific column containing string values, we want to identify rows where the string in that column includes a certain substring. For all such identified rows, we want to replace the entire string value in that column with a new, specified string.

Example DataFrame

import pandas as pd
import re # For regex flags

data = {
'EmployeeID': [101, 102, 103, 104, 105, 106],
'JobTitle_Original': [
'Software Developer',
'Senior Web Developer',
'Data Analyst',
'DevOps Engineer',
'Project Manager (Dev)',
'Lead Analyst'
],
'Department': ['Tech', 'Tech', 'Analytics', 'Tech', 'Management', 'Analytics']
}

df_original = pd.DataFrame(data)
print("Original DataFrame:")
print(df_original)

Output:

Original DataFrame:
EmployeeID JobTitle_Original Department
0 101 Software Developer Tech
1 102 Senior Web Developer Tech
2 103 Data Analyst Analytics
3 104 DevOps Engineer Tech
4 105 Project Manager (Dev) Management
5 106 Lead Analyst Analytics

Our goal is to replace any JobTitle_Original containing "Dev" or "Developer" (case-insensitive) with a standardized "Software Engineering Role".

This is often the most readable and idiomatic Pandas approach.

  1. Use Series.str.contains(substring) to create a boolean mask identifying rows where the column contains the substring.
  2. Use DataFrame.loc[mask, 'ColumnName'] = new_value to assign the new value to the specified column for the matched rows.

Case-Sensitive Replacement

import pandas as pd

df = df_original.copy() # Work on a copy

substring_to_find = 'Dev' # Case-sensitive
new_value = 'Standardized Developer Role'

# Create a boolean mask for rows where 'JobTitle_Original' contains 'Dev'
mask = df['JobTitle_Original'].str.contains(substring_to_find, na=False) # na=False treats NaN as no match
print("Boolean mask for 'Dev':")
print(mask)
print()

# ✅ Use .loc to replace the entire string in 'JobTitle_Original' where mask is True
df.loc[mask, 'JobTitle_Original'] = new_value

print("DataFrame after case-sensitive replacement using str.contains():")
print(df)

Output:

Boolean mask for 'Dev':
0 True
1 True
2 False
3 True
4 True
5 False
Name: JobTitle_Original, dtype: bool

DataFrame after case-sensitive replacement using str.contains():
EmployeeID JobTitle_Original Department
0 101 Standardized Developer Role Tech
1 102 Standardized Developer Role Tech
2 103 Data Analyst Analytics
3 104 Standardized Developer Role Tech
4 105 Standardized Developer Role Management
5 106 Lead Analyst Analytics
  • na=False in str.contains() ensures that if your column has NaN values, they don't cause errors and are treated as not containing the substring.

Case-Insensitive Replacement (case=False)

Set the case=False parameter in str.contains().

import pandas as pd

df = df_original.copy()

substring_to_find_ci = 'dev' # Search for 'dev' case-insensitively
new_value_ci = 'Software Engineering Role'

# ✅ Case-insensitive mask
mask_ci = df['JobTitle_Original'].str.contains(substring_to_find_ci, case=False, na=False)
print("Boolean mask for 'dev' (case-insensitive):")
print(mask_ci)
print()

df.loc[mask_ci, 'JobTitle_Original'] = new_value_ci

print("DataFrame after case-insensitive replacement using str.contains():")
print(df)

Output:

Boolean mask for 'dev' (case-insensitive):
0 True
1 True
2 False
3 True
4 True
5 False
Name: JobTitle_Original, dtype: bool

DataFrame after case-insensitive replacement using str.contains():
EmployeeID JobTitle_Original Department
0 101 Software Engineering Role Tech
1 102 Software Engineering Role Tech
2 103 Data Analyst Analytics
3 104 Software Engineering Role Tech
4 105 Software Engineering Role Management
5 106 Lead Analyst Analytics

Method 2: Using Series.str.replace() with Regular Expressions (Regex)

The Series.str.replace(pattern, replacement_string, regex=True) method can replace occurrences of a pattern (which can be a regex) with replacement_string. To replace the entire string if a substring is found, the regex pattern needs to match the whole string while ensuring the substring is present.

Case-Sensitive Replacement

import pandas as pd

df = df_original.copy()

# Regex: '^.*Dev.*$'
# ^ : matches the start of the string
# .* : matches any character (except newline) zero or more times
# Dev : the literal substring to find
# .* : matches any character zero or more times again
# $ : matches the end of the string
# This pattern ensures the *entire string* contains 'Dev'
regex_pattern = r'^.*Dev.*$'
new_value_regex = 'Standardized Developer (Regex)'

# ✅ Replace entire string if it matches the regex pattern
df['JobTitle_Original'] = df['JobTitle_Original'].str.replace(
regex_pattern,
new_value_regex,
regex=True
)

print("DataFrame after case-sensitive replacement using str.replace() with regex:")
print(df)

# A more general regex to match 'Dev' anywhere: r'.*Dev.*' and then use a conditional replacement
# or use the str.contains approach for simplicity if just replacing the whole string.
# For str.replace to replace the *whole* string, the regex must match the *whole* string.
# If the regex r'Dev' was used, only 'Dev' would be replaced, not the whole string.

Output:

DataFrame after case-sensitive replacement using str.replace() with regex:
EmployeeID JobTitle_Original Department
0 101 Standardized Developer (Regex) Tech
1 102 Standardized Developer (Regex) Tech
2 103 Data Analyst Analytics
3 104 Standardized Developer (Regex) Tech
4 105 Standardized Developer (Regex) Management
5 106 Lead Analyst Analytics
note

The regex r'(^.*Dev.*$)' is designed to match the entire string if "Dev" is present. If you just used r'Dev', only the "Dev" part would be replaced. For whole string replacement based on a substring, Method 1 (str.contains + .loc) is often simpler.

Case-Insensitive Replacement (flags=re.IGNORECASE)

Use the flags parameter with re.IGNORECASE.

import pandas as pd
import re

df = df_original.copy()

regex_pattern_ci = r'^.*dev.*$' # Use lowercase in pattern for consistency with IGNORECASE
new_value_regex_ci = 'Software Engineering (Regex CI)'

df['JobTitle_Original'] = df['JobTitle_Original'].str.replace(
regex_pattern_ci,
new_value_regex_ci,
regex=True,
flags=re.IGNORECASE # ✅ Case-insensitive flag
)

print("DataFrame after case-insensitive replacement (str.replace() with regex):")
print(df)

Output:

DataFrame after case-insensitive replacement (str.replace() with regex):
EmployeeID JobTitle_Original Department
0 101 Software Engineering (Regex CI) Tech
1 102 Software Engineering (Regex CI) Tech
2 103 Data Analyst Analytics
3 104 Software Engineering (Regex CI) Tech
4 105 Software Engineering (Regex CI) Management
5 106 Lead Analyst Analytics

Method 3: Using Series.apply() with a Lambda Function

The Series.apply() method can apply a custom function (or a lambda function) to each element of the Series.

Case-Sensitive Replacement

import pandas as pd

df = df_original.copy()

substring_to_check = 'Dev'
replacement_string = 'Developer via Apply'

# ✅ Apply a lambda function to check and replace
df['JobTitle_Original'] = df['JobTitle_Original'].apply(
lambda current_string: replacement_string if isinstance(current_string, str) and substring_to_check in current_string else current_string
)

print("DataFrame after case-sensitive replacement using apply():")
print(df)

Output:

DataFrame after case-sensitive replacement using apply():
EmployeeID JobTitle_Original Department
0 101 Developer via Apply Tech
1 102 Developer via Apply Tech
2 103 Data Analyst Analytics
3 104 Developer via Apply Tech
4 105 Developer via Apply Management
5 106 Lead Analyst Analytics
  • isinstance(current_string, str): Good practice to check if the element is a string before applying string methods, especially if the column might contain NaN or other types.

Case-Insensitive Replacement (using .lower())

Convert both the current string and the substring to a consistent case (e.g., lowercase) for comparison.

import pandas as pd

df = df_original.copy()

substring_lower = 'dev'
replacement_ci_apply = 'Software Engineer (Apply CI)'

df['JobTitle_Original'] = df['JobTitle_Original'].apply(
lambda current_string: replacement_ci_apply if isinstance(current_string, str) and substring_lower in current_string.lower() else current_string
)

print("DataFrame after case-insensitive replacement using apply():")
print(df)

Output:

DataFrame after case-insensitive replacement using apply():
EmployeeID JobTitle_Original Department
0 101 Software Engineer (Apply CI) Tech
1 102 Software Engineer (Apply CI) Tech
2 103 Data Analyst Analytics
3 104 Software Engineer (Apply CI) Tech
4 105 Software Engineer (Apply CI) Management
5 106 Lead Analyst Analytics
note

apply() is very flexible but can be less performant than vectorized string methods (.str.contains, .str.replace) for very large DataFrames.

Choosing the Right Method

  • Series.str.contains() with df.loc[] assignment (Method 1): Generally the most recommended for this specific task. It's clear, idiomatic Pandas, and handles case-insensitivity easily. Performance is usually very good.
  • Series.str.replace() with regex (Method 2): More powerful if you need complex pattern matching to define what part of the string to look for before replacing the whole string. Constructing a regex that correctly matches the entire string when a substring is present (like r'^.*substring.*$') is key.
  • Series.apply() (Method 3): Most flexible for complex custom logic that goes beyond simple substring checks or regex. However, it can be slower on large datasets compared to the vectorized .str methods.

Conclusion

To replace an entire string value in a Pandas DataFrame column if it contains a specific substring:

  1. Use df.loc[df['column'].str.contains(substring, case=..., na=False), 'column'] = new_value. This is typically the cleanest and most efficient approach.
  2. Alternatively, df['column'].str.replace(r'^.*substring_pattern.*$', new_value, regex=True, flags=...) can be used with a carefully crafted regular expression that matches the entire string.
  3. df['column'].apply(lambda x: ...) provides maximum flexibility for custom replacement logic but might be less performant.

Remember to handle case sensitivity (case=False or flags=re.IGNORECASE) and potential NaN values (na=False in str.contains) as needed for your specific use case.