How to Remove URLs from Text in Python
Cleaning text data often involves removing unwanted URLs.
This guide explores various methods using Python's re module to efficiently remove URLs from strings. We'll cover using re.sub() for direct replacement and re.findall() with additional processing for more granular control.
Removing URLs with re.sub()
The re.sub() function replaces occurrences of a pattern with a specified replacement string. To remove URLs, replace every match of a URL pattern with an empty string.
import re
text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""
cleaned_text = re.sub(r'http\S+', '', text, flags=re.MULTILINE)
print(cleaned_text)
Output
First
Second
Third
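Explanation:
- re.sub() searches the text for the given regular expression pattern and replaces every occurrence with the replacement string (here an empty string, which removes the match).
- flags=re.MULTILINE only changes how the ^ and $ anchors behave. This pattern uses neither anchor, so the flag has no effect here and can be omitted. \S+ already stops at the end of each line because a newline is a whitespace character.
- The call removes every substring that matches the regex. Surrounding whitespace, such as the space after "First", is left in place.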
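Because re.MULTILINE only affects the ^ and $ anchors, it should make no difference for this pattern. The short check below runs the same substitution with and without the flag; the sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, used only for this illustration.
sample = "First https://example.com\nhttps://example.org Second\n"

with_flag = re.sub(r'http\S+', '', sample, flags=re.MULTILINE)
without_flag = re.sub(r'http\S+', '', sample)

# The pattern has no ^ or $ anchors, so both calls return the same string.
print(with_flag == without_flag)
print(repr(without_flag))
```

Printing with repr() makes the leftover spaces and newlines visible, which is handy when deciding whether the result needs further whitespace cleanup.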
1. Understanding the Regex Pattern
The regular expression r'http\S+' is used in re.sub() to match URL patterns. Let's break down the pattern:
- http: Matches the literal characters "http".
- \S: Matches any character that is not a whitespace character.
- +: Matches the preceding element (here \S) one or more times.
Therefore, r'http\S+' matches any substring starting with http followed by one or more non-whitespace characters.
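To see what the pattern captures in practice, re.findall() can list the matches without changing the text. The sentence below is a made-up example; it suggests that r'http\S+' also picks up trailing punctuation and any word that merely begins with "http".

```python
import re

# Hypothetical sample sentence, used only for this illustration.
sample = "See https://example.com/docs, the httpd manual, or http://example.org."

found = re.findall(r'http\S+', sample)

# Each match runs up to the next whitespace character, so the comma and
# the final period are included, and the word "httpd" matches as well.
print(found)
```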
2. Making the Regex More Specific
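The pattern r'http\S+' matches any word that starts with "http", not only URLs. To match only strings that begin with http:// or https://, use https?:// at the start of the pattern: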
import re
text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""
cleaned_text = re.sub(r'https?://\S+', '', text)
print(cleaned_text)
Output
First
Second
Third
Explanation:
- https? matches "http" or "https".
- :// matches the colon and double forward slashes that follow the scheme.
- \S+ matches one or more non-whitespace characters.
This pattern captures http:// and https:// URLs more precisely, because ordinary words that merely start with "http" no longer match.
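The difference between the two patterns shows up on text that contains a non-URL word starting with "http". The sketch below uses a made-up sentence and hypothetical variable names to compare them side by side.

```python
import re

# Hypothetical sample sentence, used only for this illustration.
sample = "Read the httpd manual at https://example.com/httpd today"

# The loose pattern also removes the plain word "httpd".
loose = re.sub(r'http\S+', '', sample)

# The stricter pattern requires "://" after the scheme, so "httpd" survives.
strict = re.sub(r'https?://\S+', '', sample)

print(repr(loose))
print(repr(strict))
```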
Removing URLs with re.findall() and replace()
The re.findall() function returns all non-overlapping matches of a pattern as a list, which you can then iterate over and replace:
import re
text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""
matches = re.findall(r'http\S+', text)
cleaned_text = text
for match in matches:
    cleaned_text = cleaned_text.replace(match, '')
print(cleaned_text)
Output
First
Second
Third
- re.findall() returns a list of all strings that match the given regex.
- The for loop iterates through the list and calls replace() to replace every URL in the text with an empty string.
- For this input, the result is the same as with re.sub(). Note that str.replace() removes every occurrence of the matched string, so the two approaches can differ when one URL is a prefix of another.
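One case where the loop and re.sub() can diverge is when one URL is a prefix of another, since str.replace() removes every occurrence of the shorter string first. The sketch below illustrates this on a made-up input and shows one possible workaround (replacing the longest matches first); the variable names are hypothetical.

```python
import re

# Hypothetical input in which the first URL is a prefix of the second.
text = "http://example.com http://example.com/page"

via_sub = re.sub(r'http\S+', '', text)

matches = re.findall(r'http\S+', text)
via_replace = text
for match in matches:
    # Removing "http://example.com" also cuts into the longer URL.
    via_replace = via_replace.replace(match, '')

# Replacing the longest matches first avoids that for this input.
longest_first = text
for match in sorted(set(matches), key=len, reverse=True):
    longest_first = longest_first.replace(match, '')

print(repr(via_sub))
print(repr(via_replace))
print(repr(longest_first))
```

When no per-URL logic is needed, re.sub() is the simpler choice because it replaces each match exactly where it was found.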
Conditional URL Removal
Using re.findall() in combination with conditional logic gives you more control over which URLs to remove. This is useful when you want to remove only some URLs.
import re
text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""
new_text = text
matches = re.findall(r'http\S+', text)
for match in matches:
    if 'google' not in match:
        new_text = new_text.replace(match, '')
print(new_text)
Output
First
https://google.com Second
Third
- The for loop iterates through the matched URLs and removes every URL that does not contain the substring "google".
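As an alternative to the findall() loop, re.sub() also accepts a function as the replacement argument and calls it once per match. The sketch below applies the same "keep google" rule that way; the helper name drop_unless_google is hypothetical, and this is a different technique from the loop above rather than a restatement of it.

```python
import re

text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""

def drop_unless_google(match):
    """Return the URL unchanged if it contains 'google', else an empty string."""
    url = match.group(0)
    return url if 'google' in url else ''

# re.sub() calls the function for each match and inserts its return value.
new_text = re.sub(r'http\S+', drop_unless_google, text)
print(new_text)
```

Because each match is handled in place, this variant does not depend on str.replace() and so is unaffected by repeated or overlapping URL strings.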