How to Remove URLs from Text in Python

Cleaning text data often involves removing unwanted URLs.

This guide explores various methods using Python's re module to efficiently remove URLs from strings. We'll cover using re.sub() for direct replacement and re.findall() with additional processing for more granular control.

Removing URLs with re.sub()

The re.sub() method replaces occurrences of a pattern with a specified replacement string. To remove URLs, replace URL patterns with an empty string.

import re

text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""

cleaned_text = re.sub(r'http\S+', '', text, flags=re.MULTILINE)
print(cleaned_text)

Output:

First 
Second
Third

Explanation:

  • re.sub() searches the text for the given regular expression pattern and replaces every occurrence with the replacement string (here an empty string, which removes the match).
  • flags=re.MULTILINE only changes how the ^ and $ anchors behave. This pattern uses neither, so the flag has no effect and can be omitted; re.sub() scans the whole string, line breaks included, either way.

This directly removes all occurrences of substrings that match the given regex.
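Removing a URL leaves behind the space that separated it from the neighbouring word. As a minimal sketch of one way to tidy that up, the snippet below wraps the substitution in a helper and strips each line afterwards. The names URL_PATTERN and remove_urls are illustrative, not part of the re module.

```python
import re

# Hypothetical helper names, chosen for this example only.
URL_PATTERN = re.compile(r'http\S+')

def remove_urls(text):
    """Remove URL-like tokens, then trim the spaces left behind on each line."""
    without_urls = URL_PATTERN.sub('', text)
    return '\n'.join(line.strip() for line in without_urls.splitlines())

sample = "First https://tutorialreference.com\nhttps://google.com Second"
print(remove_urls(sample))
# First
# Second
```

Compiling the pattern once is optional, but it keeps the regex in one place if the helper is called many times.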

1. Understanding the Regex Pattern

The regular expression r'http\S+' is used in re.sub() to match URL patterns. Let's break down the pattern:

  • http: Matches the literal characters "http".
  • \S: Matches any character that is not a whitespace character.
  • +: Matches the preceding token (here \S, a non-whitespace character) one or more times.

Therefore, r'http\S+' matches any substring starting with http followed by one or more non-whitespace characters.
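To see what the pattern actually captures, it can help to list the matches with re.findall() before removing anything. The sample sentence below is made up for illustration. Note that \S+ keeps going until it reaches whitespace, so punctuation that directly follows a URL becomes part of the match.

```python
import re

sample = "See https://example.com/docs, or http://example.org."
matches = re.findall(r'http\S+', sample)
print(matches)
# ['https://example.com/docs,', 'http://example.org.']
```

Both the http and the https URL are matched, and the trailing comma and period are included because they are not whitespace.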

2. Making the Regex More Specific

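r'http\S+' already matches both http and https URLs, but it also matches any other token that begins with "http". To make the pattern stricter, require the scheme separator as well by using https?://\S+: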

import re

text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""

cleaned_text = re.sub(r'https?://\S+', '', text)
print(cleaned_text)

Output:

First 
Second
Third

Explanation:

  • https? matches "http" or "https" (the ? makes the preceding "s" optional).
  • :// matches the colon and two forward slashes that follow the scheme in a URL.
  • \S+ matches one or more non-whitespace characters.

This will more precisely capture both http:// and https:// URLs.
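The difference between the two patterns shows up on text that contains a word starting with "http" that is not a URL. The sample string here is an assumption made for the comparison.

```python
import re

sample = "Restart httpd, then open http://example.com and https://example.org"

loose = re.findall(r'http\S+', sample)
strict = re.findall(r'https?://\S+', sample)

print(loose)   # ['httpd,', 'http://example.com', 'https://example.org']
print(strict)  # ['http://example.com', 'https://example.org']
```

With the looser pattern, re.sub() would also delete "httpd,"; the stricter pattern leaves it alone.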

Removing URLs with re.findall() and replace()

The re.findall() method returns all occurrences of matching patterns in a list, which you can then iterate over and replace:

import re

text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""

matches = re.findall(r'http\S+', text)
cleaned_text = text
for match in matches:
    # Remove every occurrence of this URL from the working copy
    cleaned_text = cleaned_text.replace(match, '')
print(cleaned_text)

Output:

First 
Second
Third
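
Explanation:

  • re.findall() returns a list of all substrings that match the given regex.
  • The for loop iterates through the list and calls replace() to remove each URL from the working copy of the text.
  • For input like this, the result is the same as with re.sub(). The two can differ in edge cases, because str.replace() removes every occurrence of the matched string wherever it appears, including inside a longer URL.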
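One case where the loop and re.sub() can give different results is when one URL is a prefix of another. The short comparison below uses made-up URLs to show it.

```python
import re

sample = "https://example.com and https://example.com/docs"

# Single pass: each match is removed exactly where it was found.
via_sub = re.sub(r'http\S+', '', sample)

# Loop: str.replace() removes the matched string everywhere it occurs,
# including at the start of the longer URL.
via_replace = sample
for match in re.findall(r'http\S+', sample):
    via_replace = via_replace.replace(match, '')

print(repr(via_sub))      # ' and '
print(repr(via_replace))  # ' and /docs'
```

If the text may contain URLs like these, re.sub() is the safer choice for removing everything.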

Conditional URL Removal

Using re.findall() in combination with conditional logic enables more control over which URLs to remove. This is useful when you want to only remove some URLs.

import re

text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""

new_text = text
matches = re.findall(r'http\S+', text)

for match in matches:
    # Keep URLs that contain "google"; remove all others
    if 'google' not in match:
        new_text = new_text.replace(match, '')
print(new_text)

Output:

First 
https://google.com Second
Third

Explanation:

  • The for loop iterates through the matched URLs and removes every URL that does not contain the substring "google".
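The same filtering can also be done in a single pass, because re.sub() accepts a function as the replacement and calls it once per match. This is a sketch of that alternative, not a replacement for the loop above. The function name drop_unless_google is illustrative.

```python
import re

text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""

def drop_unless_google(match):
    # match is an re.Match object; group(0) is the full matched URL.
    url = match.group(0)
    # Returning the URL keeps it; returning '' removes it.
    return url if 'google' in url else ''

new_text = re.sub(r'http\S+', drop_unless_google, text)
print(new_text)
```

Each decision is made on the match at its own position, so a kept URL is never affected by the removal of another one.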