How to Remove URLs from Text in Python

Cleaning text data often involves removing unwanted URLs.

This guide explores various methods using Python's re module to efficiently remove URLs from strings. We'll cover using re.sub() for direct replacement and re.findall() with additional processing for more granular control.

Removing URLs with `re.sub()`

The re.sub() method replaces occurrences of a pattern with a specified replacement string. To remove URLs, replace URL patterns with an empty string.

import re

text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""

cleaned_text = re.sub(r'http\S+', '', text, flags=re.MULTILINE)
print(cleaned_text)

Output

First 
 Second
Third 

Explanation:

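re.sub() searches the text for the given regular expression pattern and replaces every occurrence with the replacement string (here an empty string, which removes the match).
flags=re.MULTILINE only changes how the ^ and $ anchors behave. This pattern uses neither anchor, so the flag has no effect here and can be omitted. \S+ already stops at the end of each line because a newline is a whitespace character.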

This directly removes all occurrences of substrings that match the given regex.
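
The output above still carries the spaces that surrounded each URL. A minimal sketch of one way to tidy them, assuming you want runs of spaces collapsed and each line trimmed; the helper name remove_urls is hypothetical and not part of the re module:

```python
import re

def remove_urls(text):
    """Hypothetical helper: drop http-prefixed tokens, then tidy leftover spaces."""
    without_urls = re.sub(r'http\S+', '', text)
    tidy_lines = []
    for line in without_urls.splitlines():
        # Collapse runs of spaces/tabs, then trim both ends of the line.
        tidy_lines.append(re.sub(r'[ \t]+', ' ', line).strip())
    return '\n'.join(tidy_lines)

text = (
    "First https://tutorialreference.com\n"
    "https://google.com Second\n"
    "Third https://tutorialreference.com"
)
print(remove_urls(text))
```

Whether to trim is a judgment call: if the original spacing matters (for example in fixed-width logs), keep the plain re.sub() call instead.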

1. Understanding the Regex Pattern

The regular expression r'http\S+' is used in re.sub() to match URL patterns. Let's break down the pattern:

http: Matches the literal characters "http".
\S: Matches any character that is not a whitespace character.
+: Matches the preceding token (here \S, a non-whitespace character) one or more times.

Therefore, r'http\S+' matches any substring starting with http followed by one or more non-whitespace characters.
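
To see what this means in practice, the sketch below runs the pattern through re.findall() on an illustrative sentence (the sample text is made up for this example). Note that the pattern also picks up an ordinary word that happens to start with "http":

```python
import re

# Illustrative sample: one real URL and one plain word starting with "http".
sample = "Visit https://example.com/path?q=1 or restart httpd today."

matches = re.findall(r'http\S+', sample)
print(matches)  # ['https://example.com/path?q=1', 'httpd']
```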

2. Making the Regex More Specific

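The pattern r'http\S+' already matches both http and https, but it also matches any other token that starts with "http". To match only URLs that begin with http:// or https://, use r'https?://\S+':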

import re

text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""

cleaned_text = re.sub(r'https?://\S+', '', text)
print(cleaned_text)

Output

First 
 Second
Third 

Explanation:

https? matches "http" or "https"; the ? makes the preceding s optional.
:// matches the colon and the two forward slashes that follow the scheme in a URL.
\S+ matches any sequence of one or more non-whitespace characters.

This will more precisely capture both http:// and https:// URLs.
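
A short sketch of the stricter pattern on an illustrative sentence. The re.IGNORECASE flag is an addition here, assuming you also want uppercase schemes such as HTTPS:// to count; drop it if you do not:

```python
import re

# Illustrative sample: an uppercase scheme, a lowercase one, and a non-URL word.
sample = "Docs: HTTPS://Example.com/a and http://example.org/b plus the httpd daemon."

# Assumption: scheme matching should be case-insensitive.
strict = re.compile(r'https?://\S+', flags=re.IGNORECASE)

found = strict.findall(sample)
cleaned = strict.sub('', sample)

print(found)    # ['HTTPS://Example.com/a', 'http://example.org/b']
print(cleaned)  # the word "httpd" is left untouched
```

Because \S+ runs up to the next whitespace, punctuation that directly follows a URL (such as a trailing comma or period) is removed along with it.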

Removing URLs with `re.findall()` and `replace()`

The re.findall() method returns all occurrences of matching patterns in a list, which you can then iterate over and replace:

import re

text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""

matches = re.findall(r'http\S+', text)
cleaned_text = text
for match in matches:
    cleaned_text = cleaned_text.replace(match, '')
print(cleaned_text)

Output

First 
 Second
Third 

re.findall() returns a list of all strings that match the given regex.
The for loop iterates through the list and calls replace() to replace every URL in the original text with an empty string.
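For typical input this produces the same result as re.sub(). The results can differ when one matched URL is a prefix of another, because replace() removes every occurrence of the substring wherever it appears in the text.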
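
One edge case worth knowing: when a shorter URL appears before a longer URL that starts with the same characters, the loop clips the longer one. The sketch below uses made-up URLs to show the difference, along with one possible workaround (replacing unique matches longest-first). It is an illustration under those assumptions, not the only fix:

```python
import re

text = "https://a.com https://a.com/page"
pattern = r'http\S+'

# Loop as in the section above: the shorter URL is replaced first,
# which also removes the start of the longer URL.
looped = text
for match in re.findall(pattern, text):
    looped = looped.replace(match, '')

# Workaround sketch: handle unique matches from longest to shortest.
ordered = text
for match in sorted(set(re.findall(pattern, text)), key=len, reverse=True):
    ordered = ordered.replace(match, '')

print(repr(looped))                     # ' /page'
print(repr(re.sub(pattern, '', text)))  # ' '
print(repr(ordered))                    # ' '
```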

Conditional URL Removal

Using re.findall() in combination with conditional logic gives you control over which URLs to remove. This is useful when you want to remove only some of them.

import re

text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""

new_text = text
matches = re.findall(r'http\S+', text)

for match in matches:
    if 'google' not in match:
        new_text = new_text.replace(match, '')
print(new_text)

Output

First 
https://google.com Second
Third 

The for loop iterates through the matched URLs and removes every URL that does not contain the substring "google".
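
The same filtering can also be done in a single pass, since re.sub() accepts a function as the replacement: it is called once per match and its return value is substituted. A minimal sketch, where the function name drop_unless_google is hypothetical:

```python
import re

text = """
First https://tutorialreference.com
https://google.com Second
Third https://tutorialreference.com
"""

def drop_unless_google(match):
    """Hypothetical callback: keep URLs containing 'google', drop the rest."""
    url = match.group(0)
    return url if 'google' in url else ''

new_text = re.sub(r'http\S+', drop_unless_google, text)
print(new_text)
```

Because each match is replaced at its own position, this approach does not depend on the order of the matches or on one URL being a substring of another. The substring test is deliberately simple: it keeps any URL with "google" anywhere in it, including in the path.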
