How to get the base of a URL in Python
When working with URLs in Python, you often need to isolate the base URL (e.g., the domain name with scheme, like https://tutorialreference.com).
This tutorial shows how to extract the base URL from a complete URL using the urllib.parse module, along with techniques for trimming URL paths.
Extracting the Base URL with urlparse
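The urllib.parse module provides the urlparse() function to break down a URL into its components. The netloc attribute of the parsed URL holds the network location (the domain name, plus the port if one is given). It does not include the scheme, so to get the full base URL you combine the scheme attribute, '://', and netloc.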
from urllib.parse import urlparse
my_url = 'https://tutorialreference.com/images/wallpaper.jpg'
parsed_url = urlparse(my_url)
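domain = parsed_url.netloc
print(domain) # Output: tutorialreference.com
base_url = f'{parsed_url.scheme}://{parsed_url.netloc}'
print(base_url) # Output: https://tutorialreference.com
This snippet reads the domain name from netloc and then prepends the scheme to build the full base URL.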
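To get the scheme and domain together as one string in several places, a small helper can rebuild the URL from its parsed parts. The sketch below is one possible approach, not the only one: the function name get_base_url is hypothetical, and it uses urlsplit() and urlunsplit() from urllib.parse to drop the path, query, and fragment. It assumes the input is an absolute URL and raises ValueError otherwise.

```python
from urllib.parse import urlsplit, urlunsplit


def get_base_url(url):
    """Return scheme://netloc for an absolute URL (hypothetical helper)."""
    parts = urlsplit(url)
    # A URL such as 'tutorialreference.com/images' has no scheme;
    # urlsplit() then leaves netloc empty and treats it all as the path.
    if not parts.scheme or not parts.netloc:
        raise ValueError(f'not an absolute URL: {url!r}')
    # Keep scheme and netloc; blank out path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, '', '', ''))


print(get_base_url('https://tutorialreference.com/images/wallpaper.jpg?size=large#top'))
# Output: https://tutorialreference.com
print(get_base_url('http://localhost:8000/api/v1'))
# Output: http://localhost:8000
```

Rejecting URLs without a scheme is a design choice for this sketch; you may prefer to assume a default scheme instead.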
Understanding the ParseResult Object
The urlparse() function returns a ParseResult object containing the components of the URL: scheme, netloc (the domain and optional port), path, params, query, and fragment:
from urllib.parse import urlparse
my_url = 'https://tutorialreference.com/images/wallpaper.jpg'
parsed_url = urlparse(my_url)
print(parsed_url)
# Output: ParseResult(scheme='https', netloc='tutorialreference.com', path='/images/wallpaper.jpg', params='', query='', fragment='')
You can access these components using their corresponding attributes:
from urllib.parse import urlparse
my_url = 'https://tutorialreference.com/images/wallpaper.jpg'
parsed = urlparse(my_url)
base = parsed.netloc
print(base) # Output: tutorialreference.com
path = parsed.path
print(path) # Output: /images/wallpaper.jpg
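Beyond netloc and path, the result exposes a few more attributes, and because ParseResult is a named tuple it also supports indexing. The sketch below uses a made-up URL (example.com with a port, a query, and a fragment) purely for illustration.

```python
from urllib.parse import urlparse

# Illustrative URL, not taken from the examples above.
parsed = urlparse('https://example.com:8443/search?q=python#results')

print(parsed.scheme)    # Output: https
print(parsed.netloc)    # Output: example.com:8443
print(parsed.hostname)  # Output: example.com
print(parsed.port)      # Output: 8443
print(parsed.query)     # Output: q=python
print(parsed.fragment)  # Output: results

# ParseResult is a named tuple, so positional access works too.
print(parsed[1] == parsed.netloc)  # Output: True
```

Note that netloc keeps the port while hostname does not, so pick the attribute that matches what you mean by "base".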
Excluding Path Portions from the Result
Often, you'll want the base URL along with only the initial part of the path, without the file name at the end. You can achieve this with string manipulation using rsplit() or split():
from urllib.parse import urlparse
my_url = 'https://tutorialreference.com/images/wallpaper.jpg'
parsed_url = urlparse(my_url)
base = parsed_url.netloc
path = parsed_url.path
with_path = base + '/'.join(path.rsplit('/', 1)[:-1])
print(with_path) # Output: tutorialreference.com/images
The rsplit('/', 1) call splits the path string from the right, at most once, and returns a list. Slicing with [:-1] keeps every element except the last one (here, the file name). Finally, '/'.join() combines the remaining items into a string, which is appended to the base URL.
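To make each step of that expression visible, the sketch below runs them one at a time on the same path and prints the intermediate values. The variable names parts, kept, and parent are only for illustration. The last line shows posixpath.dirname() from the standard library, which gives the same result for this input and is mentioned here only as an alternative.

```python
import posixpath

path = '/images/wallpaper.jpg'

parts = path.rsplit('/', 1)  # split once, starting from the right
print(parts)                 # Output: ['/images', 'wallpaper.jpg']

kept = parts[:-1]            # drop the last element (the file name)
print(kept)                  # Output: ['/images']

parent = '/'.join(kept)      # join what is left back into a string
print(parent)                # Output: /images

# Standard-library alternative that yields the same parent path here.
print(posixpath.dirname(path))  # Output: /images
```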
Handling Deeply Nested URL Paths
If your URLs have deeply nested paths, you can increase the maxsplit argument of rsplit() to remove more levels:
from urllib.parse import urlparse
my_url = 'https://tutorialreference.com/images/nature/wallpaper.jpg'
parsed_url = urlparse(my_url)
base = parsed_url.netloc
path = parsed_url.path
with_path = base + path.rsplit('/', 2)[0]
print(with_path) # Output: tutorialreference.com/images
The rsplit('/', 2) call splits the path string at most twice from the right. Index [0] returns the string containing all of the path except the last two segments (here, the nature folder and the file name).
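The number of levels to remove can also be turned into a parameter. The following is a minimal sketch under a few assumptions: the helper name trim_path and its levels argument are hypothetical, and unlike the examples above it also prepends the scheme so the result is a complete URL.

```python
from urllib.parse import urlparse


def trim_path(url, levels=1):
    """Drop the last `levels` path segments and return scheme://netloc/path.

    Hypothetical helper; `levels` must be zero or greater.
    """
    if levels < 0:
        raise ValueError('levels must be zero or greater')
    parsed = urlparse(url)
    # rsplit(..., levels)[0] is everything left of the last `levels` slashes.
    trimmed = parsed.path.rsplit('/', levels)[0]
    return f'{parsed.scheme}://{parsed.netloc}{trimmed}'


my_url = 'https://tutorialreference.com/images/nature/wallpaper.jpg'
print(trim_path(my_url, 1))  # Output: https://tutorialreference.com/images/nature
print(trim_path(my_url, 2))  # Output: https://tutorialreference.com/images
print(trim_path(my_url, 3))  # Output: https://tutorialreference.com
```

If levels is larger than the number of slashes in the path, the whole path is dropped and only the base URL remains.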