How to get the base of a URL in Python

When working with URLs in Python, you often need to isolate the base URL (e.g., the domain name with scheme, like https://tutorialreference.com).

This tutorial will guide you through the process of extracting the base URL from a complete URL using the urllib.parse module, along with techniques for handling URL paths.

Extracting the Base URL with `urlparse`

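The urllib.parse module provides the urlparse() function to break down a URL into its components. The netloc attribute of the parsed URL holds the network location, that is, the domain name (plus the port, if one is given). Note that netloc does not include the scheme, which is stored separately in the scheme attribute.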

from urllib.parse import urlparse

my_url = 'https://tutorialreference.com/images/wallpaper.jpg'
parsed_url = urlparse(my_url)
base_url = parsed_url.netloc
print(base_url)  # Output: tutorialreference.com

This snippet extracts only the domain name. The scheme (https here) is not part of netloc, so it has to be read from parsed_url.scheme if you need it.
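To get a base URL that does include the scheme, the scheme and netloc attributes can be combined. The following is a minimal sketch; the helper name get_base_url is hypothetical and not part of urllib.parse, and it assumes the input is an absolute URL with a scheme.

```python
from urllib.parse import urlparse


def get_base_url(url):
    """Return scheme + '://' + netloc (hypothetical helper, for illustration)."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}"


my_url = 'https://tutorialreference.com/images/wallpaper.jpg'
base_url = get_base_url(my_url)
print(base_url)  # Output: https://tutorialreference.com
```

If the input has no scheme (for example 'tutorialreference.com/images'), urlparse() leaves both scheme and netloc empty, so this sketch would return only '://'. Validate the input first if that can happen.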

Understanding the `ParseResult` Object

The urlparse() function returns a ParseResult object containing the six components of the URL: scheme, netloc (the network location, i.e. the domain), path, params, query, and fragment:

from urllib.parse import urlparse

my_url = 'https://tutorialreference.com/images/wallpaper.jpg'
parsed_url = urlparse(my_url)
print(parsed_url)
# Output: ParseResult(scheme='https', netloc='tutorialreference.com', path='/images/wallpaper.jpg', params='', query='', fragment='')

You can access these components using their corresponding attributes:

from urllib.parse import urlparse

my_url = 'https://tutorialreference.com/images/wallpaper.jpg'
parsed = urlparse(my_url)

base = parsed.netloc
print(base)  # Output: tutorialreference.com

path = parsed.path
print(path)  # Output: /images/wallpaper.jpg
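ParseResult is a named tuple, so besides attribute access it can also be unpacked, and it offers a few convenience attributes such as hostname and port. The sketch below uses a made-up URL with a port, a query string, and a fragment (all assumptions added for illustration) to show how netloc differs from hostname.

```python
from urllib.parse import urlparse

# Hypothetical URL with a port, query string, and fragment, for illustration only
my_url = 'https://tutorialreference.com:8080/images/wallpaper.jpg?size=large#top'
parsed = urlparse(my_url)

# A ParseResult unpacks into its six components, in this order
scheme, netloc, path, params, query, fragment = parsed

print(netloc)           # Output: tutorialreference.com:8080
print(parsed.hostname)  # Output: tutorialreference.com
print(parsed.port)      # Output: 8080
print(query)            # Output: size=large
print(fragment)         # Output: top
```

When the URL contains a port, netloc includes it, while hostname gives the domain alone.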

Excluding Path Portions from the Result

Often, you'll want the domain together with the path up to, but not including, the final segment (typically the file name). You can achieve this with string manipulation using rsplit() or split():

from urllib.parse import urlparse

my_url = 'https://tutorialreference.com/images/wallpaper.jpg'
parsed_url = urlparse(my_url)

base = parsed_url.netloc
path = parsed_url.path

with_path = base + '/'.join(path.rsplit('/', 1)[:-1])
print(with_path)  # Output: tutorialreference.com/images

Note

The rsplit('/', 1) call splits the path string from the right, at most once, and returns a list. Slicing with [:-1] keeps every element except the last one. Finally, the remaining items are joined into a string with the '/' character and appended to the domain.
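To make the steps concrete, the sketch below prints the intermediate values of the rsplit() approach for the same URL. As an addition that is not in the example above, it also prepends the scheme so that the result is a complete URL; the variable names parts, parent_path, and with_scheme are chosen here for illustration.

```python
from urllib.parse import urlparse

my_url = 'https://tutorialreference.com/images/wallpaper.jpg'
parsed_url = urlparse(my_url)

# Split once from the right: everything before the last '/' and the last segment
parts = parsed_url.path.rsplit('/', 1)
print(parts)  # Output: ['/images', 'wallpaper.jpg']

# Drop the last segment and join what is left
parent_path = '/'.join(parts[:-1])
print(parent_path)  # Output: /images

# Prepend scheme and domain to get a complete URL (an addition for illustration)
with_scheme = f"{parsed_url.scheme}://{parsed_url.netloc}{parent_path}"
print(with_scheme)  # Output: https://tutorialreference.com/images
```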

Handling Deeply Nested URL Paths

If your URLs have deeply nested paths, you can increase the second argument of rsplit() (the maximum number of splits) to remove more levels:

from urllib.parse import urlparse

my_url = 'https://tutorialreference.com/images/nature/wallpaper.jpg'
parsed_url = urlparse(my_url)

base = parsed_url.netloc
path = parsed_url.path

with_path = base + path.rsplit('/', 2)[0]
print(with_path)  # Output: tutorialreference.com/images

Note

The rsplit('/', 2) call splits the path string at most twice from the right. Index [0] returns the part of the path that remains after the last two segments are removed (here the nature folder and the wallpaper.jpg file).
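The number of levels to remove can be turned into a parameter. The following is a rough sketch of that idea; the function name strip_path_segments and its levels parameter are hypothetical, and the function assumes a path without a trailing slash.

```python
from urllib.parse import urlparse


def strip_path_segments(url, levels=1):
    """Return the domain plus the path with the last `levels` segments removed.

    Hypothetical helper for illustration; not part of urllib.parse.
    """
    parsed = urlparse(url)
    # rsplit() splits at most `levels` times from the right; [0] is what remains
    trimmed_path = parsed.path.rsplit('/', levels)[0]
    return parsed.netloc + trimmed_path


my_url = 'https://tutorialreference.com/images/nature/wallpaper.jpg'

print(strip_path_segments(my_url, 1))  # Output: tutorialreference.com/images/nature
print(strip_path_segments(my_url, 2))  # Output: tutorialreference.com/images
print(strip_path_segments(my_url, 3))  # Output: tutorialreference.com
```

Asking for more levels than the path contains simply returns the domain, because rsplit() stops once there are no more '/' characters to split on.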
