How to Remove HTML Tags from Strings in Python

Cleaning text data often involves removing HTML tags.

This guide explores several methods for stripping HTML tags from strings in Python: regular expressions, and the parsing libraries xml.etree.ElementTree, lxml, BeautifulSoup, and html.parser. We'll cover the strengths and weaknesses of each approach.

Removing HTML Tags with Regular Expressions

Regular expressions offer a powerful way to match patterns, including HTML tags, and strip them from a string.

import re

html_string = """
<div>
    <ul>
    <li>tutorial</li>
    <li>reference</li>
    <li>com</li>
    </ul>
</div>
"""

pattern = re.compile('<.*?>')  # Regex pattern
result = re.sub(pattern, '', html_string)
print(result)

Output (leading indentation and blank lines omitted):

tutorial
reference
com

re.compile('<.*?>') creates a regular expression pattern that matches HTML tags.
The pattern <.*?> matches everything from a < to the nearest >, brackets included. The . matches any character except a newline, the * repeats it zero or more times, and the trailing ? makes the match non-greedy (minimal). Without the ?, a single match would run from the first opening bracket on a line to the last closing bracket on that line.
re.sub(pattern, '', html_string) replaces every match of the pattern with an empty string (''), effectively removing the tags.
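The difference between the greedy and non-greedy forms is easiest to see side by side. The following is a minimal sketch; the sample strings and variable names are illustrative assumptions, not part of the original example. It also shows that, because . does not match a newline, a tag split across two lines is left in place unless re.DOTALL is passed.

```python
import re

sample = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs from the first '<' to the last '>', so the text is removed too.
greedy = re.sub(r"<.*>", "", sample)

# Non-greedy: .*? stops at the nearest '>', so only the tags are removed.
non_greedy = re.sub(r"<.*?>", "", sample)

print(repr(greedy))      # ''
print(repr(non_greedy))  # 'bold and italic'

# A tag that spans two lines (illustrative input).
multiline = '<a\nhref="x">link</a>'

# Without DOTALL, '.' cannot cross the newline, so the opening tag survives.
single_line_only = re.sub(r"<.*?>", "", multiline)

# With DOTALL, '.' also matches '\n' and the whole tag is removed.
with_dotall = re.sub(r"<.*?>", "", multiline, flags=re.DOTALL)

print(repr(single_line_only))  # '<a\nhref="x">link'
print(repr(with_dotall))       # 'link'
```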

Removing HTML Tags with `xml.etree.ElementTree`

The xml.etree.ElementTree module is an XML parser, so it works only on HTML that is also well-formed XML (every tag closed, a single root element):

import xml.etree.ElementTree as ET

html_string = """
<div>
    <ul>
    <li>tutorial</li>
    <li>reference</li>
    <li>com</li>
    </ul>
</div>
"""

result = ''.join(ET.fromstring(html_string).itertext())
print(result)

Output (leading indentation and blank lines omitted):

tutorial
reference
com

ET.fromstring() parses the string into an element tree. It accepts only well-formed XML and raises ParseError otherwise, for example on an unclosed <br> tag.
itertext() extracts the text content from all elements as an iterator.
The ''.join(...) then joins the text contents into a single string, effectively stripping HTML tags.
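Because ElementTree is a strict XML parser, it helps to see what happens when the input is not well-formed. The sketch below wraps the same fromstring()/itertext() call in a hypothetical helper, strip_tags_xml, that returns None instead of raising; the helper name and the sample inputs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def strip_tags_xml(markup):
    """Return the text of well-formed markup, or None if it is not valid XML."""
    try:
        return "".join(ET.fromstring(markup).itertext())
    except ET.ParseError:
        # Raised for unclosed tags, multiple root elements, and similar problems.
        return None

well_formed = strip_tags_xml("<p>Hello <b>world</b></p>")
malformed = strip_tags_xml("<p>Line one<br>Line two</p>")  # <br> is never closed

print(well_formed)  # Hello world
print(malformed)    # None
```

Returning None is only one possible design; a caller could just as well catch ParseError and fall back to a more tolerant parser.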

Removing HTML Tags with `lxml`

The lxml library, specifically optimized for XML and HTML, can remove tags even from malformed HTML. Install it using: pip install lxml.

from lxml.html import fromstring
html_string = """
<div>
  <ul>
    <li>tutorial</li>
    <li>reference</li>
    <li>com</li>
  </ul>
</div>
"""
result = fromstring(html_string).text_content()
print(result)

Output (leading indentation and blank lines omitted):

tutorial
reference
com

fromstring() creates an lxml element from the HTML string.
text_content() extracts all inner text, ignoring the HTML markup.

Removing HTML Tags with `BeautifulSoup`

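BeautifulSoup is a library for parsing HTML and XML, widely used in web scraping. Install it using: pip install beautifulsoup4. The example below uses the 'lxml' parser, which also requires pip install lxml; passing 'html.parser' instead uses Python's built-in parser and needs no extra install.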

from bs4 import BeautifulSoup
html_string = """
<div>
    <ul>
    <li>tutorial</li>
    <li>reference</li>
    <li>com</li>
    </ul>
</div>
"""
result = BeautifulSoup(html_string, 'lxml').text

print(result)

Output (leading indentation and blank lines omitted):

tutorial
reference
com

BeautifulSoup(html_string, 'lxml') parses the HTML string into a navigable parse tree.
The .text attribute extracts all the text content, stripping HTML tags, much like lxml's text_content() method.

Removing HTML Tags with `HTMLParser`

The built-in html.parser module provides the HTMLParser class, which lets you extract text by overriding its handler methods:

from html.parser import HTMLParser

class HTMLTagsRemover(HTMLParser):
    def __init__(self):
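        # convert_charrefs=True (the default) decodes entities such as &amp;
        # before handle_data sees them; with False they would be dropped here.
        super().__init__(convert_charrefs=True)
        self.fed = []  # text chunks collected by handle_data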

    def handle_data(self, data):
        self.fed.append(data)

    def get_data(self):
        return ''.join(self.fed)

def remove_html_tags(value):
    remover = HTMLTagsRemover()
    remover.feed(value)
    remover.close()
    return remover.get_data()

html_string = """
<div>
    <ul>
    <li>tutorial</li>
    <li>reference</li>
    <li>com</li>
    </ul>
</div>
"""

print(remove_html_tags(html_string))

Output (leading indentation and blank lines omitted):

tutorial
reference
com

HTMLTagsRemover extends HTMLParser and overrides the handle_data method to store each chunk of text in a list called fed.
remove_html_tags creates an instance of the class, passes the string to the feed method, calls close to flush any buffered data, and returns the extracted text.
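Overriding more handler methods gives finer control over what counts as text. As a sketch of that idea, the hypothetical VisibleTextExtractor below also overrides handle_starttag and handle_endtag so that the contents of script and style elements are skipped. The class name, the visible_text helper, and the sample input are assumptions for illustration.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text, skipping the contents of <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def visible_text(markup):
    parser = VisibleTextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

result = visible_text("<p>Fish &amp; chips</p><script>var x = 1 < 2;</script>")
print(result)  # Fish & chips
```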

Choosing the Right Method

For simple, trusted markup, regular expressions or xml.etree.ElementTree are often sufficient. Keep in mind that the regex shown here misses tags that span several lines or contain a > inside an attribute value, and that ElementTree rejects anything that is not well-formed XML.
For potentially malformed HTML or web scraping, BeautifulSoup or lxml provide robust parsing.
HTMLParser also tolerates malformed HTML and needs no third-party install, but BeautifulSoup and lxml require less code.
If you're already using one of these libraries, stick with it unless it doesn't support your use case.
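To make the trade-off concrete, the sketch below runs the regex approach and a minimal HTMLParser subclass on the same input, where an attribute value contains a > character. The TextCollector class, the strip_with_parser helper, and the sample string are illustrative assumptions.

```python
import re
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Minimal parser-based tag stripper used only for this comparison."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_with_parser(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

tricky = '<a title="a > b">link</a>'

# The regex stops at the first '>', which sits inside the attribute value.
regex_result = re.sub(r"<.*?>", "", tricky)

# The parser understands quoted attribute values and keeps only the text.
parser_result = strip_with_parser(tricky)

print(repr(regex_result))   # ' b">link'
print(repr(parser_result))  # 'link'
```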
