How to Remove HTML Tags from Strings in Python
Cleaning text data often involves removing HTML tags.
This guide explores several effective methods for stripping HTML tags from strings in Python, using regular expressions and specialized libraries such as xml.etree.ElementTree, lxml, BeautifulSoup, and html.parser. We'll cover the strengths and weaknesses of each approach.
Removing HTML Tags with Regular Expressions
Regular expressions offer a powerful way to match patterns, including HTML tags, and strip them from a string.
import re
html_string = """
<div>
<ul>
<li>tutorial</li>
<li>reference</li>
<li>com</li>
</ul>
</div>
"""
pattern = re.compile('<.*?>') # Regex pattern
result = re.sub(pattern, '', html_string)
print(result)
Output:
tutorial
reference
com
- re.compile('<.*?>') creates a regular expression pattern that matches HTML tags.
- The pattern <.*?> matches any characters between < and >, including the brackets themselves.
- .* matches any character except a newline (.) zero or more times (*), and the question mark (?) makes the match non-greedy (minimal). Without it, everything from the first opening bracket to the last closing bracket on a line would be matched as a single match.
- re.sub(pattern, '', html_string) replaces every match of the pattern with an empty string (''), effectively removing the tags.
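To see why the non-greedy ? matters, and where this pattern stops working, here is a minimal sketch. The sample strings are made up for illustration; re.DOTALL is the standard flag that lets . match newlines as well.

```python
import re

sample = "<b>bold</b> and <i>italic</i>"

# Non-greedy: each tag is matched on its own, so the text between tags survives.
non_greedy = re.sub(r"<.*?>", "", sample)
print(non_greedy)  # bold and italic

# Greedy: a single match runs from the first "<" to the last ">".
greedy = re.sub(r"<.*>", "", sample)
print(repr(greedy))  # ''

# A tag that spans two lines is missed, because "." does not match "\n" by default.
multiline = '<a\nhref="#">link</a>'
missed = re.sub(r"<.*?>", "", multiline)
print(repr(missed))  # '<a\nhref="#">link'

# re.DOTALL lets "." match newlines too, so the whole opening tag is removed.
fixed = re.sub(r"<.*?>", "", multiline, flags=re.DOTALL)
print(fixed)  # link
```

A pattern like this also leaves the contents of script and style elements in place and can be confused by a > inside an attribute value, which is one reason the parser-based methods are more robust.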
Removing HTML Tags with xml.etree.ElementTree
The xml.etree.ElementTree module is designed for XML processing and can be applied to well-formed HTML:
import xml.etree.ElementTree as ET
html_string = """
<div>
<ul>
<li>tutorial</li>
<li>reference</li>
<li>com</li>
</ul>
</div>
"""
result = ''.join(ET.fromstring(html_string).itertext())
print(result)
Output:
tutorial
reference
com
- ET.fromstring() parses the string into an element tree. It only accepts well-formed XML, so every tag in the HTML must be closed and properly nested.
- itertext() returns an iterator over the text content of all elements.
- ''.join(...) then joins the text pieces into a single string, effectively stripping the HTML tags.
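Because ET.fromstring() is an XML parser, ordinary HTML that is not well-formed raises xml.etree.ElementTree.ParseError. The sketch below wraps the call so that failure is explicit; strip_tags_strict is a hypothetical helper name and the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def strip_tags_strict(markup):
    """Return the text content, or None if the markup is not well-formed XML.

    Hypothetical helper, for illustration only.
    """
    try:
        return ''.join(ET.fromstring(markup).itertext())
    except ET.ParseError:
        return None


# Well-formed: every tag is closed and properly nested.
print(strip_tags_strict("<p>Hello <b>world</b></p>"))  # Hello world

# <br> is never closed, which is valid HTML but not valid XML.
print(strip_tags_strict("<p>Hello<br>world</p>"))  # None

# HTML entities such as &nbsp; are not defined in plain XML either.
print(strip_tags_strict("<p>Hello&nbsp;world</p>"))  # None
```

Returning None is only one possible design; falling back to one of the HTML-aware parsers described in this guide is another.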
Removing HTML Tags with lxml
The lxml library, specifically optimized for XML and HTML, can remove tags even from malformed HTML. Install it using: pip install lxml.
from lxml.html import fromstring
html_string = """
<div>
<ul>
<li>tutorial</li>
<li>reference</li>
<li>com</li>
</ul>
</div>
"""
result = fromstring(html_string).text_content()
print(result)
Output:
tutorial
reference
com
- fromstring() creates an lxml element from the HTML string.
- text_content() extracts all inner text, ignoring the HTML markup.
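As a small illustration of the malformed-HTML claim, the sketch below feeds lxml a fragment with unclosed tags. The messy string is made up for this example, and it assumes lxml is installed.

```python
from lxml.html import fromstring

# <p> and <b> are never closed, and <br> is a void element with no end tag.
messy = "<div><p>Hello<br>wor<b>ld</div>"

# lxml's HTML parser repairs the tree before the text is extracted.
recovered = fromstring(messy).text_content()
print(recovered)  # Helloworld
```

Note that the text pieces are concatenated without any separator, so words on either side of a tag such as br run together.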
Removing HTML Tags with BeautifulSoup
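BeautifulSoup is a library specialized in web scraping and parsing. Install it using: pip install beautifulsoup4. The example below passes 'lxml' as the parser, which also requires lxml to be installed; the built-in 'html.parser' works without extra dependencies.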
from bs4 import BeautifulSoup
html_string = """
<div>
<ul>
<li>tutorial</li>
<li>reference</li>
<li>com</li>
</ul>
</div>
"""
result = BeautifulSoup(html_string, 'lxml').text
print(result)
Output:
tutorial
reference
com
- BeautifulSoup(html_string, 'lxml') parses the HTML string into a navigable parse tree.
- The .text attribute extracts all the text content, stripping HTML tags, similarly to the text_content() method from lxml.
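When the markup has no whitespace between elements, .text runs the words together. BeautifulSoup's get_text() method accepts separator and strip arguments that help here. A minimal sketch, assuming beautifulsoup4 is installed; it uses the built-in 'html.parser' so that lxml is not required, and the compact string is invented for illustration.

```python
from bs4 import BeautifulSoup

compact = "<ul><li>tutorial</li><li>reference</li><li>com</li></ul>"
soup = BeautifulSoup(compact, "html.parser")

# .text simply concatenates every text node.
joined = soup.text
print(joined)  # tutorialreferencecom

# separator is placed between text nodes; strip=True trims each node first.
spaced = soup.get_text(separator=" ", strip=True)
print(spaced)  # tutorial reference com
```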
Removing HTML Tags with HTMLParser
The built-in html.parser module provides the HTMLParser class, which lets you parse HTML and extract text by overriding its handler methods:
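from html.parser import HTMLParser

class HTMLTagsRemover(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; into text,
        # so they reach handle_data instead of being silently dropped
        super().__init__(convert_charrefs=True)
        self.reset()
        self.fed = []  # collected text fragments

    def handle_data(self, data):
        # Called for each run of text between tags
        self.fed.append(data)

    def get_data(self):
        return ''.join(self.fed)

def remove_html_tags(value):
    remover = HTMLTagsRemover()
    remover.feed(value)
    remover.close()  # flush any buffered text
    return remover.get_data()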
html_string = """
<div>
<ul>
<li>tutorial</li>
<li>reference</li>
<li>com</li>
</ul>
</div>
"""
print(remove_html_tags(html_string))
Output:
tutorial
reference
com
- HTMLTagsRemover extends HTMLParser and overrides the handle_data method, storing the parsed text fragments in a list called fed.
- remove_html_tags creates an instance of the class, passes the value string to the feed method, and then returns the extracted text.
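One detail worth checking with this approach is how character references such as &amp; are handled. With convert_charrefs=True they are decoded and delivered to handle_data; with convert_charrefs=False they are routed to handle_entityref and handle_charref, which do nothing unless overridden, so a parser that only implements handle_data loses them. The sketch below shows the difference; TextCollector and strip_tags are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup, convert_charrefs=True):
    parser = TextCollector(convert_charrefs)
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


sample = "<p>Fish &amp; chips</p>"

# The reference is decoded and kept as ordinary text.
print(repr(strip_tags(sample)))  # 'Fish & chips'

# The reference goes to handle_entityref, which is not overridden, so it vanishes.
print(repr(strip_tags(sample, convert_charrefs=False)))  # 'Fish  chips'
```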
Choosing the Right Method
- For simple HTML, regular expressions are often sufficient; xml.etree.ElementTree also works, provided the markup is well-formed.
- For potentially malformed HTML or web scraping, BeautifulSoup or lxml provide robust parsing.
- HTMLParser also copes with malformed HTML, but BeautifulSoup or lxml are easier to use.
- If you're already using one of these libraries, it is usually best to stick with it, unless it doesn't support your use case.