Skip to main content

How to Solve xml.etree.ElementTree.ParseError: not well-formed (invalid token) in Python

The xml.etree.ElementTree.ParseError: not well-formed (invalid token) error occurs when Python's built-in XML parser (xml.etree.ElementTree) encounters invalid characters or malformed XML in the input string.

This guide explains the causes of this error and demonstrates how to resolve it, primarily by using the more robust lxml library, and secondarily by cleaning the input string.

Understanding the Error: Invalid XML

xml.etree.ElementTree (often imported as ET) is Python's built-in XML parsing library. It's relatively strict about XML validity. The not well-formed (invalid token) error means the input string:

  • Contains characters that are not allowed in XML (e.g., control characters).
  • Has malformed XML structure (e.g., mismatched tags, invalid attributes).

Incorrect Example (Raises Error):

import xml.etree.ElementTree as ET

data = """
<body>
<p>tutorialreference.com \x08</p>
</body>"""

# ⛔️ xml.etree.ElementTree.ParseError: not well-formed (invalid token)
# print(ET.XML(data))
  • The example throws an exception because of an invalid character \x08 present in the string.

The lxml library is a more powerful and forgiving XML/HTML parser. It can often handle invalid characters and some structural issues that xml.etree.ElementTree can not.

Installation:

pip install lxml
# or
pip3 install lxml

Usage:

from lxml import etree

data = """
<body>
<p>tutorialreference.com \x08</p>
</body>"""

parser = etree.XMLParser(recover=True) # Enable recovery mode
root = etree.fromstring(data, parser=parser)

print(root.tag) # Output: body
print(etree.tostring(root)) # Output: b'<body>\n <p>tutorialreference.com </p>\n</body>'
print(root.find('p').tag) # Output: p
  • etree.XMLParser(recover=True): This creates an XML parser with recovery mode enabled. This tells lxml to try its best to parse the input, even if it contains errors. It will often succeed in producing a usable parse tree, even with invalid characters.
  • etree.fromstring(data, parser=parser): Parses the string using the configured parser.
  • etree.tostring(root): Serializes the parsed XML back to a string (as bytes). Notice that lxml may have corrected some errors in the process.

lxml is generally much more robust than xml.etree.ElementTree for handling real-world, potentially malformed XML/HTML. It's the recommended solution for most parsing tasks.

Solution 2: Removing Invalid Characters (Less Reliable)

If you know which characters are causing problems, you can try to remove them before parsing with xml.etree.ElementTree. However, this is generally less reliable than using lxml and should be used with caution. You might accidentally remove valid data.

import xml.etree.ElementTree as ET

data = """
<body>
<p>tutorialreference.com \x08</p>
</body>"""

data = data.replace('\x08', '') # Remove the invalid character

root = ET.XML(data) # Parse the string using xml.etree
print(root.tag) # Output: body
print(ET.tostring(root)) # Output: b'<body>\n <p>tutorialreference.com </p>\n</body>'
print(root.find('p').tag) # Output: p

  • The replace('\x08', '') replaces the invalid character with an empty string, making the string valid for parsing using xml.etree.ElementTree.
  • You should use this only as a last resort option, as it will only work if you already know the invalid characters in the string.

Conclusion

The xml.etree.ElementTree.ParseError: not well-formed (invalid token) error in Python indicates invalid XML input.

  • The best solution is to use the lxml library with its recovery mode (etree.XMLParser(recover=True)), as it can often handle imperfect XML.
  • If you must use the built-in xml.etree.ElementTree, you can attempt to pre-process the string to remove invalid characters, but this is fragile and error-prone.
  • Always prioritize using a robust parser like lxml when dealing with potentially malformed XML/HTML.