How to Remove Non-UTF-8 Characters from Strings in Python

This guide explains how to remove non-UTF-8 characters from strings and files in Python.

This is a common task when dealing with text data from various sources that might contain byte sequences that are not valid UTF-8.

We'll cover the correct use of encode() and decode() with error handling, and provide examples for both strings and files.

Understanding UTF-8 and Encodings

  • Unicode: A standard that assigns a unique number (a "code point") to every character in every language. It's a character set.
  • UTF-8: A variable-width character encoding capable of encoding all valid Unicode code points. It's the dominant encoding for the web and the default for Python 3 source files, str.encode(), and bytes.decode(). Variable-width means different characters use different numbers of bytes (from one to four).
  • Encoding: The process of converting a string (Unicode characters) into a sequence of bytes (using a specific encoding like UTF-8).
  • Decoding: The process of converting a sequence of bytes into a string (using the correct encoding).
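
As a small illustration of the definitions above, the following sketch encodes a few sample characters (chosen here purely for demonstration) and shows how many bytes UTF-8 needs for each, then decodes them back:

```python
# Each character is one code point, but UTF-8 uses 1 to 4 bytes to store it.
samples = ['a', 'é', '€', '😀']

for ch in samples:
    encoded = ch.encode('utf-8')       # str -> bytes
    decoded = encoded.decode('utf-8')  # bytes -> str
    print(f'U+{ord(ch):04X} -> {len(encoded)} byte(s), round trip ok: {decoded == ch}')

byte_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]
```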

The problem arises when you have bytes (in a bytes object or a file) that are not valid UTF-8, so decoding them with the default settings raises a UnicodeDecodeError.
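
To make the failure concrete, here is a minimal sketch using an illustrative byte string. The byte 0x86 is a UTF-8 continuation byte, so it is invalid when it appears without a lead byte:

```python
raw = b'\x86tutorialreference.com\x86'  # 0x86 cannot start a UTF-8 sequence

try:
    raw.decode('utf-8')  # errors='strict' is the default
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(f'Invalid byte at position {exc.start}: {exc.reason}')
```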

Removing Non-UTF-8 Characters from a String

A Python 3 str is a sequence of Unicode code points, and almost all of them can be encoded as UTF-8. The exception is lone surrogates (U+D800 to U+DFFF), which typically appear when bytes were decoded with errors='surrogateescape'. To drop them, encode the string to bytes (using UTF-8), ignoring any errors, and then decode the result back to a string (again using UTF-8):

my_str = '\udc86tutorialreference.com\udc86' # Lone surrogates cannot be encoded as UTF-8

result = my_str.encode(
    'utf-8', errors='ignore' # Encode to bytes, IGNORE errors
).decode('utf-8') # Decode back to a string

print(result) # Output: tutorialreference.com
  • my_str.encode('utf-8', errors='ignore'): This encodes the string my_str into UTF-8 bytes. The crucial part is errors='ignore', which tells the encode() method to skip any characters that cannot be encoded in UTF-8. Ordinary characters such as '\x86' (U+0086) are valid Unicode and pass through unchanged; only lone surrogates are dropped. Other options for errors include:
    • 'strict' (the default): Raises a UnicodeEncodeError.
    • 'replace': Replaces unencodable characters with a replacement character (?).
    • 'ignore' is what we want here to remove characters.
  • .decode('utf-8'): This decodes the resulting bytes (which are now guaranteed to be valid UTF-8) back into a string. Because the unencodable characters were already dropped during encoding, this decoding step won't raise an error.
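
To compare the three error handlers listed above, here is a short sketch. It assumes a sample string containing a lone surrogate ('\udc86'), which is one kind of character that UTF-8 cannot encode:

```python
text = 'abc\udc86def'  # '\udc86' is a lone surrogate (illustrative input)

ignored = text.encode('utf-8', errors='ignore')    # drops the surrogate
replaced = text.encode('utf-8', errors='replace')  # substitutes '?'

try:
    text.encode('utf-8')  # errors='strict' is the default
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

print(ignored)        # b'abcdef'
print(replaced)       # b'abc?def'
print(strict_raised)  # True
```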

Removing Non-UTF-8 Characters from a File

To clean a file, open it with errors='ignore' so that invalid byte sequences are dropped while the file is decoded, then process it line by line and write the cleaned lines to a new file (or overwrite the original, but be careful!):

with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines() # Reads the file into a list of strings; invalid bytes are skipped

for line in lines:
    print(line, end='') # Each line already ends with its own newline
  • We open the file in read mode ('r').
  • encoding='utf-8' decodes the file as UTF-8, and errors='ignore' drops any bytes that are not valid UTF-8. Without errors='ignore', the default 'strict' handler raises a UnicodeDecodeError at the first invalid byte.
  • The loop processes each line individually. No further encode/decode step is needed, because the lines were already cleaned when the file was decoded.
  • You could write the cleaned lines to a new file to avoid data loss if something goes wrong.
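
Writing the cleaned lines to a new file could look like the sketch below. The helper name clean_file and the file names are hypothetical, and the throwaway input file exists only to make the example self-contained:

```python
import os
import tempfile

def clean_file(src_path, dst_path):
    """Copy src_path to dst_path, dropping bytes that are not valid UTF-8."""
    # newline='' keeps the original line endings untouched on every platform
    with open(src_path, 'r', encoding='utf-8', errors='ignore', newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        for line in src:  # stream line by line instead of loading the whole file
            dst.write(line)

# Usage with a temporary file that contains two invalid bytes (0x86)
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'example.txt')
dst_path = os.path.join(tmp_dir, 'example_clean.txt')

with open(src_path, 'wb') as f:
    f.write(b'\x86first line\nsecond \x86line\n')

clean_file(src_path, dst_path)

with open(dst_path, 'rb') as f:
    cleaned = f.read()
print(cleaned)  # b'first line\nsecond line\n'
```

Keeping the original file and writing to a separate path means the source data is still available if the cleaning step removes more than expected.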

Handling Bytes Objects

If you start with a bytes object, reverse the order: decode first (ignoring errors), then encode. This is the most common case, because invalid UTF-8 only exists at the byte level:

my_bytes = 'tutorialreference.com'.encode('utf-8') # Valid UTF-8 bytes
# my_bytes = b'\x86tutorialreference.com\x86' # Invalid UTF-8 bytes

result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')
print(result) # Output: b'tutorialreference.com'
  • my_bytes.decode('utf-8', errors='ignore'): Attempts to decode the bytes to a string, ignoring errors.
  • .encode('utf-8'): Encodes the (potentially cleaned) string back to bytes. This ensures the result is a valid UTF-8 byte string.
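
The decode-then-encode round trip can be wrapped in a small helper. The function name strip_invalid_utf8 and the sample bytes below are illustrative assumptions; the sketch also shows errors='replace', which marks each invalid byte with U+FFFD instead of silently dropping it:

```python
def strip_invalid_utf8(data: bytes) -> bytes:
    """Return data with every byte that is not part of valid UTF-8 removed."""
    return data.decode('utf-8', errors='ignore').encode('utf-8')

dirty = b'\x86caf\xc3\xa9\xff'  # valid 'café' surrounded by two invalid bytes

cleaned = strip_invalid_utf8(dirty)
marked = dirty.decode('utf-8', errors='replace')  # U+FFFD marks each invalid byte

print(cleaned)                             # b'caf\xc3\xa9'
print(marked == '\ufffdcaf\u00e9\ufffd')   # True
```

Using 'replace' can be preferable when you want to see where data was lost rather than remove it without a trace.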