How to Remove Non-UTF-8 Characters from Strings in Python
This guide explains how to remove non-UTF-8 characters from strings and files in Python.
This is a common task when dealing with text data from various sources that might contain byte sequences that are not valid UTF-8.
We'll cover the correct use of encode() and decode() with error handling, and provide examples for both strings and files.
Understanding UTF-8 and Encodings
- Unicode: A standard that assigns a unique number (a "code point") to every character in every language. It's a character set.
- UTF-8: A variable-width character encoding capable of encoding all valid Unicode code points. It's the dominant encoding for the web, and in Python 3 it is the default for source files and for str.encode() and bytes.decode(). Variable-width means different characters use different numbers of bytes (from one to four).
- Encoding: The process of converting a string (Unicode characters) into a sequence of bytes (using a specific encoding like UTF-8).
- Decoding: The process of converting a sequence of bytes into a string (using the correct encoding).
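The definitions above can be illustrated with a short sketch. The sample characters are chosen here purely for illustration; the loop encodes each one to UTF-8 bytes, decodes it back, and records how many bytes it needed.

```python
# Each character maps to one code point; UTF-8 stores it in 1 to 4 bytes.
samples = ['a', 'é', '€', '😀']

for ch in samples:
    encoded = ch.encode('utf-8')       # str -> bytes (encoding)
    decoded = encoded.decode('utf-8')  # bytes -> str (decoding)
    print(f'U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded!r}')
    assert decoded == ch               # The round trip is lossless

byte_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]
```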
The problem arises when bytes read from a file or another source are not valid UTF-8, or when a string contains code points that cannot be encoded as UTF-8, such as lone surrogates.
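As a rough illustration of how this problem shows up in practice, the sketch below assumes some sample bytes in which 0x86 appears where a UTF-8 sequence cannot start. It shows strict decoding failing, and then how the surrogateescape error handler can carry such bytes into a string as lone surrogates, which in turn cannot be encoded.

```python
raw = b'\x86tutorialreference.com\x86'  # 0x86 cannot start a UTF-8 sequence

try:
    raw.decode('utf-8')  # Strict decoding is the default
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(f'Decoding failed: {exc.reason}')

# surrogateescape maps each undecodable byte to a lone surrogate code point
smuggled = raw.decode('utf-8', errors='surrogateescape')
print(ascii(smuggled))

try:
    smuggled.encode('utf-8')  # Lone surrogates have no UTF-8 representation
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(decode_failed, encode_failed)  # True True
```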
Removing Non-UTF-8 Characters from a String
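The core technique is to encode the string to bytes (using UTF-8), ignoring any errors, and then decode the result back to a string (again using UTF-8). Note that a character such as '\x86' (U+0086) is a valid code point that UTF-8 can encode, so it would pass through unchanged; only code points that UTF-8 cannot represent, such as lone surrogates, are dropped:

my_str = '\udc86tutorialreference.com\udc86'  # Lone surrogates cannot be encoded in UTF-8

result = my_str.encode(
    'utf-8', errors='ignore'  # Encode to bytes, IGNORE errors
).decode('utf-8')  # Decode back to a string

print(result)  # Output: tutorialreference.com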
- my_str.encode('utf-8', errors='ignore'): This attempts to encode the string my_str into UTF-8 bytes. The crucial part is errors='ignore', which tells the encode() method to skip any characters that cannot be encoded in UTF-8. Other options for errors include:
  - 'strict' (the default): Raises a UnicodeEncodeError.
  - 'replace': Replaces unencodable characters with a replacement character (usually ?).
  - 'ignore' is what we want here to remove characters.
- .decode('utf-8'): This decodes the resulting bytes (which are now guaranteed to be valid UTF-8) back into a string. Because we've already removed the problematic characters during encoding, this decoding step won't raise an error.
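To compare the three error handlers described above side by side, here is a minimal sketch. The input string is an assumption made for the demonstration: it uses lone surrogates, which are code points that UTF-8 cannot encode.

```python
text = '\udc86tutorialreference.com\udc86'  # Lone surrogates: not encodable as UTF-8

ignored = text.encode('utf-8', errors='ignore')    # Drops unencodable characters
replaced = text.encode('utf-8', errors='replace')  # Substitutes '?' for each one

try:
    text.encode('utf-8')  # errors='strict' is the default
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

print(ignored)        # b'tutorialreference.com'
print(replaced)       # b'?tutorialreference.com?'
print(strict_raised)  # True
```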
Removing Non-UTF-8 Characters from a File
To clean a file, open it with errors='ignore' so that invalid bytes are dropped while the file is decoded, then process it line by line and write the cleaned lines to a new file (or overwrite the original, but be careful!):

with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:  # Iterate lazily instead of loading the whole file
        print(line, end='')  # Each line is already clean and keeps its own newline

- We open the file in read mode ('r').
- encoding='utf-8' decodes valid UTF-8 sequences correctly, and errors='ignore' silently drops any bytes that are not valid UTF-8. Without errors='ignore', reading a file that contains invalid bytes raises a UnicodeDecodeError.
- The loop processes each line individually.
- You could write the cleaned lines to a new file to avoid data loss if something goes wrong.
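Writing the cleaned lines to a new file could look roughly like the sketch below. The helper name clean_file and the file names are hypothetical, and the demo runs in a temporary directory so that no real file is touched.

```python
import tempfile
from pathlib import Path

def clean_file(src, dst):
    """Copy src to dst, dropping bytes that are not valid UTF-8 (hypothetical helper)."""
    # newline='' keeps the original line endings untouched on every platform
    with open(src, 'r', encoding='utf-8', errors='ignore', newline='') as fin, \
         open(dst, 'w', encoding='utf-8', newline='') as fout:
        for line in fin:
            fout.write(line)

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'example.txt'
    dst = Path(tmp) / 'example.clean.txt'
    src.write_bytes(b'\x86first line\nsecond\x86 line\n')
    clean_file(src, dst)
    cleaned = dst.read_bytes()

print(cleaned)  # b'first line\nsecond line\n'
```

Keeping the source file untouched means the original bytes are still available if the cleaned output turns out to have lost something important.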
Handling Bytes Objects
If you start with a bytes object, the process is similar, but you decode first, then encode:
my_bytes = b'\x86tutorialreference.com\x86'  # Invalid UTF-8 bytes
# my_bytes = 'tutorialreference.com'.encode('utf-8')  # Valid UTF-8 bytes pass through unchanged
result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')
print(result) # Output: b'tutorialreference.com'
- my_bytes.decode('utf-8', errors='ignore'): Attempts to decode the bytes to a string, ignoring errors.
- .encode('utf-8'): Encodes the (potentially cleaned) string back to bytes. This ensures the result is a valid UTF-8 byte string.
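If silently dropping bytes is too lossy for your use case, decode() accepts other error handlers as well. The sketch below assumes the same kind of invalid sample bytes and compares 'ignore' with 'replace' and 'backslashreplace', which keep a visible trace of what was removed.

```python
raw = b'\x86tutorialreference.com\x86'  # Invalid UTF-8 bytes (sample data)

dropped = raw.decode('utf-8', errors='ignore')            # Invalid bytes vanish
marked = raw.decode('utf-8', errors='replace')            # Each becomes U+FFFD
escaped = raw.decode('utf-8', errors='backslashreplace')  # Each becomes a \xNN escape

print(dropped)        # tutorialreference.com
print(ascii(marked))  # '\ufffdtutorialreference.com\ufffd'
print(escaped)        # \x86tutorialreference.com\x86
```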