How to Remove \ufeff (BOM) Characters from Strings in Python

The \ufeff character (U+FEFF), known as the Byte Order Mark (BOM), sometimes appears at the beginning of text read from a file. Left in place, it can break string comparisons, parsing, and other text processing.

This guide explains how to remove or handle the BOM character effectively in Python, including using string replacement and proper encoding when opening files.

Removing \ufeff with `str.replace()`

The simplest way to remove the BOM character from a string is to use the str.replace() method:

my_str = '\ufefffirst line'
result = my_str.replace('\ufeff', '')
print(repr(result))  # Output: 'first line'

The replace('\ufeff', '') call replaces every occurrence of the BOM character with an empty string, so the result no longer contains it, whether the character was at the start of the string or elsewhere.
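Because str.replace() removes every occurrence, it can help to see how that differs from stripping only a leading BOM. The following is a minimal sketch; the helper name strip_leading_bom is hypothetical and not part of the standard library.

```python
def strip_leading_bom(text: str) -> str:
    """Remove one leading U+FEFF and leave any later occurrences alone.

    Hypothetical helper for illustration only.
    """
    if text.startswith('\ufeff'):
        return text[1:]
    return text


my_str = '\ufefffirst\ufeffline'

# Only the leading BOM is removed
print(repr(strip_leading_bom(my_str)))     # Output: 'first\ufeffline'

# Every BOM is removed
print(repr(my_str.replace('\ufeff', '')))  # Output: 'firstline'
```

Which behavior is preferable depends on whether a U+FEFF in the middle of the text is meaningful for your data.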

Handling BOM Characters When Opening Files

If you encounter the BOM character when reading from a file, pass the utf-8-sig encoding to the open() function:

with open('example.txt', 'r', encoding='utf-8-sig') as f:
    lines = f.readlines()
    print(lines) # Output will depend on file contents, but will not contain the BOM

The encoding='utf-8-sig' argument strips a leading BOM, if one is present, while the file is being read, so there is no need to remove the character afterwards with str.replace().
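To see the difference between the two encodings side by side, the sketch below writes a small temporary file that starts with a UTF-8 BOM and reads it back twice. The file contents are made-up sample data.

```python
import os
import tempfile

# Create a temporary UTF-8 file that begins with a BOM (sample data)
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'\xef\xbb\xbffirst line\n')
    path = tmp.name

try:
    # Plain utf-8 keeps the BOM as the first character
    with open(path, 'r', encoding='utf-8') as f:
        with_bom = f.read()

    # utf-8-sig drops the BOM while decoding
    with open(path, 'r', encoding='utf-8-sig') as f:
        without_bom = f.read()
finally:
    os.remove(path)

print(repr(with_bom))     # Output: '\ufefffirst line\n'
print(repr(without_bom))  # Output: 'first line\n'
```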

Understanding BOM and Encoding

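The BOM character signals the byte order (endianness) of text encoded in formats like UTF-16 and UTF-32, where each code unit spans more than one byte. UTF-8 is a byte-oriented encoding with no byte-order ambiguity, so a BOM there only acts as a signature and is generally neither necessary nor recommended.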
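As a quick illustration, the standard codecs module exposes the BOM byte sequences as constants. Each one is the same code point, U+FEFF, encoded with a different encoding or byte order. The variable names below are chosen for this example.

```python
import codecs

# BOM byte sequences defined by the standard library
print(codecs.BOM_UTF8)      # Output: b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_LE)  # Output: b'\xff\xfe'
print(codecs.BOM_UTF16_BE)  # Output: b'\xfe\xff'

# The same bytes result from encoding U+FEFF directly
utf8_bom = '\ufeff'.encode('utf-8')
utf16_le_bom = '\ufeff'.encode('utf-16-le')
utf16_be_bom = '\ufeff'.encode('utf-16-be')

print(utf8_bom == codecs.BOM_UTF8)          # Output: True
print(utf16_le_bom == codecs.BOM_UTF16_LE)  # Output: True
print(utf16_be_bom == codecs.BOM_UTF16_BE)  # Output: True
```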

Encoding with BOM: UTF-8-SIG and UTF-16

UTF-8-SIG: The utf-8-sig encoding writes a BOM at the start of the encoded output and removes a leading BOM, if present, during decoding.

my_bytes = 'tutorialreference.com'.encode('utf-8-sig') # Encode with BOM
print(my_bytes)  # Output: b'\xef\xbb\xbftutorialreference.com'

my_str = my_bytes.decode('utf-8-sig') # Decode with utf-8-sig
print(my_str)    # Output: tutorialreference.com

The BOM is added during encoding but removed when using 'utf-8-sig' to decode.
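The two UTF-8 codecs can also be mixed, which is where a stray \ufeff usually comes from. The sketch below, with variable names chosen for illustration, decodes BOM-prefixed bytes with plain utf-8 and BOM-free bytes with utf-8-sig.

```python
data_with_bom = 'tutorialreference.com'.encode('utf-8-sig')
data_without_bom = 'tutorialreference.com'.encode('utf-8')

# Plain utf-8 treats the BOM as ordinary text and keeps it
kept = data_with_bom.decode('utf-8')
print(repr(kept))      # Output: '\ufefftutorialreference.com'

# utf-8-sig also accepts input that has no BOM
tolerant = data_without_bom.decode('utf-8-sig')
print(repr(tolerant))  # Output: 'tutorialreference.com'
```

Since utf-8-sig handles input both with and without a BOM, it is a reasonable choice for decoding when you are unsure whether a UTF-8 file carries one.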

UTF-16: Like utf-8-sig, the utf-16 codec writes a BOM when encoding and removes it automatically when the same codec is used to decode:

my_bytes = 'tutorialreference.com'.encode('utf-16') # Encode with BOM
print(my_bytes) # Output (byte order follows the platform; shown for a little-endian system): b'\xff\xfet\x00u\x00t\x00o\x00r\x00i\x00a\x00l\x00r\x00e\x00f\x00e\x00r\x00e\x00n\x00c\x00e\x00.\x00c\x00o\x00m\x00'

my_str = my_bytes.decode('utf-16') # Decode using utf-16 codec.
print(my_str) # Output: tutorialreference.com
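For comparison, the endianness-specific codecs utf-16-le and utf-16-be behave differently from utf-16: they neither write a BOM nor strip one. A small sketch, using a short sample string for readability:

```python
import codecs

text = 'hi'

# Explicit byte order: no BOM is written
le_bytes = text.encode('utf-16-le')
be_bytes = text.encode('utf-16-be')
print(le_bytes)  # Output: b'h\x00i\x00'
print(be_bytes)  # Output: b'\x00h\x00i'

# Bytes that start with a little-endian BOM
bom_prefixed = codecs.BOM_UTF16_LE + le_bytes

# utf-16 reads the BOM to pick the byte order, then drops it
auto_decoded = bom_prefixed.decode('utf-16')
print(repr(auto_decoded))  # Output: 'hi'

# utf-16-le keeps the BOM as a U+FEFF character
le_decoded = bom_prefixed.decode('utf-16-le')
print(repr(le_decoded))    # Output: '\ufeffhi'
```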

Encoding without BOM: UTF-8

The utf-8 codec does not write a BOM, so the encoded bytes contain only the text itself:

my_bytes = 'tutorialreference.com'.encode('utf-8') # Encoding without BOM
print(my_bytes)  # Output: b'tutorialreference.com'
my_str = my_bytes.decode('utf-8') # Decoding
print(my_str)    # Output: tutorialreference.com

When to Use `str.replace()`

Use str.replace('\ufeff', '') when you already have a decoded string that contains the BOM character:

my_str = '\ufefftutorialreference.com'
result = my_str.replace('\ufeff', '')
print(repr(result)) # Output: 'tutorialreference.com'

This approach is particularly helpful when the text was decoded without BOM handling (for example, because the file's encoding wasn't known before reading), or when the BOM character isn't at the beginning but somewhere else in the string.
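One way a BOM can end up in the middle of a string is when several BOM-prefixed chunks are decoded with plain utf-8 and then joined. The sketch below uses made-up sample data to show that case.

```python
# Two UTF-8 chunks that each start with a BOM (sample data)
part_one = b'\xef\xbb\xbfname,site\n'.decode('utf-8')
part_two = b'\xef\xbb\xbfdocs,tutorialreference.com\n'.decode('utf-8')

# After joining, the second BOM sits in the middle of the string
combined = part_one + part_two
print(repr(combined))
# Output: '\ufeffname,site\n\ufeffdocs,tutorialreference.com\n'

# str.replace() removes both occurrences
cleaned = combined.replace('\ufeff', '')
print(repr(cleaned))
# Output: 'name,site\ndocs,tutorialreference.com\n'
```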
