How to Resolve Python Error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92..."

The UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position X: invalid start byte (or similar errors with different byte values like 0x81, 0x8d, 0x91, etc.) is a specific Python error indicating a mismatch during text decoding. It means you are trying to interpret a sequence of bytes as text using the UTF-8 encoding standard, but Python encountered a byte (like 0x92) that is invalid according to the UTF-8 rules for starting a character sequence. This typically happens when the data was actually encoded using a different standard, most commonly cp1252 (Windows-1252) or a similar legacy encoding.

This guide explains the cause of this specific decoding error and provides solutions focused on using the correct encoding.

Understanding the Error: UTF-8 vs. Legacy Encodings (cp1252)

  • UTF-8: A variable-width encoding standard capable of representing every character in the Unicode standard. It's the dominant encoding on the web and in modern systems. Certain byte values (especially those >= 128) have specific meanings indicating the start or continuation of multi-byte character sequences.
  • cp1252 (Windows-1252): A single-byte encoding common on older Windows systems for Western European languages. It assigns characters to byte values 128-255 that differ from UTF-8's interpretation. Bytes like 0x91, 0x92, 0x93, 0x94 represent specific characters (like smart quotes ‘ ’ “ ”) in cp1252.
  • latin-1 (ISO-8859-1): Another single-byte encoding. It maps byte values 0-255 directly to Unicode code points U+0000 to U+00FF. It can decode any byte sequence without error but might misinterpret bytes intended for other encodings (like cp1252).
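To make the difference concrete, here is a minimal sketch showing how the same byte, 0x92, is interpreted under each of these encodings:

# The single byte 0x92 under the three encodings discussed above
raw = b'\x92'

# cp1252: 0x92 is the right single quotation mark ’ (U+2019)
print(raw.decode('cp1252'))         # ’

# latin-1: 0x92 maps directly to code point U+0092 (a C1 control character)
print(repr(raw.decode('latin-1')))  # '\x92'

# UTF-8: 0x92 is a continuation byte and can never start a sequence
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0x92 in position 0: invalid start byte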

The error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92... invalid start byte occurs because the byte 0x92 (which represents the right single quotation mark ’, U+2019, in cp1252) is not a valid starting byte for any character sequence under the rules of UTF-8: bytes in the range 0x80-0xBF can only appear as continuation bytes inside a multi-byte sequence, never as the first byte. When the UTF-8 decoder encounters such a byte where it expects the start of a character, it fails.

The Cause: Decoding cp1252 (or similar) Bytes as UTF-8

This error happens when your byte data was originally encoded using cp1252 (or another incompatible legacy encoding), but you tell Python to decode it using 'utf-8'.

Scenario: Decoding Byte Objects

# String containing a character common in cp1252 but invalid as a start byte in UTF-8
original_string = "This isn’t UTF-8" # Contains '’' (U+2019, often 0x92 in cp1252)

# Encode using cp1252
cp1252_bytes = original_string.encode('cp1252')
print(f"Encoded bytes (cp1252): {cp1252_bytes}")
# Output: b'This isn\x92t UTF-8' (Note the 0x92)

try:
    # ⛔️ UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 8: invalid start byte
    # Trying to decode cp1252 bytes as UTF-8
    decoded_string = cp1252_bytes.decode('utf-8')
    print(decoded_string)
except UnicodeDecodeError as e:
    print(e)

Scenario: Reading Files (open(), pandas.read_csv())

This is common when reading files (CSVs, text files, etc.) that were saved on Windows systems with a default legacy encoding and then opened with encoding='utf-8'.

# Assume 'mydata.csv' was saved with cp1252 encoding and contains 'isn’t'

try:
    # ⛔️ UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92...
    with open('mydata.csv', 'r', encoding='utf-8') as f:
        content = f.read()
except UnicodeDecodeError as e:
    print(f"Error reading file with UTF-8: {e}")
except FileNotFoundError:
    print("Error: File not found.")

# Similarly with pandas
import pandas as pd

try:
    # ⛔️ UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92...
    df = pd.read_csv('mydata.csv', encoding='utf-8')
except UnicodeDecodeError as e:
    print(f"Error reading CSV with pandas/UTF-8: {e}")
except FileNotFoundError:
    print("Error: File not found.")

Solution 1: Specify the Correct Encoding (e.g., cp1252, latin-1)

The best solution is to identify the actual encoding used to create the data and specify that encoding during decoding. If the data likely came from a Windows source and contains characters like smart quotes, cp1252 is a strong candidate.

For bytes.decode()

original_string = "This isn’t UTF-8"
cp1252_bytes = original_string.encode('cp1252') # b'This isn\x92t UTF-8'

# ✅ Decode using the original encoding
try:
    decoded_string = cp1252_bytes.decode('cp1252')
    print(f"Correctly decoded: '{decoded_string}'")
    # Output: Correctly decoded: 'This isn’t UTF-8'
except UnicodeDecodeError as e:
    print(f"Unexpected error with cp1252: {e}")

For open()

# Assume 'mydata.txt' is saved with cp1252 encoding
filename = 'mydata.txt'

try:
    # ✅ Specify the correct encoding when opening
    with open(filename, 'r', encoding='cp1252') as f:
        content = f.read()
    print(f"File content read with cp1252:\n'{content}'")
except FileNotFoundError:
    print(f"Error: File '{filename}' not found.")
except UnicodeDecodeError as e:
    print(f"Unexpected error reading with cp1252: {e}")
except Exception as e:
    print(f"Other error reading file: {e}")

For pandas.read_csv()

import pandas as pd
filename = 'mydata.csv' # Assume cp1252 encoded CSV

try:
    # ✅ Specify the correct encoding for pandas
    df = pd.read_csv(filename, encoding='cp1252')
    print("Pandas DataFrame read successfully with cp1252.")
    # print(df.head())
except FileNotFoundError:
    print(f"Error: File '{filename}' not found.")
except UnicodeDecodeError as e:
    print(f"Unexpected error reading CSV with cp1252: {e}")
except Exception as e:
    print(f"Other error reading CSV: {e}")

Identifying the Correct Encoding

  • Metadata: Check if the data source provides encoding information.
  • Common Legacy Encodings: If UTF-8 fails and the data originates from Windows, try cp1252 (or windows-1252). If it's older European data, latin-1 (or iso-8859-1) is another possibility. For Mac-originated data, mac_roman might be relevant.
  • Trial and Error: Try decoding/opening with cp1252 or latin-1. latin-1 will decode almost any byte but might produce garbage characters if the encoding was actually something else like cp1252. If cp1252 works and produces the expected characters (like smart quotes), it's likely correct.
  • Detection Libraries: Tools like the chardet library (pip install chardet) can guess the encoding, but detection is not foolproof.
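As a minimal sketch of the detection approach, the following uses chardet to guess the encoding of a file's raw bytes (the filename is illustrative, and the guess should still be verified against the decoded output):

import chardet  # pip install chardet

filename = 'mydata.csv'  # hypothetical file whose encoding is unknown

with open(filename, 'rb') as f:   # read raw bytes, no decoding yet
    raw_bytes = f.read()

guess = chardet.detect(raw_bytes)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
print(f"Detected encoding: {guess['encoding']} (confidence: {guess['confidence']:.2f})")

# Use the guess, but check that the decoded text looks correct before trusting it
text = raw_bytes.decode(guess['encoding'])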

Solution 2: Handle Decoding Errors (errors parameter - Use Cautiously)

If you cannot determine the correct encoding, or you are willing to accept potential data corruption, you can tell the decoder how to handle problematic bytes using the errors parameter.

Using errors='ignore' (Data Loss)

Silently discards bytes that cannot be decoded. Not recommended due to data loss.

cp1252_bytes = b'This isn\x92t UTF-8'
decoded_ignore = cp1252_bytes.decode('utf-8', errors='ignore')
print(f"Decoded (utf-8, ignore): '{decoded_ignore}'")
# Output: Decoded (utf-8, ignore): 'This isnt UTF-8'
# Note that 0x92 is lost

Using errors='replace' (Replacement Character)

Replaces each undecodable byte with a placeholder (the Unicode replacement character �, U+FFFD). Signals that errors occurred but alters the data.

cp1252_bytes = b'This isn\x92t UTF-8'
decoded_replace = cp1252_bytes.decode('utf-8', errors='replace')
print(f"Decoded (utf-8, replace): '{decoded_replace}'")
# Output: Decoded (utf-8, replace): 'This isn�t UTF-8'

Again, using the correct encoding (Solution 1) is strongly preferred over using errors='ignore' or 'replace'.

Solution 3: Read File in Binary Mode ('rb')

If you don't actually need to interpret the file as text within your Python script (e.g., you just need to read the raw bytes to upload, copy, or process them differently), open the file in binary read mode ('rb'). This reads the raw bytes without attempting any decoding.

filename = 'mydata.txt' # Could be any encoding

try:
    # ✅ Read raw bytes without decoding
    with open(filename, 'rb') as f:
        byte_content = f.read()
    print(f"Read {len(byte_content)} bytes from {filename}.")
    print(f"First 50 bytes: {byte_content[:50]}")
except FileNotFoundError:
    print(f"Error: File '{filename}' not found.")
except Exception as e:
    print(f"Error reading file in binary mode: {e}")

When opening in binary mode, do not specify the encoding argument.
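If you do pass encoding together with a binary mode, Python raises a ValueError rather than reading the file. A brief sketch (the filename is illustrative); the raw bytes can always be decoded later, once the correct encoding is known:

filename = 'mydata.txt'  # hypothetical file

# Combining 'rb' with an encoding argument is rejected outright
try:
    f = open(filename, 'rb', encoding='utf-8')
except ValueError as e:
    print(e)  # ValueError: binary mode doesn't take an encoding argument

# Instead, read the raw bytes and decode them explicitly afterwards
with open(filename, 'rb') as f:
    raw_bytes = f.read()
text = raw_bytes.decode('cp1252')  # assuming cp1252 turns out to be correct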

Debugging the Error

  1. Identify Operation: Is the error from bytes.decode() or open()/pd.read_csv()?
  2. Check Specified Encoding: Note the encoding used (often utf-8 by default or explicitly).
  3. Examine Byte Value: The error message (byte 0x92) tells you the problematic byte. Look up this byte value in common encoding charts (like cp1252) to see if it corresponds to a likely character in your source data (e.g., 0x91/0x92 are often smart quotes from Microsoft products).
  4. Try Decoding with Alternatives: Attempt decoding/opening with 'cp1252' or 'latin-1' to see if it resolves the error and produces legible text, as sketched below.
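A small helper along these lines can speed up step 4. It is only a sketch (the candidate list and filename are illustrative), and you should still inspect the result for garbled characters:

def try_encodings(filename, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Attempt to decode a file with several candidate encodings, in order."""
    with open(filename, 'rb') as f:
        raw_bytes = f.read()
    for encoding in candidates:
        try:
            text = raw_bytes.decode(encoding)
            print(f"Decoded successfully with {encoding!r}")
            return encoding, text
        except UnicodeDecodeError as e:
            print(f"{encoding!r} failed: {e}")
    return None, None

# Example usage (filename is illustrative)
encoding, text = try_encodings('mydata.csv')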

Conclusion

The UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92... (or similar bytes >= 128) occurs when trying to decode bytes as UTF-8 that were actually encoded using a different standard, typically a legacy single-byte encoding like cp1252.

The most reliable solution is to:

  1. Identify the correct original encoding of the data (often cp1252 for data from Windows sources containing characters like smart quotes).
  2. Specify that correct encoding during the decoding step:
    • byte_object.decode('cp1252')
    • open(filename, encoding='cp1252')
    • pd.read_csv(filename, encoding='cp1252')

Using error handlers like errors='ignore' or 'replace' should be avoided if data integrity is important, as they lead to data loss or alteration. Reading in binary mode ('rb') is appropriate only if you need the raw bytes, not the decoded text.