How to Solve "UnicodeDecodeError: 'charmap' codec can't decode byte" in Python
The UnicodeDecodeError: 'charmap' codec can't decode byte ... error in Python occurs when you try to read or decode a file (or byte string) using the wrong character encoding. This typically happens on Windows, where the default 'charmap' codec (often cp1252) doesn't match the file's actual encoding (often UTF-8).
This guide explains how to diagnose and fix this error.
Understanding the Error
The error message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1: character maps to <undefined>
means:
- 'charmap': Python is using a character-map codec, usually because it fell back to the system's default encoding, which on Windows is often a legacy code page like cp1252.
- can't decode byte 0x9d: The specific byte (shown in hexadecimal), 0x9d, cannot be mapped to a character in that encoding.
- in position 1: The problematic byte is at this offset (counting from 0) within the input.
- character maps to <undefined>: The codec defines no character for that byte value.
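The failure can be reproduced without any file. The following is a minimal sketch, assuming the input is the UTF-8 encoding of a right double quotation mark (U+201D), whose last byte is 0x9d; the variable names are illustrative only:

```python
# UTF-8 bytes for the right double quotation mark (U+201D): e2 80 9d
data = "\u201d".encode("utf-8")

try:
    # cp1252 has no character assigned to byte 0x9d, so decoding fails
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    failed_byte = exc.object[exc.start]  # the byte Python could not map
    position = exc.start                 # zero-based offset of that byte
    print(f"byte {failed_byte:#x} at position {position}: {exc.reason}")

# Decoding the same bytes with the encoding they were written in succeeds
text = data.decode("utf-8")
```

The exception object carries the same details as the message (the offending byte, its offset, and the reason), which can be handy when logging the failure.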
Solution 1: Specify the Correct Encoding (UTF-8)
The vast majority of text files today are encoded using UTF-8. The most common and reliable solution is to explicitly specify UTF-8 encoding when opening the file:
with open('example.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
print(lines)
encoding='utf-8': This tells Python to use UTF-8 to decode the file's contents.
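When encoding is omitted, open() falls back to a locale-dependent default, which is why the same script can work on one machine and fail on another. Here is a small sketch, assuming a throwaway file created in a temporary directory (the file name and sample text are made up for illustration):

```python
import locale
import tempfile
from pathlib import Path

# The encoding open() uses when none is given; varies by platform and settings
print("Default text encoding:", locale.getpreferredencoding(False))

sample = "na\u00efve caf\u00e9 \u201cquoted\u201d"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"  # hypothetical file for the demo

    # Write and read with the same explicit encoding so the result
    # does not depend on the machine's default
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

print(round_tripped == sample)
```

Passing encoding on both the write and the read side keeps the script's behavior the same across operating systems.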
Solution 2: Use utf-8-sig for Files with a BOM
Some files (especially those created on Windows) might have a Byte Order Mark (BOM) at the beginning. If you see \ufeff at the start of your output, use utf-8-sig:
with open('example.txt', 'r', encoding='utf-8-sig') as f:
    lines = f.readlines()
print(lines)
utf-8-sig is a variant of UTF-8 that specifically handles the BOM: it skips a leading BOM when reading, and behaves like plain utf-8 when the file has none.
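The difference between the two codecs is easy to see on a file that starts with a BOM. This is a minimal sketch, assuming a temporary file written by the example itself (the file name is hypothetical):

```python
import codecs
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "with_bom.txt"  # hypothetical file for the demo
    # Write the UTF-8 BOM (bytes ef bb bf) followed by the text
    path.write_bytes(codecs.BOM_UTF8 + "hello".encode("utf-8"))

    plain = path.read_text(encoding="utf-8")        # BOM kept as '\ufeff'
    cleaned = path.read_text(encoding="utf-8-sig")  # BOM stripped

print(repr(plain))
print(repr(cleaned))
```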
Solution 3: Handling Unknown Encodings (with errors='ignore' or chardet)
If you don't know the file's encoding, you have two main options:
errors='ignore'
You can tell Python to ignore decoding errors. This will result in data loss, but it will prevent the program from crashing.
with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()
print(lines)  # Undecodable bytes are silently dropped, so data may be missing
errors='ignore': This tells Python to skip any bytes it can't decode. Treat it as a last resort, because the skipped data is lost.
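errors='ignore' is one of several standard error handlers; 'replace' and 'backslashreplace' at least leave a visible trace where bytes could not be decoded. A short sketch on an in-memory byte string, assuming Latin-1 encoded input that is wrongly decoded as UTF-8:

```python
raw = "caf\u00e9".encode("latin-1")  # b'caf\xe9', which is not valid UTF-8

# The undecodable byte is dropped without any trace
ignored = raw.decode("utf-8", errors="ignore")
# The undecodable byte becomes U+FFFD, the replacement character
replaced = raw.decode("utf-8", errors="replace")
# The undecodable byte is kept as the literal text \xe9
escaped = raw.decode("utf-8", errors="backslashreplace")

print(repr(ignored), repr(replaced), repr(escaped))
```

The same errors argument is accepted by open(), so any of these handlers can be used when reading a file.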
The chardet Library
The chardet library attempts to detect the encoding of a file. This is not foolproof, but it's often helpful.
pip install chardet
import chardet

with open('example.txt', 'rb') as f:  # Open in binary mode for chardet
    rawdata = f.read()

result = chardet.detect(rawdata)
encoding = result['encoding']  # Can be None if chardet cannot make a guess
confidence = result['confidence']
print(f"Detected encoding: {encoding} (confidence: {confidence})")

with open('example.txt', 'r', encoding=encoding) as f:  # Open again with the detected encoding
    lines = f.readlines()
print(lines)
chardet.detect() analyzes the raw bytes and returns a dictionary with its best guess for the encoding and a confidence level.
- Open the file in binary read mode ('rb') when using chardet.
- Use the detected encoding when you re-open the file in text mode ('r').
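Because the detection is only a guess, it can help to check the result before trusting it. The helper below is a hypothetical sketch and not part of chardet: it takes a dictionary shaped like the one chardet.detect() returns and falls back to a default when no encoding was detected or the confidence is low. The 0.5 threshold and the 'utf-8' fallback are arbitrary assumptions.

```python
def pick_encoding(result, min_confidence=0.5, fallback="utf-8"):
    """Return the detected encoding, or `fallback` if the guess is weak.

    `result` is expected to look like the dict from chardet.detect(),
    i.e. it has 'encoding' (possibly None) and 'confidence' keys.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# Hand-written dictionaries standing in for chardet.detect() output
print(pick_encoding({"encoding": "Windows-1252", "confidence": 0.73}))
print(pick_encoding({"encoding": None, "confidence": 0.0}))
```

In real use the argument would be the value returned by chardet.detect(rawdata), and the result would be passed to open() as the encoding.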
Solution 4: Using Other Encodings (If You Know It)
If you know the file is encoded with a specific encoding (e.g., 'latin-1', 'cp437', 'utf-16'), use that encoding directly:
with open('example.txt', 'r', encoding='latin-1') as f:
    lines = f.readlines()
print(lines)

with open('example.txt', 'r', encoding='cp437') as f:  # Example with cp437
    lines = f.readlines()
print(lines)
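If there are only a few plausible encodings, they can be tried in order. The function below is an illustrative sketch (its name and the default candidate list are assumptions, not a standard API). Note that latin-1 maps every possible byte to a character, so it never raises and should come last; a successful decode also does not prove the guess was right.

```python
def decode_with_fallbacks(data, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Try each encoding in order; return (text, encoding_used)."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes written as cp1252 are rejected by UTF-8 and accepted by cp1252
text, used = decode_with_fallbacks("caf\u00e9".encode("cp1252"))
print(used, text)
```

For a file, the bytes would come from opening it in 'rb' mode and calling read().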
Finding the File's Encoding (If Unknown)
If you're unsure of the encoding, here are some ways to try and determine it:
Using the file Command (Linux/macOS)
The file command (on Linux/macOS, and available in Git Bash on Windows) can often guess the encoding:
file example.txt
- Check the result to find which encoding to use.
Using Notepad (Windows)
On Windows, Notepad can sometimes show the encoding:
- Open the file in Notepad.
- Go to "File" -> "Save As...".
- Look at the Encoding dropdown near the bottom of the Save As dialog. It will show the currently detected encoding.
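A quick check that can also be done from Python is to look for a BOM in the first few bytes. This only helps for files that actually start with one, and the function name and returned codec names below are choices made for this sketch:

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head):
    """Return an encoding name if `head` starts with a known BOM, else None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(b"hello"))
```

For a real file, head would be the first four bytes read in 'rb' mode. A result of None means no BOM was found, not that the file is in any particular encoding.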
Conclusion
The UnicodeDecodeError: 'charmap' error almost always indicates an encoding mismatch.
- The best solution is to explicitly specify the correct encoding (usually UTF-8) when opening the file using encoding='utf-8' or encoding='utf-8-sig'.
- If the encoding is unknown, try using the chardet library to detect it, or try common encodings like 'latin-1'.
- As a last resort, you can ignore errors with errors='ignore', but be aware that this will result in data loss.
- Always prioritize finding the correct encoding.