How to Remove Non-ASCII Characters from Strings in Python
This guide explains how to remove non-ASCII characters from a string in Python. Non-ASCII characters are those whose code points fall outside the ASCII range (0-127). We'll cover three methods: encoding with str.encode() and error handling, filtering against string.printable, and filtering by code point with ord().
Removing Non-ASCII Characters with encode() and decode() (Recommended)
The most straightforward and generally reliable method is to encode the string to ASCII, ignoring any characters that can't be encoded, and then decode it back to a string:
def remove_non_ascii(string):
    return string.encode('ascii', errors='ignore').decode('ascii')
print(remove_non_ascii('a€bñcá')) # Output: abc
print(remove_non_ascii('a_b^0')) # Output: a_b^0
- string.encode('ascii', errors='ignore'): This attempts to encode the string using the ASCII encoding. The errors='ignore' part is crucial: it tells the encoder to skip any characters that can't be represented in ASCII (instead of raising an error). The result is a bytes object.
- .decode('ascii'): This decodes the resulting bytes object back into a string, using ASCII. Since we've already removed the non-ASCII characters during encoding, this decoding step is safe.
This method is efficient and clearly expresses the intent: remove anything that's not ASCII.
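To make the two steps concrete, the short sketch below separates the intermediate bytes value from the final string. It also contrasts errors='ignore' with errors='replace', another standard error handler, which substitutes '?' instead of dropping the character. The variable names are illustrative only.

```python
text = 'a€bñcá'

# Step 1: encode to ASCII, silently dropping characters that cannot be encoded.
encoded = text.encode('ascii', errors='ignore')
print(encoded)                  # b'abc'

# Step 2: decode the ASCII bytes back into a str.
print(encoded.decode('ascii'))  # abc

# For comparison, errors='replace' puts '?' in place of each unencodable character.
replaced = text.encode('ascii', errors='replace').decode('ascii')
print(replaced)                 # a?b?c?

# With the default error handler ('strict'), encoding raises UnicodeEncodeError.
try:
    text.encode('ascii')
    raised = False
except UnicodeEncodeError:
    raised = True
print(raised)                   # True
```

Choosing 'ignore' loses the information that a character was there at all, so 'replace' may be the better fit when the position of removed characters matters.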
Removing Non-ASCII Characters using string.printable
The string.printable constant (from the built-in string module) contains all printable ASCII characters. We can use this to filter a string, keeping only those characters that are in this set:
import string
def remove_non_ascii(a_str):
    ascii_chars = set(string.printable)  # A set makes each membership test O(1)
    return ''.join(
        filter(lambda x: x in ascii_chars, a_str)
    )
print(remove_non_ascii('a€bñcá')) # Output: abc
print(remove_non_ascii('a_b^0')) # Output: a_b^0
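- Using string.printable keeps only printable ASCII: letters, digits, punctuation, and whitespace. This is stricter than removing non-ASCII characters alone, because non-printable ASCII control characters such as '\x00' are dropped as well.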
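One consequence of filtering against string.printable is easy to miss: control characters that are valid ASCII but not printable are removed too. The sketch below illustrates this with a sample string containing a NUL and a BEL character; the helper name keep_printable_ascii is hypothetical and used only for this example.

```python
import string

# Hypothetical helper name, used only for this illustration.
def keep_printable_ascii(text):
    allowed = set(string.printable)
    return ''.join(ch for ch in text if ch in allowed)

# '\t' is whitespace and therefore printable; '\x00' and '\x07' are
# ASCII control characters that string.printable does not contain.
sample = 'tab\there\x00 bell\x07 café'
result = keep_printable_ascii(sample)
print(repr(result))  # 'tab\there bell caf'
```

If control characters must be preserved, one of the other methods in this guide is likely a better match.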
Removing Non-ASCII Characters using ord()
You can also iterate over the string and use the built-in ord() function to get each character's Unicode code point, keeping only the characters whose code point is below 128:
def remove_non_ascii(string):
    return ''.join(char for char in string if ord(char) < 128)
print(remove_non_ascii('a€bñcá')) # Output: abc
print(remove_non_ascii('a_b^0')) # Output: a_b^0
- This uses a generator expression that iterates over the characters and yields only those whose Unicode code point is less than 128; ''.join() then assembles them into the result.
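The three approaches give the same result on the examples above, but they are not interchangeable on every input. As a rough comparison, the sketch below runs all three on a string that contains an ASCII control character; the function names via_encode, via_printable, and via_ord are illustrative and not part of any library.

```python
import string

def via_encode(text):
    return text.encode('ascii', errors='ignore').decode('ascii')

def via_printable(text):
    allowed = set(string.printable)
    return ''.join(ch for ch in text if ch in allowed)

def via_ord(text):
    return ''.join(ch for ch in text if ord(ch) < 128)

# '\x00' is ASCII (code point 0) but not printable.
sample = 'naïve\x00 café'
print(repr(via_encode(sample)))     # 'nave\x00 caf'
print(repr(via_printable(sample)))  # 'nave caf'
print(repr(via_ord(sample)))        # 'nave\x00 caf'
```

The encode() and ord() versions keep every ASCII character, including control characters, while the string.printable version also strips anything non-printable.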