How to Remove Accents from Strings in Python
Removing accents (diacritics) from strings is a common task in text processing, especially for data cleaning and normalization.
This guide explores various methods for removing accents from strings in Python, including using the unidecode package and the built-in unicodedata module.
Removing Accents with the unidecode Package
The unidecode package is designed to transliterate Unicode characters into their closest ASCII equivalents. First, install the Unidecode package using pip:
pip install Unidecode
# or
pip3 install Unidecode
Then you can use the unidecode function:
from unidecode import unidecode
str_with_accents = 'ÂéüÒÑ'
str_without_accents = unidecode(str_with_accents)
print(str_without_accents) # Output: AeuON
- The unidecode() function replaces Unicode characters with their closest ASCII equivalents.
Handling Non-Translatable Characters
By default, the library silently drops any character that it cannot translate to ASCII.
from unidecode import unidecode
str_with_accents = 'ÂéüÒÑ\ue123' # Non-translatable character is present
str_without_accents = unidecode(str_with_accents)
print(str_without_accents) # Output: AeuON
Specifying Error Handling
To raise an error if unidecode encounters a character it cannot translate, set the errors parameter to 'strict':
from unidecode import unidecode, UnidecodeError
str_with_accents = 'ÂéüÒÑ\ue123'
try:
    str_without_accents = unidecode(str_with_accents, errors='strict')
except UnidecodeError as e:
    print(e.index) # Output: 5 (the position where the unknown character was encountered)
Preserving Non-Translatable Characters
To preserve unknown characters, set the errors parameter to 'preserve'. The returned string may then contain non-ASCII characters, so this only works if the result is not required to be ASCII compatible.
from unidecode import unidecode
str_with_accents = 'ÂéüÒÑ\ue123'
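str_without_accents = unidecode(
    str_with_accents,
    errors='preserve',
)
print(str_without_accents) # Output: AeuON followed by the preserved '\ue123' character
The unknown character is kept in the output unchanged. Because '\ue123' is a Private Use Area code point, most terminals and fonts display it as a placeholder glyph.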
Removing Accents with unicodedata
The built-in unicodedata module is an alternative for removing accents, and doesn't require any external libraries to be installed.
import unicodedata
def remove_accents(string):
    return ''.join(
        char for char in unicodedata.normalize('NFD', string)
        if unicodedata.category(char) != 'Mn'
    )
str_with_accents = 'ÂéüÒÑ'
print(remove_accents(str_with_accents)) # Output: AeuON
print(remove_accents('Noël, Adrián, Sørina, Zoë, Renée')) # Output: Noel, Adrian, Sørina, Zoe, Renee
- The unicodedata.normalize('NFD', string) call decomposes the Unicode characters into their base characters and combining diacritical marks.
- The generator expression then filters out characters with a category of 'Mn', which stands for Nonspacing_Mark.
- Together, normalizing and then filtering convert characters with diacritics to their base equivalents and remove the accents. Characters that have no canonical decomposition, such as ø, are left unchanged.
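To see what the normalization step produces, it can help to inspect the decomposed code points directly. The sketch below uses only the standard library. The helper names describe_decomposition and remove_accents_with_fallback, and the FALLBACKS table, are hypothetical and introduced here purely for illustration. The table is a small, incomplete example of how characters without a decomposition might be handled, not an authoritative mapping.

```python
import unicodedata

def describe_decomposition(text):
    """Return (code point, Unicode name, category) for each NFD code point."""
    decomposed = unicodedata.normalize('NFD', text)
    return [
        (f'U+{ord(char):04X}', unicodedata.name(char, 'UNKNOWN'), unicodedata.category(char))
        for char in decomposed
    ]

# 'é' splits into a base letter plus a combining mark; 'ø' does not decompose.
for row in describe_decomposition('éø'):
    print(*row)
# U+0065 LATIN SMALL LETTER E Ll
# U+0301 COMBINING ACUTE ACCENT Mn
# U+00F8 LATIN SMALL LETTER O WITH STROKE Ll

# Hypothetical, incomplete fallback table for letters that NFD leaves intact.
FALLBACKS = {'ø': 'o', 'Ø': 'O', 'ł': 'l', 'Ł': 'L', 'ß': 'ss'}

def remove_accents_with_fallback(text):
    """Strip combining marks, then apply the illustrative fallback table."""
    stripped = ''.join(
        char for char in unicodedata.normalize('NFD', text)
        if unicodedata.category(char) != 'Mn'
    )
    return ''.join(FALLBACKS.get(char, char) for char in stripped)

print(remove_accents_with_fallback('Noël, Adrián, Sørina'))  # Output: Noel, Adrian, Sorina
```

If the text contains many such characters, a transliteration package like unidecode is likely a better fit than maintaining a table by hand.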