How to Remove Accents from Strings in Python

Removing accents (diacritics) from strings is a common task in text processing, especially for data cleaning and normalization.

This guide covers two ways to remove accents from strings in Python: the third-party unidecode package and the built-in unicodedata module.

Removing Accents with the unidecode Package

The unidecode package transliterates Unicode characters into their closest ASCII equivalents. Install it first using pip (the package is published as Unidecode):

pip install Unidecode
#or
pip3 install Unidecode

Then you can use the unidecode function:

from unidecode import unidecode

str_with_accents = 'ÂéüÒÑ'
str_without_accents = unidecode(str_with_accents)
print(str_without_accents) # Output: AeuON
  • The unidecode() function replaces Unicode characters with their closest ASCII equivalents.

Handling Non-Translatable Characters

By default (errors='ignore'), the library silently drops any character that it cannot translate to ASCII.

from unidecode import unidecode

str_with_accents = 'ÂéüÒÑ\ue123' # Non-translatable character is present
str_without_accents = unidecode(str_with_accents)
print(str_without_accents) # Output: AeuON

Specifying Error Handling

To raise an error when unidecode encounters a character it cannot translate, set the errors parameter to 'strict':

from unidecode import unidecode, UnidecodeError

str_with_accents = 'ÂéüÒÑ\ue123'

try:
    str_without_accents = unidecode(str_with_accents, errors='strict')
except UnidecodeError as e:
    print(e.index) # Output: 5 (the position where the unknown character was encountered)

Preserving Non-Translatable Characters

To keep unknown characters in the output, set the errors parameter to 'preserve'. Note that the returned string may then contain non-ASCII characters, so use this mode only when the result does not have to be pure ASCII.

from unidecode import unidecode

str_with_accents = 'ÂéüÒÑ\ue123'
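str_without_accents = unidecode(
    str_with_accents,
    errors='preserve',
)
print(str_without_accents) # Output: AeuON followed by the preserved '\ue123' character

With 'preserve', the untranslatable character is passed through to the output unchanged. To substitute a placeholder instead, set errors to 'replace'; the replace_str parameter (default '?') controls which placeholder is used.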

Removing Accents with unicodedata

The built-in unicodedata module is an alternative for removing accents, and doesn't require any external libraries to be installed.

import unicodedata

def remove_accents(string):
    return ''.join(
        char for char in unicodedata.normalize('NFD', string)
        if unicodedata.category(char) != 'Mn'
    )

str_with_accents = 'ÂéüÒÑ'
print(remove_accents(str_with_accents)) # Output: AeuON
print(remove_accents('Noël, Adrián, Sørina, Zoë, Renée')) # Output: Noel, Adrian, Sørina, Zoe, Renee
  • unicodedata.normalize('NFD', string) decomposes each accented character into its base character followed by one or more combining diacritical marks.
  • The generator expression then filters out every character whose category is 'Mn', which stands for Nonspacing_Mark.
  • Together, the two steps convert characters with diacritics to their base equivalents and remove the accents.
  • Letters such as ø have no canonical decomposition, so NFD leaves them unchanged. This is why Sørina keeps its ø in the output above.
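To make the decomposition step concrete, here is a minimal sketch that lists the code points and categories NFD produces. It also shows one possible way to handle letters that NFD does not split. The helper names describe_decomposition and remove_accents_with_fallback, and the EXTRA_REPLACEMENTS table, are hypothetical and not part of unicodedata; the table covers only a few letters and would need extending for real data.

```python
import unicodedata

def describe_decomposition(text):
    """Return (character, Unicode category) pairs for the NFD form of text."""
    decomposed = unicodedata.normalize('NFD', text)
    return [(char, unicodedata.category(char)) for char in decomposed]

def remove_accents(text):
    """Same approach as above: decompose, then drop nonspacing marks."""
    return ''.join(
        char for char in unicodedata.normalize('NFD', text)
        if unicodedata.category(char) != 'Mn'
    )

# U+00E9 (é) splits into a base letter ('Ll') and U+0301 COMBINING ACUTE ACCENT ('Mn').
pairs = describe_decomposition('\u00e9')
print([(f'U+{ord(char):04X}', category) for char, category in pairs])
# [('U+0065', 'Ll'), ('U+0301', 'Mn')]

# U+00F8 (ø) has no canonical decomposition, so it stays a single letter.
print([category for _, category in describe_decomposition('\u00f8')])  # ['Ll']

# Hypothetical fallback table for letters that NFD cannot split (an assumption,
# not an exhaustive list).
EXTRA_REPLACEMENTS = str.maketrans({'ø': 'o', 'Ø': 'O', 'ł': 'l', 'Ł': 'L'})

def remove_accents_with_fallback(text):
    """Strip combining marks first, then apply the manual replacement table."""
    return remove_accents(text).translate(EXTRA_REPLACEMENTS)

print(remove_accents_with_fallback('Sørina'))  # Sorina
```

A manual table keeps the solution dependency-free, but it only handles the letters you list. If broad coverage matters more, the unidecode approach described earlier is likely the simpler choice.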