
Is there any library that can replace special characters with their ASCII equivalents, like:

"Cześć"

to:

"Czesc"

I can of course create a map:

{'ś':'s', 'ć': 'c'}

and use some replace function. But I don't want to hardcode all the equivalents into my program if there is already a function that does that.
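For illustration, a minimal sketch of that hand-rolled approach I'd like to avoid (the mapping and function name are only examples and cover just a few Polish letters):

CHAR_MAP = {'ś': 's', 'ć': 'c', 'ż': 'z', 'ź': 'z'}

def replace_special(text):
    return ''.join(CHAR_MAP.get(c, c) for c in text)

print(replace_special('Cześć'))  # ==> Czesc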


6 Answers

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import unicodedata

text = u'Cześć'
# NFD decomposes each accented letter into a base letter plus combining marks;
# encoding to ASCII with 'ignore' then drops the marks.
print unicodedata.normalize('NFD', text).encode('ascii', 'ignore')
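The snippet above is Python 2. A rough Python 3 equivalent of the same idea (note that .encode() returns bytes there, so decode again to get a str back):

import unicodedata

text = 'Cześć'
ascii_text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode('ascii')
print(ascii_text)  # ==> Czesc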

2 Comments

'NFKD' would give you ASCII output more often than 'NFD' would.
It doesn't work for all cases, e.g. "(VW Polo) - Zapłon Jak sprawdzić czy działa pompa wspomagania?" converts to "(VW Polo) - Zapon jak sprawdzic czy dziaa pompa wspomagania?"
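To illustrate both comments (Python 3; the sample string is my own): compatibility characters such as the 'ﬁ' ligature or the numero sign only decompose under NFKD, so NFD loses them when encoding to ASCII. Letters like 'ł', however, have no decomposition at all, which is why the Polish example above loses them under either form.

import unicodedata

s = 'ﬁlm №2'  # contains the 'fi' ligature and the numero sign
print(unicodedata.normalize('NFD', s).encode('ascii', 'ignore'))   # ==> b'lm 2'
print(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore'))  # ==> b'film No2'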

The package unidecode worked best for me:

from unidecode import unidecode
text = "Björn, Łukasz and Σωκράτης."
print(unidecode(text))
# ==> Bjorn, Lukasz and Sokrates.

You might need to install the package:

pip install unidecode

The above solution is easier and more robust than encoding (and decoding) the output of unicodedata.normalize(), as suggested by other answers.

import unicodedata

# This doesn't work as expected:
ret = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(ret)
# ==> b'Bjorn, ukasz and .'
# Besides not supporting all characters, the returned value is a
# bytes object in Python 3. To yield a str type:
ret = ret.decode('utf8')  # (not required in Python 2)

2 Comments

It translates "ß" into "ss", but "ä" into "a", not "ae".
@RobinDinse This is intentional; see the unidecode documentation for the reasoning behind it. You can always replace the three umlauts äöü yourself before passing a string to unidecode.
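A possible workaround along the lines of that comment (the helper name and mapping below are just a sketch, not part of unidecode): substitute the umlauts first, then let unidecode handle the rest.

from unidecode import unidecode

UMLAUTS = {'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue'}

def german_aware_unidecode(text):
    for src, dst in UMLAUTS.items():
        text = text.replace(src, dst)
    return unidecode(text)

print(german_aware_unidecode('Björn grüßt Ähren'))
# ==> Bjoern gruesst Aehren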

You can get most of the way by doing:

import unicodedata

def strip_accents(text):
    # Decompose with NFKD, then drop the combining marks (Unicode category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if unicodedata.category(c) != 'Mn')

Unfortunately, there exist accented Latin letters that cannot be decomposed into an ASCII letter plus combining marks; you'll have to handle them manually (a sketch that folds them into strip_accents follows the list). These include:

  • Æ → AE
  • Ð → D
  • Ø → O
  • Þ → TH
  • ß → ss
  • æ → ae
  • ð → d
  • ø → o
  • þ → th
  • Œ → OE
  • œ → oe
  • ƒ → f



Try the trans package. Looks very promising. Supports Polish.

2 Comments

This was perfect for me, and it's BSD-licensed.
For me too, it works like a charm.

I did it this way:

import unicodedata

# Keys are the two UTF-8 bytes of each Polish letter packed into one integer,
# e.g. 'ą' is 0xC4 0x85 in UTF-8, i.e. (0xC4 << 8) + 0x85 == 50309.
POLISH_CHARACTERS = {
    50309:'a',50311:'c',50329:'e',50562:'l',50564:'n',50099:'o',50587:'s',50618:'z',50620:'z',
    50308:'A',50310:'C',50328:'E',50561:'L',50563:'N',50067:'O',50586:'S',50617:'Z',50619:'Z',}

def encodePL(text):
    # Python 2: normalize to NFC, then work on the UTF-8 byte string so that
    # the two-byte keys above can be looked up.
    nrmtxt = unicodedata.normalize('NFC', text).encode('utf-8')
    i = 0
    ret_str = []
    while i < len(nrmtxt):
        if ord(nrmtxt[i]) > 128:  # first byte of a two-byte UTF-8 sequence
            fbyte = ord(nrmtxt[i])
            sbyte = ord(nrmtxt[i+1])
            lkey = (fbyte << 8) + sbyte
            ret_str.append(POLISH_CHARACTERS.get(lkey))
            i = i + 1  # skip the continuation byte
        else:  # plain ASCII byte
            ret_str.append(nrmtxt[i])
        i = i + 1
    return ''.join(ret_str)

when executed:

encodePL(u'ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ')

it will produce output like this:

'acelnoszz ACELNOSZZ'

This works fine for me - ;D



The unicodedata.normalize gimmick can best be described as half-assci. Here is a robust approach which includes a map for letters with no decomposition. Note the additional map entries in the comments.

