
I would like to know if there is any way to find an ASCII character equivalent to a non-ASCII (UTF-8) character.

I've done several tests with the unidecode library, but it doesn't completely satisfy what I need.

For example, consider these characters:

import unidecode

x = 'ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶'
unidecode.unidecode(x)

Output = "i, V, G', T, f, "

I would like something like: "i, v, f, t, o, 1"

Surely there must be a way? Thanks in advance for any help!

  • "I would like something like: 'i, v, f, t, o, 1'" Okay; and why is "i, V, G', T, f, " not good enough? What is the actual rule you want to apply? If you expect that every character should have a specific mapping, with no particular pattern to it, then you will need to have some kind of hard-coded mapping (whether you make it yourself or use a library like unidecode) - there is no getting around that. It's only possible to write an algorithm for things that have an algorithmic approach. Commented Feb 26, 2022 at 23:49
  • Also: did you try reading the documentation for unidecode, to see if there are any configuration options that could get you results more like what you want? Commented Feb 26, 2022 at 23:49
  • Ғ is not related to F at all. It's mainly used to represent voiced velar and uvular fricatives (think French "r") in Turkic languages that use the Cyrillic script. Just because two glyphs look similar does not mean there's a mapping that represents the similarity. Commented Feb 27, 2022 at 0:08
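The point in the last comment is easy to verify from Python itself: `unicodedata.name()` shows what each glyph officially is, which makes it clear that the resemblance to ASCII letters is often coincidental.

```python
import unicodedata

# Look up the official Unicode name of each glyph before deciding
# on a mapping; visual similarity is often coincidental.
for ch in 'ⁱᴠҒƬѳ❶':
    print(f'U+{ord(ch):04X} {unicodedata.name(ch)}')
# e.g. U+0492 is CYRILLIC CAPITAL LETTER GHE WITH STROKE,
# and U+2776 is DINGBAT NEGATIVE CIRCLED DIGIT ONE.
```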

1 Answer


As others have mentioned in the comments, two characters are not necessarily related just because they look alike. Most similar examples online focus on removing accents and use the standard-library module unicodedata, which supports the standard Unicode normalization forms such as NFKD (NFKD explained here).

More Common unicodedata Approach

import unicodedata
str_unicode = u"ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶"

# 'replace': any character that can't be encoded is replaced with '?'
print(unicodedata.normalize('NFKD', str_unicode).encode('ascii', 'replace'))
# 'ignore': unencodable characters are silently dropped
print(unicodedata.normalize('NFKD', str_unicode).encode('ascii', 'ignore'))

unicodedata Output

b'i, ?, ?, ?, ?, ?'
b'i, , , , , '
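A quick way to see why NFKD only rescues the first character: `unicodedata.decomposition()` returns an empty string for glyphs that have no compatibility decomposition, so `encode()` has nothing ASCII left to keep.

```python
import unicodedata

# 'ⁱ' carries a compatibility decomposition to plain 'i'; the
# look-alike Cyrillic/dingbat characters decompose to nothing,
# which is why they come out as '?' or get dropped above.
print(unicodedata.decomposition('ⁱ'))   # '<super> 0069'
print(unicodedata.decomposition('Ғ'))   # '' (no decomposition)
print(unicodedata.decomposition('❶'))   # '' (no decomposition)
```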

unidecode Map with Translate

The unidecode library seems closer for your specific example. I think you will have to augment it with a translate call, though, to clean up the characters the library doesn't map (or maps in an unwanted way).

I added a second example with a character that can't be mapped: the paragraph mark "¶", mapped to "P" for reference.

import unidecode

#Script
str_unicode = u"ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶, ¶"
dict_mapping = str.maketrans("❶¶","1P")

str_unidecode = unidecode.unidecode(str_unicode)
str_unidecode_translated = unidecode.unidecode(str_unicode.translate(dict_mapping))

print(str_unidecode)
print(str_unidecode_translated)
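If pulling in unidecode is not an option, here is a stdlib-only sketch of the same idea: a hand-made translate table first, then NFKD for the rest. The table entries are my own guesses at the look-alike mapping you want; adjust them to taste.

```python
import unicodedata

# Hypothetical look-alike table: there is no algorithmic rule linking
# these glyphs to ASCII, so the mapping has to be hand-made.
LOOKALIKES = str.maketrans({'ᴠ': 'v', 'Ғ': 'f', 'Ƭ': 't', 'ѳ': 'o', '❶': '1'})

def to_ascii(text):
    # Manual table first; NFKD then handles anything with a real
    # compatibility decomposition (e.g. 'ⁱ' -> 'i').
    text = unicodedata.normalize('NFKD', text.translate(LOOKALIKES))
    return text.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶'))  # i, v, f, t, o, 1
```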

1 Comment

Thank you so much Stephan, your answer helped me a lot and I already have a direction.
