
I'm quite certain this has been asked and answered before; however, I cannot find an answer for my specific use case.

I've got this file with accented characters in it:

>  ~ cat file
ë
ê
Ý,text
Ò
É

How would I convert them to their respective non-accented letters? So the outcome would be something along the lines of:

> ~ convert file out.txt
> ~ cat out.txt
e
e
Y,text
O
E

Note that the actual file itself contains more characters.

  • Those look like accented letters to me: en.wikipedia.org/wiki/Diacritic Of course, if you need to change some other symbols to letters too, by some rule, then that's different. Commented Jan 29, 2021 at 15:27
  • Would you change ü to ue (German equivalents) or plain u? Even in English, how would you expect to map æ? Commented Jan 29, 2021 at 16:11
  • Your first example is not an accent but a diaeresis. Do you want to convert those, too? Your question is self-contradictory in that regard. Commented Jan 30, 2021 at 12:53

3 Answers


You can try iconv with the //TRANSLIT (transliteration) option.

For example, given

$ cat file
ë
ê
Ý,text
Ò
É

then

$ iconv -t ASCII//TRANSLIT file
e
e
Y,text
O
E
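
The result of //TRANSLIT depends on the iconv implementation, so the same command can give different output on different systems. As a hedged, implementation-independent alternative (not part of the original answer), here is a minimal Python sketch that decomposes each character and discards whatever is not ASCII. The function name `to_ascii` is made up for illustration.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented letters, then drop every non-ASCII code point."""
    # NFKD splits 'ë' into 'e' followed by U+0308 COMBINING DIAERESIS.
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" discards the combining marks, and also any character
    # that has no ASCII decomposition at all (for example 'æ').
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = "ë\nê\nÝ,text\nÒ\nÉ\n"
print(to_ascii(sample), end="")
```

Note the trade-off: characters without a decomposition are silently removed rather than replaced, so check the output if the file contains letters such as æ or ß.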
  • Doesn't work on my Mac, but works on CentOS 8. Thanks! Commented Jan 29, 2021 at 15:27
  • @KevinC I wonder if it would work on the mac if you added an appropriate -f value to specify the input encoding? Perhaps obtained using the file command on your input file? Commented Jan 29, 2021 at 15:32
  • Do you have the same iconv on both systems? Commented Jan 29, 2021 at 15:42
  • Very useful answer, thanks! But I am disappointed when I look at the iconv(1) man page: it says nothing about ascii//TRANSLIT, and iconv --list does not mention TRANSLIT. How can one find all these options for encodings? Commented Mar 6, 2022 at 9:31
  • I don't get the same behavior: echo 'ÉÀîéàç' | iconv -f UTF-8 -t ASCII//TRANSLIT returns 'E`A^i'e`ac. My problem here is that I don't want the quotes in the output (I know I could pipe it through tr -d, but that would also remove the actual quotes from the original text). Commented Mar 7, 2023 at 10:02

The GNU recode package is very useful for converting between character encodings, and its "flat" encoding handles exactly this case:

recode -f utf8..flat <textin.txt >flattext.out

Using Raku (formerly known as Perl_6)

Raku performs NFC normalization by default (on everything except file names). To remove accents you first need to decompose the characters, which means using either the NFD or the NFKD method:

~$ echo 'été à la plage' | \
   raku -ne 'NFKD($_).map(*.chr.subst(:global, /\c[COMBINING ACUTE ACCENT]/, "")).join.put ;'
ete à la plage

...and...

~$ echo 'été à la plage' | \
   raku -ne 'NFKD($_).map(*.chr.subst(:global, /\c[COMBINING GRAVE ACCENT]/, "")).join.put ;'
été a la plage

...all together...

~$ echo 'été à la plage' | \
   raku -ne 'NFKD($_).map(*.chr.subst(:global, /\c[COMBINING ACUTE ACCENT] | \c[COMBINING GRAVE ACCENT]/, "")).join.put ;'
ete a la plage
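
Listing each combining accent by name works, but it grows with every new diacritic. A more general variant of the same idea is to decompose and then drop every code point in the Unicode category Mn (nonspacing mark). The following is a minimal sketch in Python rather than Raku, offered as an illustration and not as the answer's own method; `strip_marks` is a hypothetical name.

```python
import unicodedata


def strip_marks(text):
    """Remove all combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    # Keep base characters; skip nonspacing marks such as U+0301 and U+0300.
    kept = (ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose so that any remaining sequences are back in NFC form.
    return unicodedata.normalize("NFC", "".join(kept))


print(strip_marks("été à la plage"))
```

Unlike an ASCII-only conversion, this leaves characters that have no decomposition (such as æ) unchanged.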

Perhaps the issue is that you need to know which accents appear in your text. You can compare NFC normalization with NFD decomposition below:

...NFC():

~$ echo 'ëêÝÒÉ' | \
   raku -ne 'NFC($_).map( *.uniname).join(" | ").put for .comb;'
LATIN SMALL LETTER E WITH DIAERESIS
LATIN SMALL LETTER E WITH CIRCUMFLEX
LATIN CAPITAL LETTER Y WITH ACUTE
LATIN CAPITAL LETTER O WITH GRAVE
LATIN CAPITAL LETTER E WITH ACUTE

...NFD():

~$ echo 'ëêÝÒÉ' | \
   raku -ne 'NFD($_).map( *.uniname).join(" | ").put for .comb;'
LATIN SMALL LETTER E | COMBINING DIAERESIS
LATIN SMALL LETTER E | COMBINING CIRCUMFLEX ACCENT
LATIN CAPITAL LETTER Y | COMBINING ACUTE ACCENT
LATIN CAPITAL LETTER O | COMBINING GRAVE ACCENT
LATIN CAPITAL LETTER E | COMBINING ACUTE ACCENT
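
For readers without Raku installed, the same NFC versus NFD comparison can be sketched with Python's standard `unicodedata` module. This is an illustrative equivalent under that assumption, and the helper name `names` is invented here.

```python
import unicodedata


def names(text, form):
    """Return the Unicode name of each code point of text in the given form."""
    normalized = unicodedata.normalize(form, text)
    # The second argument is a fallback for code points that have no name.
    return [unicodedata.name(ch, "<unnamed>") for ch in normalized]


for ch in "ëêÝÒÉ":
    composed = " | ".join(names(ch, "NFC"))
    decomposed = " | ".join(names(ch, "NFD"))
    print(composed, "->", decomposed)
```

Each line shows the single precomposed code point on the left and its base letter plus combining mark on the right.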

https://docs.raku.org/language/unicode
https://docs.raku.org/type/Uni
https://raku.org
