116

I'm attempting to remove accents from characters in PHP string as the first step to making the string usable in a URL.

I'm using the following code:

$input = "Fóø Bår";

setlocale(LC_ALL, "en_US.utf8");
$output = iconv("utf-8", "ascii//TRANSLIT", $input);

print($output);

The output I would expect would be something like this:

F'oo Bar

However, instead of the accented characters being transliterated they are replaced with question marks:

F?? B?r

Everything I can find online indicates that setting the locale will fix this problem, however I'm already doing this. I've already checked the following details:

  1. The locale I am setting is supported by the server (included in the list produced by locale -a)
  2. The source and target encodings (UTF-8 and ASCII) are supported by the server's version of iconv (included in the list produced by iconv -l)
  3. The input string is UTF-8 encoded (verified using PHP's mb_check_encoding function, as suggested in the answer by mercator)
  4. The call to setlocale is successful (it returns 'en_US.utf8' rather than FALSE)

The cause of the problem:

The server is using the wrong implementation of iconv. It has the glibc version instead of the required libiconv version.

Note that the iconv function on some systems may not work as you expect. In such case, it'd be a good idea to install the GNU libiconv library. It will most likely end up with more consistent results.
PHP manual's introduction to iconv

Details about the iconv implementation that is used by PHP are included in the output of the phpinfo function.

(I'm not able to re-compile PHP with the correct iconv library on the server I'm working with for this project so the answer I've accepted below is the one that was most useful for removing accents without iconv support.)

6
  • Note that if you're running this on a string that can't be ASCII, this will have dramatic effects. For example a Russian string won't work with ASCII. Commented Oct 4, 2011 at 12:01
  • 1
    I have the glibc version install and setting the locale works for me. Commented Apr 2, 2012 at 19:47
  • So you had to compile it? I can't find a deb package anywhere. Exactly coz of the reason that "IT'S" in glibc already :-( Commented May 10, 2013 at 19:04
  • 1
    This guy suggests a clever solution using htmlentities(). Sorry it's in French, but you just need the small functions at the bottom of the doc: weirdog.com/blog/php/… Really clever :) Commented Dec 10, 2013 at 20:35
  • 2
    For how want to see the code of which @JFG speak about, you can also found it here: github.com/ICanBoogie/Common/blob/… Commented Dec 3, 2014 at 17:46

31 Answers 31

131

What about the WordPress implementation?

function remove_accents($string) {
    if ( !preg_match('/[\x80-\xff]/', $string) )
        return $string;

    $chars = array(
    // Decompositions for Latin-1 Supplement
    chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
    chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
    chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
    chr(195).chr(135) => 'C', chr(195).chr(136) => 'E',
    chr(195).chr(137) => 'E', chr(195).chr(138) => 'E',
    chr(195).chr(139) => 'E', chr(195).chr(140) => 'I',
    chr(195).chr(141) => 'I', chr(195).chr(142) => 'I',
    chr(195).chr(143) => 'I', chr(195).chr(145) => 'N',
    chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
    chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
    chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
    chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
    chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
    chr(195).chr(159) => 's', chr(195).chr(160) => 'a',
    chr(195).chr(161) => 'a', chr(195).chr(162) => 'a',
    chr(195).chr(163) => 'a', chr(195).chr(164) => 'a',
    chr(195).chr(165) => 'a', chr(195).chr(167) => 'c',
    chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
    chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
    chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
    chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
    chr(195).chr(177) => 'n', chr(195).chr(178) => 'o',
    chr(195).chr(179) => 'o', chr(195).chr(180) => 'o',
    chr(195).chr(181) => 'o', chr(195).chr(182) => 'o',
    chr(195).chr(182) => 'o', chr(195).chr(185) => 'u',
    chr(195).chr(186) => 'u', chr(195).chr(187) => 'u',
    chr(195).chr(188) => 'u', chr(195).chr(189) => 'y',
    chr(195).chr(191) => 'y',
    // Decompositions for Latin Extended-A
    chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
    chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
    chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
    chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
    chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
    chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
    chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
    chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
    chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
    chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
    chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
    chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
    chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
    chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
    chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
    chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
    chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
    chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
    chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
    chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
    chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
    chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
    chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
    chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
    chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
    chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
    chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
    chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
    chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
    chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
    chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
    chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
    chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
    chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
    chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
    chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
    chr(197).chr(136) => 'n', chr(197).chr(137) => 'N',
    chr(197).chr(138) => 'n', chr(197).chr(139) => 'N',
    chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
    chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
    chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
    chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
    chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
    chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
    chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
    chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
    chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
    chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
    chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
    chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
    chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
    chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
    chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
    chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
    chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
    chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
    chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
    chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
    chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
    chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
    chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
    chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
    chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
    chr(197).chr(190) => 'z', chr(197).chr(191) => 's'
    );

    $string = strtr($string, $chars);

    return $string;
}

To understand what this function does, check the conversion table:

À => A
Á => A
 => A
à => A
Ä => A
Å => A
Ç => C
È => E
É => E
Ê => E
Ë => E
Ì => I
Í => I
Î => I
Ï => I
Ñ => N
Ò => O
Ó => O
Ô => O
Õ => O
Ö => O
Ù => U
Ú => U
Û => U
Ü => U
Ý => Y
ß => s
à => a
á => a
â => a
ã => a
ä => a
å => a
ç => c
è => e
é => e
ê => e
ë => e
ì => i
í => i
î => i
ï => i
ñ => n
ò => o
ó => o
ô => o
õ => o
ö => o
ù => u
ú => u
û => u
ü => u
ý => y
ÿ => y
Ā => A
ā => a
Ă => A
ă => a
Ą => A
ą => a
Ć => C
ć => c
Ĉ => C
ĉ => c
Ċ => C
ċ => c
Č => C
č => c
Ď => D
ď => d
Đ => D
đ => d
Ē => E
ē => e
Ĕ => E
ĕ => e
Ė => E
ė => e
Ę => E
ę => e
Ě => E
ě => e
Ĝ => G
ĝ => g
Ğ => G
ğ => g
Ġ => G
ġ => g
Ģ => G
ģ => g
Ĥ => H
ĥ => h
Ħ => H
ħ => h
Ĩ => I
ĩ => i
Ī => I
ī => i
Ĭ => I
ĭ => i
Į => I
į => i
İ => I
ı => i
IJ => IJ
ij => ij
Ĵ => J
ĵ => j
Ķ => K
ķ => k
ĸ => k
Ĺ => L
ĺ => l
Ļ => L
ļ => l
Ľ => L
ľ => l
Ŀ => L
ŀ => l
Ł => L
ł => l
Ń => N
ń => n
Ņ => N
ņ => n
Ň => N
ň => n
ʼn => N
Ŋ => n
ŋ => N
Ō => O
ō => o
Ŏ => O
ŏ => o
Ő => O
ő => o
Œ => OE
œ => oe
Ŕ => R
ŕ => r
Ŗ => R
ŗ => r
Ř => R
ř => r
Ś => S
ś => s
Ŝ => S
ŝ => s
Ş => S
ş => s
Š => S
š => s
Ţ => T
ţ => t
Ť => T
ť => t
Ŧ => T
ŧ => t
Ũ => U
ũ => u
Ū => U
ū => u
Ŭ => U
ŭ => u
Ů => U
ů => u
Ű => U
ű => u
Ų => U
ų => u
Ŵ => W
ŵ => w
Ŷ => Y
ŷ => y
Ÿ => Y
Ź => Z
ź => z
Ż => Z
ż => z
Ž => Z
ž => z
ſ => s

You can generate the conversion table yourself by simply iterating over the $chars array of the function:

foreach($chars as $k=>$v) {
   printf("%s -> %s", $k, $v);
}
Sign up to request clarification or add additional context in comments.

12 Comments

This should have been the accepted answer, since it was implemented in a safer way (using chr() function) instead of hard-coding accented characters, which might get overwritten in some text-editors.
Note that the new implementation is now located here : core.trac.wordpress.org/browser/tags/4.1/src/wp-includes/… . The code requires other functions provided par wordpress core, and can not be copy/paste without some (light) work.
Working as a charm, THANK YOU!
It should be noted that WordPress is GPLv2 licensed.
|
64

UTF-8 friendly version of the simple function posted above by Gino:

function stripAccents($str) {
    return strtr(utf8_decode($str), utf8_decode('àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'), 'aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
}

Had to come to this because my php document was UTF-8 encoded.

Hope it helps.

5 Comments

Certain characters in UTF8 do not work properly for me using this function. I believe that's due to utf8_decode(), which converts from UTF8 to ISO-8859-1.
@trevor-gehman: strtr() only works on single-byte characters, hence those in Unicode Latin-1 Supplement.
Strtr works fine for replacing multi byte UTF8 characters, but you need to use the variant where you supply an associative array as the second argument. Then you can have a multi byte character as the key or value in any position of that array. Make sure your text editor is in UTF8 mode or encode using "\xc2\x81" type syntax.
this is a very incomplete implementation, see John R's reply
Note that utf8_decode is deprecated since PHP 8.2.0, and should not be used anymore.
56

if you have http://php.net/manual/en/book.intl.php available, this solved your problem

$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);

8 Comments

Lower() is not required in this case
These links might help figuring out rules and what NFD and NFC means: ICU User guide and Unicode Norm Forms
Can't believe the most upvoted answers are about hardcoding character maps. This should be the accepted answer.
This answer led me to the right solution for my needs. The rules here are too strong, I used only those rules to remove accents ':: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;'
Note: it requires php-intl module to be installed and enabled in config. Make sure you have it available if you plan to use this solution.
|
50

This is a piece of code I found and use often:

function stripAccents($stripAccents){
  return strtr($stripAccents,'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ','aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
}

4 Comments

As strtr() isn't multibyte aware, if your script file is encoded in a multibyte format (e.g. UTF-8) this function produces wrong results.
True. I tried to fork a github project just to edit a single line, not even related to accented chars, but when I saved the changes and created a pull request, it included the additional changes on all the lines that had accented chars hard-coded. The safer way is to use chr().
...additionally - this list does not contain many accented characters, e.g. ů, ž, ř, č, ...
Señor, Bjørk, Ł, etc. This is a list that is far, far longer than what's here.
26

The easiest way is to use iconv() PHP native function.

 echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', "Thîs îs à vêry wrong séntènce!");

 // output: This is a very wrong sentence!

8 Comments

This is not very reliable echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'usuario o contraseña incorrectos'); outputs usuario o contrase?a incorrectos
For your example you could use: setlocale(LC_CTYPE, 'cs_CZ'); echo iconv('UTF-8', 'ASCII//TRANSLIT', "usuario o contraseña incorrectos"); // output: usuario o contrasena incorrectos. Please refer to PHP Documentation for more info. Everything is there! php.net/manual/en/function.iconv.php
The iconv UTF8 to ASCII transliterations seem to be very strange. I get "usuario o contrase~na incorrectos" for my locale. It converts things like ñ to ~n and ö to :o I have found that transliterating to latin1 then converting extended ascii manually works best.
@Phil_1984_ What's your locale?
@Phil It's not en_UK but en_GB. UK is not the country code of the UK, the TLD .uk is just an exception. The traditional and official name here is still GB for Great Britain.
|
19

When using iconv, the parameter locale must be set:

function test_enc($text = 'ěščřžýáíé ĚŠČŘŽÝÁÍÉ fóø bår FÓØ BÅR æ')
{
    echo '<tt>';
    echo iconv('utf8', 'ascii//TRANSLIT', $text);
    echo '</tt><br/>';
} 

test_enc();
setlocale(LC_ALL, 'cs_CZ.utf8');
test_enc();
setlocale(LC_ALL, 'en_US.utf8');
test_enc();

Yields into:

????????? ????????? f?? b?r F?? B?R ae
escrzyaie ESCRZYAIE fo? bar FO? BAR ae
escrzyaie ESCRZYAIE fo? bar FO? BAR ae

Another locales then cs_CZ and en_US I haven't installed and I can't test it.

In C# I see solution using translation to unicode normalized form - accents are splitted out and then filtered via nonspacing unicode category.

Comments

12

Indeed is a matter of taste. There are many flavors for converting such letters.

function replaceAccents($str)
{
  $a = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ĕ', 'ĕ', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ĝ', 'ĝ', 'Ğ', 'ğ', 'Ġ', 'ġ', 'Ģ', 'ģ', 'Ĥ', 'ĥ', 'Ħ', 'ħ', 'Ĩ', 'ĩ', 'Ī', 'ī', 'Ĭ', 'ĭ', 'Į', 'į', 'İ', 'ı', 'IJ', 'ij', 'Ĵ', 'ĵ', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ŀ', 'ŀ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'ʼn', 'Ō', 'ō', 'Ŏ', 'ŏ', 'Ő', 'ő', 'Œ', 'œ', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ŝ', 'ŝ', 'Ş', 'ş', 'Š', 'š', 'Ţ', 'ţ', 'Ť', 'ť', 'Ŧ', 'ŧ', 'Ũ', 'ũ', 'Ū', 'ū', 'Ŭ', 'ŭ', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ŵ', 'ŵ', 'Ŷ', 'ŷ', 'Ÿ', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'ſ', 'ƒ', 'Ơ', 'ơ', 'Ư', 'ư', 'Ǎ', 'ǎ', 'Ǐ', 'ǐ', 'Ǒ', 'ǒ', 'Ǔ', 'ǔ', 'Ǖ', 'ǖ', 'Ǘ', 'ǘ', 'Ǚ', 'ǚ', 'Ǜ', 'ǜ', 'Ǻ', 'ǻ', 'Ǽ', 'ǽ', 'Ǿ', 'ǿ');
  $b = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o');
  return str_replace($a, $b, $str);
}

1 Comment

this is a very incomplete implementation, see John R's reply
10

I think the problem here is that your encodings consider ä and å different symbols to 'a'. In fact, the PHP documentation for strtr offers a sample for removing accents the ugly way :(

https://www.php.net/strtr

4 Comments

I think you should probably suggest mb_strstr() instead, as his input is UTF8
The //TRANSLIT in the iconv call is meant to convert to the nearest available alternative in the target encoding. This should include removing accents, or converting a single character into two, e.g. ñ might become n~
Since the server doesn't support iconv properly, looks like I'll be doing it this way afterall. Thanks Jeremy.
@karim79 mb_strstr is the wrong function, and there is no mb_strtr
8

here is a simple function that i use usually to remove accents :

function str_without_accents($str, $charset='utf-8')
{
    $str = htmlentities($str, ENT_NOQUOTES, $charset);

    $str = preg_replace('#&([A-za-z])(?:acute|cedil|caron|circ|grave|orn|ring|slash|th|tilde|uml);#', '\1', $str);
    $str = preg_replace('#&([A-za-z]{2})(?:lig);#', '\1', $str); // pour les ligatures e.g. '&oelig;'
    $str = preg_replace('#&[^;]+;#', '', $str); // supprime les autres caractères

    return $str;   // or add this : mb_strtoupper($str); for uppercase :)
}

1 Comment

This does not seem very future-proof. If further HTML entities are allocated in future then they might coincidentally match your regex (lots of words end in "ring", for example).
6

You could use urlencode. Does not quite do what you want (remove accents), but will give you a url usable string

$output = urlencode ($input);

In Perl I could use a translate regex, but I cannot think of the PHP equivalent

$input =~ tr/áâàå/aaaa/;

etc...

you could do this using preg_replace

$patterns[0] = '/[á|â|à|å|ä]/';
$patterns[1] = '/[ð|é|ê|è|ë]/';
$patterns[2] = '/[í|î|ì|ï]/';
$patterns[3] = '/[ó|ô|ò|ø|õ|ö]/';
$patterns[4] = '/[ú|û|ù|ü]/';
$patterns[5] = '/æ/';
$patterns[6] = '/ç/';
$patterns[7] = '/ß/';
$replacements[0] = 'a';
$replacements[1] = 'e';
$replacements[2] = 'i';
$replacements[3] = 'o';
$replacements[4] = 'u';
$replacements[5] = 'ae';
$replacements[6] = 'c';
$replacements[7] = 'ss';

$output = preg_replace($patterns, $replacements, $input);

(Please note this was typed from a foggy beer ridden Friday after noon memory, so may not be 100% correct)

or you could make a hash table and do a replacement based off of that.

1 Comment

php equivalent of tr/.../... is strtr
6

All of these are wrong. https://stackoverflow.com/a/35177899/308851 gets close but it gets Latin involved and gives no source for the rule either.

Let's check the standard... the library provided by the Unicode Consortium is ICU and the documentation has this to say:

For example, to remove accents from characters, use the following transform:

NFD; [:Nonspacing Mark:] Remove; NFC.

This transform separates accents from their base characters, removes the accents, and then puts the remaining text into an unaccented form.

That's it. No need for anything else.

Thus, if you have intl installed then you can do

$transliterator = Transliterator::createFromRules(':: NFD; :: [:Mn:] Remove; :: NFC;');
echo $transliterator->transliterate($string);

That's it. That's the answer to the question.

If you need to do this somewhere where intl is not available, you can snapshot what it does on a machine which does have intl:

<?php

$transliterator = \Transliterator::createFromRules(':: NFD; :: [:Mn:] Remove; :: NFC;');
$letters = preg_grep('/\pL/u', array_map('utf8', range(0x80, 0x2000)));
$letters = array_combine($letters, $letters);
$transliterated = array_map([$transliterator, 'transliterate'], $letters);
$map = array_diff_assoc($transliterated, $letters);
print count($map);
$search  = [' => ', 'array (', ')', '  ', "\n"];
$replace = ['=>', '[', ']', '', ''];
$map = str_replace($search, $replace, var_export($map, TRUE));
file_put_contents("map.php", "<?php\n\$map = $map;");
function utf8($num)
{
  if($num<=0x7F)       return chr($num);
  if($num<=0x7FF)      return chr(($num>>6)+192).chr(($num&63)+128);
  if($num<=0xFFFF)     return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
  if($num<=0x1FFFFF)   return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128).chr(($num&63)+128);
  return '';
}
?>

Here I have snapshotted the first 8192 characters which includes most non-Hiragana/Katakana scripts. The results are 9146 bytes long (Hiragana/Katakana more than doubles that and I didn't need it) encoding 827 replacements. You can put in your codebase and use it like this:

  function removeAccents($string) {
    include 'map.php';
    return strtr($string, $map);
  }

(this is demo, rather inline map.php into a class constant etc)

This is similar to https://stackoverflow.com/a/29280197/308851 but it includes a much larger number of characters -- and shows you how to obtain this mapping.

Ps.: Transliterating to ASCII is an entirely different bag of hurt. This is answer to "How do I remove accents from characters in a PHP string?"

Pps.: Here's the snapshot:

$map = ['À'=>'A','Á'=>'A','Â'=>'A','Ã'=>'A','Ä'=>'A','Å'=>'A','Ç'=>'C','È'=>'E','É'=>'E','Ê'=>'E','Ë'=>'E','Ì'=>'I','Í'=>'I','Î'=>'I','Ï'=>'I','Ñ'=>'N','Ò'=>'O','Ó'=>'O','Ô'=>'O','Õ'=>'O','Ö'=>'O','Ù'=>'U','Ú'=>'U','Û'=>'U','Ü'=>'U','Ý'=>'Y','à'=>'a','á'=>'a','â'=>'a','ã'=>'a','ä'=>'a','å'=>'a','ç'=>'c','è'=>'e','é'=>'e','ê'=>'e','ë'=>'e','ì'=>'i','í'=>'i','î'=>'i','ï'=>'i','ñ'=>'n','ò'=>'o','ó'=>'o','ô'=>'o','õ'=>'o','ö'=>'o','ù'=>'u','ú'=>'u','û'=>'u','ü'=>'u','ý'=>'y','ÿ'=>'y','Ā'=>'A','ā'=>'a','Ă'=>'A','ă'=>'a','Ą'=>'A','ą'=>'a','Ć'=>'C','ć'=>'c','Ĉ'=>'C','ĉ'=>'c','Ċ'=>'C','ċ'=>'c','Č'=>'C','č'=>'c','Ď'=>'D','ď'=>'d','Ē'=>'E','ē'=>'e','Ĕ'=>'E','ĕ'=>'e','Ė'=>'E','ė'=>'e','Ę'=>'E','ę'=>'e','Ě'=>'E','ě'=>'e','Ĝ'=>'G','ĝ'=>'g','Ğ'=>'G','ğ'=>'g','Ġ'=>'G','ġ'=>'g','Ģ'=>'G','ģ'=>'g','Ĥ'=>'H','ĥ'=>'h','Ĩ'=>'I','ĩ'=>'i','Ī'=>'I','ī'=>'i','Ĭ'=>'I','ĭ'=>'i','Į'=>'I','į'=>'i','İ'=>'I','Ĵ'=>'J','ĵ'=>'j','Ķ'=>'K','ķ'=>'k','Ĺ'=>'L','ĺ'=>'l','Ļ'=>'L','ļ'=>'l','Ľ'=>'L','ľ'=>'l','Ń'=>'N','ń'=>'n','Ņ'=>'N','ņ'=>'n','Ň'=>'N','ň'=>'n','Ō'=>'O','ō'=>'o','Ŏ'=>'O','ŏ'=>'o','Ő'=>'O','ő'=>'o','Ŕ'=>'R','ŕ'=>'r','Ŗ'=>'R','ŗ'=>'r','Ř'=>'R','ř'=>'r','Ś'=>'S','ś'=>'s','Ŝ'=>'S','ŝ'=>'s','Ş'=>'S','ş'=>'s','Š'=>'S','š'=>'s','Ţ'=>'T','ţ'=>'t','Ť'=>'T','ť'=>'t','Ũ'=>'U','ũ'=>'u','Ū'=>'U','ū'=>'u','Ŭ'=>'U','ŭ'=>'u','Ů'=>'U','ů'=>'u','Ű'=>'U','ű'=>'u','Ų'=>'U','ų'=>'u','Ŵ'=>'W','ŵ'=>'w','Ŷ'=>'Y','ŷ'=>'y','Ÿ'=>'Y','Ź'=>'Z','ź'=>'z','Ż'=>'Z','ż'=>'z','Ž'=>'Z','ž'=>'z','Ơ'=>'O','ơ'=>'o','Ư'=>'U','ư'=>'u','Ǎ'=>'A','ǎ'=>'a','Ǐ'=>'I','ǐ'=>'i','Ǒ'=>'O','ǒ'=>'o','Ǔ'=>'U','ǔ'=>'u','Ǖ'=>'U','ǖ'=>'u','Ǘ'=>'U','ǘ'=>'u','Ǚ'=>'U','ǚ'=>'u','Ǜ'=>'U','ǜ'=>'u','Ǟ'=>'A','ǟ'=>'a','Ǡ'=>'A','ǡ'=>'a','Ǣ'=>'Æ','ǣ'=>'æ','Ǧ'=>'G','ǧ'=>'g','Ǩ'=>'K','ǩ'=>'k','Ǫ'=>'O','ǫ'=>'o','Ǭ'=>'O','ǭ'=>'o','Ǯ'=>'Ʒ','ǯ'=>'ʒ','ǰ'=>'j','Ǵ'=>'G','ǵ'=>'g','Ǹ'=>'N','ǹ'=>'n','Ǻ'=>'A','ǻ'=>'a','Ǽ'=>'Æ','ǽ'=>'æ','Ǿ'=>'Ø','ǿ'=>'ø','Ȁ'=>'A','ȁ'=>'a','Ȃ'=>'A','ȃ'=>'a','Ȅ'=>'E','ȅ'=>'e','Ȇ'=>'E','ȇ'=>'e','Ȉ'=>'I','ȉ'=>'i','Ȋ'=>'I','ȋ'=>'i','Ȍ'=>'O','ȍ'=>'o','Ȏ'=>'O','ȏ'=>'o','Ȑ'=>'R','ȑ'=>'r','Ȓ'=>'R','ȓ'=>'r','Ȕ'=>'U','ȕ'=>'u','Ȗ'=>'U','ȗ'=>'u','Ș'=>'S','ș'=>'s','Ț'=>'T','ț'=>'t','Ȟ'=>'H','ȟ'=>'h','Ȧ'=>'A','ȧ'=>'a','Ȩ'=>'E','ȩ'=>'e','Ȫ'=>'O','ȫ'=>'o','Ȭ'=>'O','ȭ'=>'o','Ȯ'=>'O','ȯ'=>'o','Ȱ'=>'O','ȱ'=>'o','Ȳ'=>'Y','ȳ'=>'y','ʹ'=>'ʹ','Ά'=>'Α','Έ'=>'Ε','Ή'=>'Η','Ί'=>'Ι','Ό'=>'Ο','Ύ'=>'Υ','Ώ'=>'Ω','ΐ'=>'ι','Ϊ'=>'Ι','Ϋ'=>'Υ','ά'=>'α','έ'=>'ε','ή'=>'η','ί'=>'ι','ΰ'=>'υ','ϊ'=>'ι','ϋ'=>'υ','ό'=>'ο','ύ'=>'υ','ώ'=>'ω','ϓ'=>'ϒ','ϔ'=>'ϒ','Ѐ'=>'Е','Ё'=>'Е','Ѓ'=>'Г','Ї'=>'І','Ќ'=>'К','Ѝ'=>'И','Ў'=>'У','Й'=>'И','й'=>'и','ѐ'=>'е','ё'=>'е','ѓ'=>'г','ї'=>'і','ќ'=>'к','ѝ'=>'и','ў'=>'у','Ѷ'=>'Ѵ','ѷ'=>'ѵ','Ӂ'=>'Ж','ӂ'=>'ж','Ӑ'=>'А','ӑ'=>'а','Ӓ'=>'А','ӓ'=>'а','Ӗ'=>'Е','ӗ'=>'е','Ӛ'=>'Ә','ӛ'=>'ә','Ӝ'=>'Ж','ӝ'=>'ж','Ӟ'=>'З','ӟ'=>'з','Ӣ'=>'И','ӣ'=>'и','Ӥ'=>'И','ӥ'=>'и','Ӧ'=>'О','ӧ'=>'о','Ӫ'=>'Ө','ӫ'=>'ө','Ӭ'=>'Э','ӭ'=>'э','Ӯ'=>'У','ӯ'=>'у','Ӱ'=>'У','ӱ'=>'у','Ӳ'=>'У','ӳ'=>'у','Ӵ'=>'Ч','ӵ'=>'ч','Ӹ'=>'Ы','ӹ'=>'ы','آ'=>'ا','أ'=>'ا','ؤ'=>'و','إ'=>'ا','ئ'=>'ي','ۀ'=>'ە','ۂ'=>'ہ','ۓ'=>'ے','ऩ'=>'न','ऱ'=>'र','ऴ'=>'ळ','क़'=>'क','ख़'=>'ख','ग़'=>'ग','ज़'=>'ज','ड़'=>'ड','ढ़'=>'ढ','फ़'=>'फ','य़'=>'य','ড়'=>'ড','ঢ়'=>'ঢ','য়'=>'য','ਲ਼'=>'ਲ','ਸ਼'=>'ਸ','ਖ਼'=>'ਖ','ਗ਼'=>'ਗ','ਜ਼'=>'ਜ','ਫ਼'=>'ਫ','ଡ଼'=>'ଡ','ଢ଼'=>'ଢ','གྷ'=>'ག','ཌྷ'=>'ཌ','དྷ'=>'ད','བྷ'=>'བ','ཛྷ'=>'ཛ','ཀྵ'=>'ཀ','ဦ'=>'ဥ','Ḁ'=>'A','ḁ'=>'a','Ḃ'=>'B','ḃ'=>'b','Ḅ'=>'B','ḅ'=>'b','Ḇ'=>'B','ḇ'=>'b','Ḉ'=>'C','ḉ'=>'c','Ḋ'=>'D','ḋ'=>'d','Ḍ'=>'D','ḍ'=>'d','Ḏ'=>'D','ḏ'=>'d','Ḑ'=>'D','ḑ'=>'d','Ḓ'=>'D','ḓ'=>'d','Ḕ'=>'E','ḕ'=>'e','Ḗ'=>'E','ḗ'=>'e','Ḙ'=>'E','ḙ'=>'e','Ḛ'=>'E','ḛ'=>'e','Ḝ'=>'E','ḝ'=>'e','Ḟ'=>'F','ḟ'=>'f','Ḡ'=>'G','ḡ'=>'g','Ḣ'=>'H','ḣ'=>'h','Ḥ'=>'H','ḥ'=>'h','Ḧ'=>'H','ḧ'=>'h','Ḩ'=>'H','ḩ'=>'h','Ḫ'=>'H','ḫ'=>'h','Ḭ'=>'I','ḭ'=>'i','Ḯ'=>'I','ḯ'=>'i','Ḱ'=>'K','ḱ'=>'k','Ḳ'=>'K','ḳ'=>'k','Ḵ'=>'K','ḵ'=>'k','Ḷ'=>'L','ḷ'=>'l','Ḹ'=>'L','ḹ'=>'l','Ḻ'=>'L','ḻ'=>'l','Ḽ'=>'L','ḽ'=>'l','Ḿ'=>'M','ḿ'=>'m','Ṁ'=>'M','ṁ'=>'m','Ṃ'=>'M','ṃ'=>'m','Ṅ'=>'N','ṅ'=>'n','Ṇ'=>'N','ṇ'=>'n','Ṉ'=>'N','ṉ'=>'n','Ṋ'=>'N','ṋ'=>'n','Ṍ'=>'O','ṍ'=>'o','Ṏ'=>'O','ṏ'=>'o','Ṑ'=>'O','ṑ'=>'o','Ṓ'=>'O','ṓ'=>'o','Ṕ'=>'P','ṕ'=>'p','Ṗ'=>'P','ṗ'=>'p','Ṙ'=>'R','ṙ'=>'r','Ṛ'=>'R','ṛ'=>'r','Ṝ'=>'R','ṝ'=>'r','Ṟ'=>'R','ṟ'=>'r','Ṡ'=>'S','ṡ'=>'s','Ṣ'=>'S','ṣ'=>'s','Ṥ'=>'S','ṥ'=>'s','Ṧ'=>'S','ṧ'=>'s','Ṩ'=>'S','ṩ'=>'s','Ṫ'=>'T','ṫ'=>'t','Ṭ'=>'T','ṭ'=>'t','Ṯ'=>'T','ṯ'=>'t','Ṱ'=>'T','ṱ'=>'t','Ṳ'=>'U','ṳ'=>'u','Ṵ'=>'U','ṵ'=>'u','Ṷ'=>'U','ṷ'=>'u','Ṹ'=>'U','ṹ'=>'u','Ṻ'=>'U','ṻ'=>'u','Ṽ'=>'V','ṽ'=>'v','Ṿ'=>'V','ṿ'=>'v','Ẁ'=>'W','ẁ'=>'w','Ẃ'=>'W','ẃ'=>'w','Ẅ'=>'W','ẅ'=>'w','Ẇ'=>'W','ẇ'=>'w','Ẉ'=>'W','ẉ'=>'w','Ẋ'=>'X','ẋ'=>'x','Ẍ'=>'X','ẍ'=>'x','Ẏ'=>'Y','ẏ'=>'y','Ẑ'=>'Z','ẑ'=>'z','Ẓ'=>'Z','ẓ'=>'z','Ẕ'=>'Z','ẕ'=>'z','ẖ'=>'h','ẗ'=>'t','ẘ'=>'w','ẙ'=>'y','ẛ'=>'ſ','Ạ'=>'A','ạ'=>'a','Ả'=>'A','ả'=>'a','Ấ'=>'A','ấ'=>'a','Ầ'=>'A','ầ'=>'a','Ẩ'=>'A','ẩ'=>'a','Ẫ'=>'A','ẫ'=>'a','Ậ'=>'A','ậ'=>'a','Ắ'=>'A','ắ'=>'a','Ằ'=>'A','ằ'=>'a','Ẳ'=>'A','ẳ'=>'a','Ẵ'=>'A','ẵ'=>'a','Ặ'=>'A','ặ'=>'a','Ẹ'=>'E','ẹ'=>'e','Ẻ'=>'E','ẻ'=>'e','Ẽ'=>'E','ẽ'=>'e','Ế'=>'E','ế'=>'e','Ề'=>'E','ề'=>'e','Ể'=>'E','ể'=>'e','Ễ'=>'E','ễ'=>'e','Ệ'=>'E','ệ'=>'e','Ỉ'=>'I','ỉ'=>'i','Ị'=>'I','ị'=>'i','Ọ'=>'O','ọ'=>'o','Ỏ'=>'O','ỏ'=>'o','Ố'=>'O','ố'=>'o','Ồ'=>'O','ồ'=>'o','Ổ'=>'O','ổ'=>'o','Ỗ'=>'O','ỗ'=>'o','Ộ'=>'O','ộ'=>'o','Ớ'=>'O','ớ'=>'o','Ờ'=>'O','ờ'=>'o','Ở'=>'O','ở'=>'o','Ỡ'=>'O','ỡ'=>'o','Ợ'=>'O','ợ'=>'o','Ụ'=>'U','ụ'=>'u','Ủ'=>'U','ủ'=>'u','Ứ'=>'U','ứ'=>'u','Ừ'=>'U','ừ'=>'u','Ử'=>'U','ử'=>'u','Ữ'=>'U','ữ'=>'u','Ự'=>'U','ự'=>'u','Ỳ'=>'Y','ỳ'=>'y','Ỵ'=>'Y','ỵ'=>'y','Ỷ'=>'Y','ỷ'=>'y','Ỹ'=>'Y','ỹ'=>'y','ἀ'=>'α','ἁ'=>'α','ἂ'=>'α','ἃ'=>'α','ἄ'=>'α','ἅ'=>'α','ἆ'=>'α','ἇ'=>'α','Ἀ'=>'Α','Ἁ'=>'Α','Ἂ'=>'Α','Ἃ'=>'Α','Ἄ'=>'Α','Ἅ'=>'Α','Ἆ'=>'Α','Ἇ'=>'Α','ἐ'=>'ε','ἑ'=>'ε','ἒ'=>'ε','ἓ'=>'ε','ἔ'=>'ε','ἕ'=>'ε','Ἐ'=>'Ε','Ἑ'=>'Ε','Ἒ'=>'Ε','Ἓ'=>'Ε','Ἔ'=>'Ε','Ἕ'=>'Ε','ἠ'=>'η','ἡ'=>'η','ἢ'=>'η','ἣ'=>'η','ἤ'=>'η','ἥ'=>'η','ἦ'=>'η','ἧ'=>'η','Ἠ'=>'Η','Ἡ'=>'Η','Ἢ'=>'Η','Ἣ'=>'Η','Ἤ'=>'Η','Ἥ'=>'Η','Ἦ'=>'Η','Ἧ'=>'Η','ἰ'=>'ι','ἱ'=>'ι','ἲ'=>'ι','ἳ'=>'ι','ἴ'=>'ι','ἵ'=>'ι','ἶ'=>'ι','ἷ'=>'ι','Ἰ'=>'Ι','Ἱ'=>'Ι','Ἲ'=>'Ι','Ἳ'=>'Ι','Ἴ'=>'Ι','Ἵ'=>'Ι','Ἶ'=>'Ι','Ἷ'=>'Ι','ὀ'=>'ο','ὁ'=>'ο','ὂ'=>'ο','ὃ'=>'ο','ὄ'=>'ο','ὅ'=>'ο','Ὀ'=>'Ο','Ὁ'=>'Ο','Ὂ'=>'Ο','Ὃ'=>'Ο','Ὄ'=>'Ο','Ὅ'=>'Ο','ὐ'=>'υ','ὑ'=>'υ','ὒ'=>'υ','ὓ'=>'υ','ὔ'=>'υ','ὕ'=>'υ','ὖ'=>'υ','ὗ'=>'υ','Ὑ'=>'Υ','Ὓ'=>'Υ','Ὕ'=>'Υ','Ὗ'=>'Υ','ὠ'=>'ω','ὡ'=>'ω','ὢ'=>'ω','ὣ'=>'ω','ὤ'=>'ω','ὥ'=>'ω','ὦ'=>'ω','ὧ'=>'ω','Ὠ'=>'Ω','Ὡ'=>'Ω','Ὢ'=>'Ω','Ὣ'=>'Ω','Ὤ'=>'Ω','Ὥ'=>'Ω','Ὦ'=>'Ω','Ὧ'=>'Ω','ὰ'=>'α','ά'=>'α','ὲ'=>'ε','έ'=>'ε','ὴ'=>'η','ή'=>'η','ὶ'=>'ι','ί'=>'ι','ὸ'=>'ο','ό'=>'ο','ὺ'=>'υ','ύ'=>'υ','ὼ'=>'ω','ώ'=>'ω','ᾀ'=>'α','ᾁ'=>'α','ᾂ'=>'α','ᾃ'=>'α','ᾄ'=>'α','ᾅ'=>'α','ᾆ'=>'α','ᾇ'=>'α','ᾈ'=>'Α','ᾉ'=>'Α','ᾊ'=>'Α','ᾋ'=>'Α','ᾌ'=>'Α','ᾍ'=>'Α','ᾎ'=>'Α','ᾏ'=>'Α','ᾐ'=>'η','ᾑ'=>'η','ᾒ'=>'η','ᾓ'=>'η','ᾔ'=>'η','ᾕ'=>'η','ᾖ'=>'η','ᾗ'=>'η','ᾘ'=>'Η','ᾙ'=>'Η','ᾚ'=>'Η','ᾛ'=>'Η','ᾜ'=>'Η','ᾝ'=>'Η','ᾞ'=>'Η','ᾟ'=>'Η','ᾠ'=>'ω','ᾡ'=>'ω','ᾢ'=>'ω','ᾣ'=>'ω','ᾤ'=>'ω','ᾥ'=>'ω','ᾦ'=>'ω','ᾧ'=>'ω','ᾨ'=>'Ω','ᾩ'=>'Ω','ᾪ'=>'Ω','ᾫ'=>'Ω','ᾬ'=>'Ω','ᾭ'=>'Ω','ᾮ'=>'Ω','ᾯ'=>'Ω','ᾰ'=>'α','ᾱ'=>'α','ᾲ'=>'α','ᾳ'=>'α','ᾴ'=>'α','ᾶ'=>'α','ᾷ'=>'α','Ᾰ'=>'Α','Ᾱ'=>'Α','Ὰ'=>'Α','Ά'=>'Α','ᾼ'=>'Α','ι'=>'ι','ῂ'=>'η','ῃ'=>'η','ῄ'=>'η','ῆ'=>'η','ῇ'=>'η','Ὲ'=>'Ε','Έ'=>'Ε','Ὴ'=>'Η','Ή'=>'Η','ῌ'=>'Η','ῐ'=>'ι','ῑ'=>'ι','ῒ'=>'ι','ΐ'=>'ι','ῖ'=>'ι','ῗ'=>'ι','Ῐ'=>'Ι','Ῑ'=>'Ι','Ὶ'=>'Ι','Ί'=>'Ι','ῠ'=>'υ','ῡ'=>'υ','ῢ'=>'υ','ΰ'=>'υ','ῤ'=>'ρ','ῥ'=>'ρ','ῦ'=>'υ','ῧ'=>'υ','Ῠ'=>'Υ','Ῡ'=>'Υ','Ὺ'=>'Υ','Ύ'=>'Υ','Ῥ'=>'Ρ','ῲ'=>'ω','ῳ'=>'ω','ῴ'=>'ω','ῶ'=>'ω','ῷ'=>'ω','Ὸ'=>'Ο','Ό'=>'Ο','Ὼ'=>'Ω','Ώ'=>'Ω','ῼ'=>'Ω'];

Comments

4

I just created a removeAccents method based on the reading of this thread and this other one too (How to remove accents and turn letters into "plain" ASCII characters?).

The method is here: https://github.com/lingtalfi/Bat/blob/master/StringTool.md#removeaccents

Tests are here: https://github.com/lingtalfi/Bat/blob/master/btests/StringTool/removeAccents/stringTool.removeAccents.test.php,

and here is what was tested so far:

$a = [
    // easy
    '',
    'a',
    'après',
    'dédé fait la fête ?',
    // hard
    'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ',
    'ŻŹĆŃĄŚŁĘÓżźćńąśłęó',
    'qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq',
    'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöøùúûüýÿ',       
    'ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ',
    'ĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİĴĵĶķ',
    'ĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž',
    'ſƒƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǺǻǾǿ',
    'Ǽǽ',
];

and it converts only accentuated things (letters/ligatures/cédilles/some letters with a line through/...?).

Here is the content of the method: (https://github.com/lingtalfi/Bat/blob/master/StringTool.php#L83)

public static function removeAccents($str)
{
    static $map = [
        // single letters
        'à' => 'a',
        'á' => 'a',
        'â' => 'a',
        'ã' => 'a',
        'ä' => 'a',
        'ą' => 'a',
        'å' => 'a',
        'ā' => 'a',
        'ă' => 'a',
        'ǎ' => 'a',
        'ǻ' => 'a',
        'À' => 'A',
        'Á' => 'A',
        'Â' => 'A',
        'Ã' => 'A',
        'Ä' => 'A',
        'Ą' => 'A',
        'Å' => 'A',
        'Ā' => 'A',
        'Ă' => 'A',
        'Ǎ' => 'A',
        'Ǻ' => 'A',


        'ç' => 'c',
        'ć' => 'c',
        'ĉ' => 'c',
        'ċ' => 'c',
        'č' => 'c',
        'Ç' => 'C',
        'Ć' => 'C',
        'Ĉ' => 'C',
        'Ċ' => 'C',
        'Č' => 'C',

        'ď' => 'd',
        'đ' => 'd',
        'Ð' => 'D',
        'Ď' => 'D',
        'Đ' => 'D',


        'è' => 'e',
        'é' => 'e',
        'ê' => 'e',
        'ë' => 'e',
        'ę' => 'e',
        'ē' => 'e',
        'ĕ' => 'e',
        'ė' => 'e',
        'ě' => 'e',
        'È' => 'E',
        'É' => 'E',
        'Ê' => 'E',
        'Ë' => 'E',
        'Ę' => 'E',
        'Ē' => 'E',
        'Ĕ' => 'E',
        'Ė' => 'E',
        'Ě' => 'E',

        'ƒ' => 'f',


        'ĝ' => 'g',
        'ğ' => 'g',
        'ġ' => 'g',
        'ģ' => 'g',
        'Ĝ' => 'G',
        'Ğ' => 'G',
        'Ġ' => 'G',
        'Ģ' => 'G',


        'ĥ' => 'h',
        'ħ' => 'h',
        'Ĥ' => 'H',
        'Ħ' => 'H',

        'ì' => 'i',
        'í' => 'i',
        'î' => 'i',
        'ï' => 'i',
        'ĩ' => 'i',
        'ī' => 'i',
        'ĭ' => 'i',
        'į' => 'i',
        'ſ' => 'i',
        'ǐ' => 'i',
        'Ì' => 'I',
        'Í' => 'I',
        'Î' => 'I',
        'Ï' => 'I',
        'Ĩ' => 'I',
        'Ī' => 'I',
        'Ĭ' => 'I',
        'Į' => 'I',
        'İ' => 'I',
        'Ǐ' => 'I',

        'ĵ' => 'j',
        'Ĵ' => 'J',

        'ķ' => 'k',
        'Ķ' => 'K',


        'ł' => 'l',
        'ĺ' => 'l',
        'ļ' => 'l',
        'ľ' => 'l',
        'ŀ' => 'l',
        'Ł' => 'L',
        'Ĺ' => 'L',
        'Ļ' => 'L',
        'Ľ' => 'L',
        'Ŀ' => 'L',


        'ñ' => 'n',
        'ń' => 'n',
        'ņ' => 'n',
        'ň' => 'n',
        'ʼn' => 'n',
        'Ñ' => 'N',
        'Ń' => 'N',
        'Ņ' => 'N',
        'Ň' => 'N',

        'ò' => 'o',
        'ó' => 'o',
        'ô' => 'o',
        'õ' => 'o',
        'ö' => 'o',
        'ð' => 'o',
        'ø' => 'o',
        'ō' => 'o',
        'ŏ' => 'o',
        'ő' => 'o',
        'ơ' => 'o',
        'ǒ' => 'o',
        'ǿ' => 'o',
        'Ò' => 'O',
        'Ó' => 'O',
        'Ô' => 'O',
        'Õ' => 'O',
        'Ö' => 'O',
        'Ø' => 'O',
        'Ō' => 'O',
        'Ŏ' => 'O',
        'Ő' => 'O',
        'Ơ' => 'O',
        'Ǒ' => 'O',
        'Ǿ' => 'O',


        'ŕ' => 'r',
        'ŗ' => 'r',
        'ř' => 'r',
        'Ŕ' => 'R',
        'Ŗ' => 'R',
        'Ř' => 'R',


        'ś' => 's',
        'š' => 's',
        'ŝ' => 's',
        'ş' => 's',
        'Ś' => 'S',
        'Š' => 'S',
        'Ŝ' => 'S',
        'Ş' => 'S',

        'ţ' => 't',
        'ť' => 't',
        'ŧ' => 't',
        'Ţ' => 'T',
        'Ť' => 'T',
        'Ŧ' => 'T',


        'ù' => 'u',
        'ú' => 'u',
        'û' => 'u',
        'ü' => 'u',
        'ũ' => 'u',
        'ū' => 'u',
        'ŭ' => 'u',
        'ů' => 'u',
        'ű' => 'u',
        'ų' => 'u',
        'ư' => 'u',
        'ǔ' => 'u',
        'ǖ' => 'u',
        'ǘ' => 'u',
        'ǚ' => 'u',
        'ǜ' => 'u',
        'Ù' => 'U',
        'Ú' => 'U',
        'Û' => 'U',
        'Ü' => 'U',
        'Ũ' => 'U',
        'Ū' => 'U',
        'Ŭ' => 'U',
        'Ů' => 'U',
        'Ű' => 'U',
        'Ų' => 'U',
        'Ư' => 'U',
        'Ǔ' => 'U',
        'Ǖ' => 'U',
        'Ǘ' => 'U',
        'Ǚ' => 'U',
        'Ǜ' => 'U',


        'ŵ' => 'w',
        'Ŵ' => 'W',

        'ý' => 'y',
        'ÿ' => 'y',
        'ŷ' => 'y',
        'Ý' => 'Y',
        'Ÿ' => 'Y',
        'Ŷ' => 'Y',

        'ż' => 'z',
        'ź' => 'z',
        'ž' => 'z',
        'Ż' => 'Z',
        'Ź' => 'Z',
        'Ž' => 'Z',


        // accentuated ligatures
        'Ǽ' => 'A',
        'ǽ' => 'a',
    ];
    return strtr($str, $map);
}

1 Comment

Nice lib, thanks for sharing. Works for CZ locale.
4

What's wrong with this one? Works with UTF8

function strip_accents($s){
  return str_replace(
    explode(' ', preg_replace('/ +/', ' ', 'č ć ž š đ  Č Ć Ž Š Đ  à á â ã ä ç è é ê ë ì í î ï ñ ò ó ô õ ö ù ú û ü ý ÿ À Á Â Ã Ä Ç È É Ê Ë Ì Í Î Ï Ñ Ò Ó Ô Õ Ö Ù Ú Û Ü Ý')),
    explode(' ', preg_replace('/ +/', ' ', 'c c z s dj C C Z S DJ a a a a a c e e e e i i i i n o o o o o u u u u y y A A A A A C E E E E I I I I N O O O O O U U U U Y')),
    $s);
}

It can be faster by not using preg_replace, but speed was not my goal here.

Comments

2

I agree with georgebrock's comment.

If you find a way to get //TRANSLIT to work, you can build friendly URLs:

  1. use iconv with //TRANSLIT ñ => n~
    • remove non-alphanumeric non-whitespace chars inside words: $url = preg_replace( '/(\w)[^\w\s](\w)/', '$1$2', $url );
    • replace remaining separations: $url = preg_replace( '/[^a-z0-9]+/', '-', $url );
    • remove double/leading/traling: $url = preg_replace( '-', e.g. '/(?:(^|\-)\-+|\-$)/', '', $url );

If you can't get it to work, replace setp 1 with strtr/character-based replacement, like Xetius' solution.

Comments

2

I can't reproduce your problem. I get the expected result.

How exactly are you using mb_detect_encoding() to verify your string is in fact UTF-8?

If I simply call mb_detect_encoding($input) on both a UTF-8 and ISO-8859-1 encoded version of your string, both of them return "UTF-8", so that function isn't particularly reliable.

iconv() gives me a PHP "notice" when it gets the wrongly encoded string and only echoes "F", but that might just be because of different PHP/iconv settings/versions (?).

I suggest to you try calling mb_check_encoding($input, "utf-8") first to verify that your string really is UTF-8. I think it probably isn't.

6 Comments

Thanks for the tip. mb_check_encoding($input, "utf-8") is returning TRUE. Also, I was already using error_reporting(E_ALL) so there shouldn't be any errors slipping past me.
Hmmm, I see your point. I tried it on another machine now and that returns "Fo? Bar". What PHP and iconv versions are you using?
I think it is the iconv version that is at fault - this server is using the glibc version instead of the libiconv version.
Thanks mercator, you were really helpful.
Thanks for your explanation as well. I didn't realise it wasn't just version numbers. The difference on my end was also due to the different iconv implentations.
|
2

Merged Cazuma Nii Cavalcanti's implementation with Junior Mayhé's char list, hoping to save some time for some of you.

function stripAccents($str) {
    return strtr(utf8_decode($str), utf8_decode('ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƒƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǺǻǼǽǾǿ'), 'AAAAAAAECEEEEIIIIDNOOOOOOUUUUYsaaaaaaaeceeeeiiiinoooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiIJijJjKkLlLlLlLlllNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuAaAEaeOo');
}

1 Comment

You are trying to utf8_decode non latin1 characters which will give you back the '?' character. Also your strings are different lengths because of the Ǽ to AE conversions. This method wont work.
2

In laravel you can simply use str_slug($accentedPhrase) and if you care about dash (-) that this method substitute with space you can use str_replace('-', ' ', str_slug($accentedPhrase))

2 Comments

you don't need to use replace, you can set secont argument to blank string str_slug($word, ' ');
Seems str_slug is already deprecated by now; you can use Str::slug() instead as stated in documentation. As said in another comment, second parameter is combining character. So you can use as Str::slug('$accentedPhrase', ' '). Also don't forget to import Illuminate\Support\Str
2

Something like this?

$arrSearch  = explode(","," ,ç,æ, œ, á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,e,i,ø,u");

$arrReplace = explode(",","_,c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,e,i,o,u");

$output = str_replace($arrSearch, $arrReplace, $input);

Comments

2

If the main task is just to use the string in a URL, why not to use slugyfier?

composer require cocur/slugify

then

use Cocur\Slugify\Slugify;

$slugify = new Slugify();
echo $slugify->slugify('Fóø Bår');

It also has many bridges for popular frameworks. E.g. you can use Doctrine Extensions Sluggable behaviour to generate automatically unique slug for each entity in DB and use it in URL.

If you want just to wipe out all accents you can play around with rulesets to satisfy the requirements.

Comments

2
function Unaccent($string){
    return preg_replace('~&([a-z]{1,2}) (acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8'));
}

2 Comments

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.
1

This answer I've got following tips here, so it is not really mine. It works for me using LATIN1 or UTF-8. If you use other charsets, you probably should add them to mb_detect_encoding function. Correct environment set is probably needed also.

function NoAccents($s){
        return iconv(mb_detect_encoding($s,'UTF-8, ASCII, ISO-8859-1'),'ASCII//TRANSLIT//INGORE',$s);
}

1 Comment

For Fóø Bår, I actually only got Fo? Bar. Could not have ø character translated to o. Tried changing my environment to no_NO, da_DK, but it did not interfere. Using setlocale(LC_CTYPE,'da_DK') I've got Fo? Baar.
1

Combining some of the answers:

/**
 * Given an utf8 string, returns an ascii string.
 * Only supports ISO-8859-1 chars, not chars like č, œ, Œ, ř, Š, š, ů, Ÿ or ž
 * @param string $utf8
 * @return string ascii
 */
function stripAccents(string $utf8): string
{
    $src = 'àáâãäåçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ';
    $dst = 'aaaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY';
    $iso8859_1 = strtr(
        utf8_decode($utf8),
        utf8_decode($src),
        $dst
    );
    // iconv below not used for UTF-8 directly because it adds chars like ^'"`
    // which are nice to not lose info, but not very readable and not compatible with filenames
    // iconv() additionally converts chars like
    // ¡ ¢ £  ¥   ¦ §  ¨  ©  ª «  ¬   SHY ®    °  ±   ²  ³  ´ µ ¶ · ¸ ¹  º »  ¼   ½   ¾   ¿ Å Æ  Ð Ø Þ  ß  æ  ÷   ø þ
    // ! c lb yen | SS " (c) a << not SHY (R)  ^0 +/- ^2 ^3 ' u P . , ^1 o >> 1/4 1/2 3/4 ? A AE D O Th ss ae d : o th
    return iconv('ISO-8859-1', 'ASCII//TRANSLIT//IGNORE', $iso8859_1);
}

Comments

1

Since Symfony 5+

composer require symfony/string 
use Symfony\Component\String\Slugger\AsciiSlugger;

$slugger = new AsciiSlugger();
$slug = $slugger->slug('Wôrķšƥáçè ~~sèťtïñğš~~');
// $slug = 'Workspace-settings'

More: https://symfony.com/doc/current/components/string.html#slugger

Comments

0

You can use an array key => value style to use with strtr() safely for UTF-8 characters even if they are multi-bytes.

function no_accent($str){
    $accents = array('À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A', 'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'Ā' => 'A', 'ā' => 'a', 'Ă' => 'A', 'ă' => 'a', 'Ą' => 'A', 'ą' => 'a', 'Ç' => 'C', 'ç' => 'c', 'Ć' => 'C', 'ć' => 'c', 'Ĉ' => 'C', 'ĉ' => 'c', 'Ċ' => 'C', 'ċ' => 'c', 'Č' => 'C', 'č' => 'c', 'Ð' => 'D', 'ð' => 'd', 'Ď' => 'D', 'ď' => 'd', 'Đ' => 'D', 'đ' => 'd', 'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'Ē' => 'E', 'ē' => 'e', 'Ĕ' => 'E', 'ĕ' => 'e', 'Ė' => 'E', 'ė' => 'e', 'Ę' => 'E', 'ę' => 'e', 'Ě' => 'E', 'ě' => 'e', 'Ĝ' => 'G', 'ĝ' => 'g', 'Ğ' => 'G', 'ğ' => 'g', 'Ġ' => 'G', 'ġ' => 'g', 'Ģ' => 'G', 'ģ' => 'g', 'Ĥ' => 'H', 'ĥ' => 'h', 'Ħ' => 'H', 'ħ' => 'h', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'Ĩ' => 'I', 'ĩ' => 'i', 'Ī' => 'I', 'ī' => 'i', 'Ĭ' => 'I', 'ĭ' => 'i', 'Į' => 'I', 'į' => 'i', 'İ' => 'I', 'ı' => 'i', 'Ĵ' => 'J', 'ĵ' => 'j', 'Ķ' => 'K', 'ķ' => 'k', 'ĸ' => 'k', 'Ĺ' => 'L', 'ĺ' => 'l', 'Ļ' => 'L', 'ļ' => 'l', 'Ľ' => 'L', 'ľ' => 'l', 'Ŀ' => 'L', 'ŀ' => 'l', 'Ł' => 'L', 'ł' => 'l', 'Ñ' => 'N', 'ñ' => 'n', 'Ń' => 'N', 'ń' => 'n', 'Ņ' => 'N', 'ņ' => 'n', 'Ň' => 'N', 'ň' => 'n', 'ʼn' => 'n', 'Ŋ' => 'N', 'ŋ' => 'n', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O', 'Ø' => 'O', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o', 'Ō' => 'O', 'ō' => 'o', 'Ŏ' => 'O', 'ŏ' => 'o', 'Ő' => 'O', 'ő' => 'o', 'Ŕ' => 'R', 'ŕ' => 'r', 'Ŗ' => 'R', 'ŗ' => 'r', 'Ř' => 'R', 'ř' => 'r', 'Ś' => 'S', 'ś' => 's', 'Ŝ' => 'S', 'ŝ' => 's', 'Ş' => 'S', 'ş' => 's', 'Š' => 'S', 'š' => 's', 'ſ' => 's', 'Ţ' => 'T', 'ţ' => 't', 'Ť' => 'T', 'ť' => 't', 'Ŧ' => 'T', 'ŧ' => 't', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u', 'Ũ' => 'U', 'ũ' => 'u', 'Ū' => 'U', 'ū' => 'u', 'Ŭ' => 'U', 'ŭ' => 'u', 'Ů' => 'U', 'ů' => 'u', 'Ű' => 'U', 'ű' => 'u', 'Ų' => 'U', 'ų' => 'u', 'Ŵ' => 'W', 'ŵ' => 'w', 'Ý' => 'Y', 'ý' => 'y', 'ÿ' => 'y', 'Ŷ' => 'Y', 'ŷ' => 'y', 'Ÿ' => 'Y', 'Ź' => 'Z', 'ź' => 'z', 'Ż' => 'Z', 'ż' => 'z', 'Ž' => 'Z', 'ž' => 'z');
    return strtr($str, $accents);
}

Plus, you save decode/encode in UTF-8 part.

Comments

0

An improved version of remove_accents() function according to last version Wordpress 4.3 formatting is:

function mbstring_binary_safe_encoding( $reset = false ) {
    static $encodings = array();
    static $overloaded = null;

    if ( is_null( $overloaded ) )
        $overloaded = function_exists( 'mb_internal_encoding' ) && ( ini_get( 'mbstring.func_overload' ) & 2 );

    if ( false === $overloaded )
        return;

    if ( ! $reset ) {
        $encoding = mb_internal_encoding();
        array_push( $encodings, $encoding );
        mb_internal_encoding( 'ISO-8859-1' );
    }

    if ( $reset && $encodings ) {
        $encoding = array_pop( $encodings );
        mb_internal_encoding( $encoding );
    }
}

function reset_mbstring_encoding() {
    mbstring_binary_safe_encoding( true );
}

function seems_utf8( $str ) {
    mbstring_binary_safe_encoding();
    $length = strlen($str);
    reset_mbstring_encoding();
    for ($i=0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) $n = 0; // 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) $n=1; // 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n=2; // 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n=3; // 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n=4; // 111110bb
        elseif (($c & 0xFE) == 0xFC) $n=5; // 1111110b
        else return false; // Does not match any model
        for ($j=0; $j<$n; $j++) { // n bytes matching 10bbbbbb follow ?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}

function remove_accents( $string ) {
    if ( !preg_match('/[\x80-\xff]/', $string) )
        return $string;

    if (seems_utf8($string)) {
        $chars = array(
            // Decompositions for Latin-1 Supplement
            chr(194).chr(170) => 'a', chr(194).chr(186) => 'o',
            chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
            chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
            chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
            chr(195).chr(134) => 'AE',chr(195).chr(135) => 'C',
            chr(195).chr(136) => 'E', chr(195).chr(137) => 'E',
            chr(195).chr(138) => 'E', chr(195).chr(139) => 'E',
            chr(195).chr(140) => 'I', chr(195).chr(141) => 'I',
            chr(195).chr(142) => 'I', chr(195).chr(143) => 'I',
            chr(195).chr(144) => 'D', chr(195).chr(145) => 'N',
            chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
            chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
            chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
            chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
            chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
            chr(195).chr(158) => 'TH',chr(195).chr(159) => 's',
            chr(195).chr(160) => 'a', chr(195).chr(161) => 'a',
            chr(195).chr(162) => 'a', chr(195).chr(163) => 'a',
            chr(195).chr(164) => 'a', chr(195).chr(165) => 'a',
            chr(195).chr(166) => 'ae',chr(195).chr(167) => 'c',
            chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
            chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
            chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
            chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
            chr(195).chr(176) => 'd', chr(195).chr(177) => 'n',
            chr(195).chr(178) => 'o', chr(195).chr(179) => 'o',
            chr(195).chr(180) => 'o', chr(195).chr(181) => 'o',
            chr(195).chr(182) => 'o', chr(195).chr(184) => 'o',
            chr(195).chr(185) => 'u', chr(195).chr(186) => 'u',
            chr(195).chr(187) => 'u', chr(195).chr(188) => 'u',
            chr(195).chr(189) => 'y', chr(195).chr(190) => 'th',
            chr(195).chr(191) => 'y', chr(195).chr(152) => 'O',
            // Decompositions for Latin Extended-A
            chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
            chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
            chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
            chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
            chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
            chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
            chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
            chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
            chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
            chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
            chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
            chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
            chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
            chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
            chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
            chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
            chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
            chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
            chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
            chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
            chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
            chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
            chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
            chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
            chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
            chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
            chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
            chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
            chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
            chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
            chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
            chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
            chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
            chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
            chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
            chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
            chr(197).chr(136) => 'n', chr(197).chr(137) => 'N',
            chr(197).chr(138) => 'n', chr(197).chr(139) => 'N',
            chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
            chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
            chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
            chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
            chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
            chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
            chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
            chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
            chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
            chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
            chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
            chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
            chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
            chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
            chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
            chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
            chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
            chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
            chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
            chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
            chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
            chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
            chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
            chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
            chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
            chr(197).chr(190) => 'z', chr(197).chr(191) => 's',
            // Decompositions for Latin Extended-B
            chr(200).chr(152) => 'S', chr(200).chr(153) => 's',
            chr(200).chr(154) => 'T', chr(200).chr(155) => 't',
            // Euro Sign
            chr(226).chr(130).chr(172) => 'E',
            // GBP (Pound) Sign
            chr(194).chr(163) => '',
            // Vowels with diacritic (Vietnamese)
            // unmarked
            chr(198).chr(160) => 'O', chr(198).chr(161) => 'o',
            chr(198).chr(175) => 'U', chr(198).chr(176) => 'u',
            // grave accent
            chr(225).chr(186).chr(166) => 'A', chr(225).chr(186).chr(167) => 'a',
            chr(225).chr(186).chr(176) => 'A', chr(225).chr(186).chr(177) => 'a',
            chr(225).chr(187).chr(128) => 'E', chr(225).chr(187).chr(129) => 'e',
            chr(225).chr(187).chr(146) => 'O', chr(225).chr(187).chr(147) => 'o',
            chr(225).chr(187).chr(156) => 'O', chr(225).chr(187).chr(157) => 'o',
            chr(225).chr(187).chr(170) => 'U', chr(225).chr(187).chr(171) => 'u',
            chr(225).chr(187).chr(178) => 'Y', chr(225).chr(187).chr(179) => 'y',
            // hook
            chr(225).chr(186).chr(162) => 'A', chr(225).chr(186).chr(163) => 'a',
            chr(225).chr(186).chr(168) => 'A', chr(225).chr(186).chr(169) => 'a',
            chr(225).chr(186).chr(178) => 'A', chr(225).chr(186).chr(179) => 'a',
            chr(225).chr(186).chr(186) => 'E', chr(225).chr(186).chr(187) => 'e',
            chr(225).chr(187).chr(130) => 'E', chr(225).chr(187).chr(131) => 'e',
            chr(225).chr(187).chr(136) => 'I', chr(225).chr(187).chr(137) => 'i',
            chr(225).chr(187).chr(142) => 'O', chr(225).chr(187).chr(143) => 'o',
            chr(225).chr(187).chr(148) => 'O', chr(225).chr(187).chr(149) => 'o',
            chr(225).chr(187).chr(158) => 'O', chr(225).chr(187).chr(159) => 'o',
            chr(225).chr(187).chr(166) => 'U', chr(225).chr(187).chr(167) => 'u',
            chr(225).chr(187).chr(172) => 'U', chr(225).chr(187).chr(173) => 'u',
            chr(225).chr(187).chr(182) => 'Y', chr(225).chr(187).chr(183) => 'y',
            // tilde
            chr(225).chr(186).chr(170) => 'A', chr(225).chr(186).chr(171) => 'a',
            chr(225).chr(186).chr(180) => 'A', chr(225).chr(186).chr(181) => 'a',
            chr(225).chr(186).chr(188) => 'E', chr(225).chr(186).chr(189) => 'e',
            chr(225).chr(187).chr(132) => 'E', chr(225).chr(187).chr(133) => 'e',
            chr(225).chr(187).chr(150) => 'O', chr(225).chr(187).chr(151) => 'o',
            chr(225).chr(187).chr(160) => 'O', chr(225).chr(187).chr(161) => 'o',
            chr(225).chr(187).chr(174) => 'U', chr(225).chr(187).chr(175) => 'u',
            chr(225).chr(187).chr(184) => 'Y', chr(225).chr(187).chr(185) => 'y',
            // acute accent
            chr(225).chr(186).chr(164) => 'A', chr(225).chr(186).chr(165) => 'a',
            chr(225).chr(186).chr(174) => 'A', chr(225).chr(186).chr(175) => 'a',
            chr(225).chr(186).chr(190) => 'E', chr(225).chr(186).chr(191) => 'e',
            chr(225).chr(187).chr(144) => 'O', chr(225).chr(187).chr(145) => 'o',
            chr(225).chr(187).chr(154) => 'O', chr(225).chr(187).chr(155) => 'o',
            chr(225).chr(187).chr(168) => 'U', chr(225).chr(187).chr(169) => 'u',
            // dot below
            chr(225).chr(186).chr(160) => 'A', chr(225).chr(186).chr(161) => 'a',
            chr(225).chr(186).chr(172) => 'A', chr(225).chr(186).chr(173) => 'a',
            chr(225).chr(186).chr(182) => 'A', chr(225).chr(186).chr(183) => 'a',
            chr(225).chr(186).chr(184) => 'E', chr(225).chr(186).chr(185) => 'e',
            chr(225).chr(187).chr(134) => 'E', chr(225).chr(187).chr(135) => 'e',
            chr(225).chr(187).chr(138) => 'I', chr(225).chr(187).chr(139) => 'i',
            chr(225).chr(187).chr(140) => 'O', chr(225).chr(187).chr(141) => 'o',
            chr(225).chr(187).chr(152) => 'O', chr(225).chr(187).chr(153) => 'o',
            chr(225).chr(187).chr(162) => 'O', chr(225).chr(187).chr(163) => 'o',
            chr(225).chr(187).chr(164) => 'U', chr(225).chr(187).chr(165) => 'u',
            chr(225).chr(187).chr(176) => 'U', chr(225).chr(187).chr(177) => 'u',
            chr(225).chr(187).chr(180) => 'Y', chr(225).chr(187).chr(181) => 'y',
            // Vowels with diacritic (Chinese, Hanyu Pinyin)
            chr(201).chr(145) => 'a',
            // macron
            chr(199).chr(149) => 'U', chr(199).chr(150) => 'u',
            // acute accent
            chr(199).chr(151) => 'U', chr(199).chr(152) => 'u',
            // caron
            chr(199).chr(141) => 'A', chr(199).chr(142) => 'a',
            chr(199).chr(143) => 'I', chr(199).chr(144) => 'i',
            chr(199).chr(145) => 'O', chr(199).chr(146) => 'o',
            chr(199).chr(147) => 'U', chr(199).chr(148) => 'u',
            chr(199).chr(153) => 'U', chr(199).chr(154) => 'u',
            // grave accent
            chr(199).chr(155) => 'U', chr(199).chr(156) => 'u',
        );

        $string = strtr($string, $chars);
    } else {
        $chars = array();
        // Assume ISO-8859-1 if not UTF-8
        $chars['in'] = chr(128).chr(131).chr(138).chr(142).chr(154).chr(158)
            .chr(159).chr(162).chr(165).chr(181).chr(192).chr(193).chr(194)
            .chr(195).chr(196).chr(197).chr(199).chr(200).chr(201).chr(202)
            .chr(203).chr(204).chr(205).chr(206).chr(207).chr(209).chr(210)
            .chr(211).chr(212).chr(213).chr(214).chr(216).chr(217).chr(218)
            .chr(219).chr(220).chr(221).chr(224).chr(225).chr(226).chr(227)
            .chr(228).chr(229).chr(231).chr(232).chr(233).chr(234).chr(235)
            .chr(236).chr(237).chr(238).chr(239).chr(241).chr(242).chr(243)
            .chr(244).chr(245).chr(246).chr(248).chr(249).chr(250).chr(251)
            .chr(252).chr(253).chr(255);

        $chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";

        $string = strtr($string, $chars['in'], $chars['out']);
        $double_chars = array();
        $double_chars['in'] = array(chr(140), chr(156), chr(198), chr(208), chr(222), chr(223), chr(230), chr(240), chr(254));
        $double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
        $string = str_replace($double_chars['in'], $double_chars['out'], $string);
    }

    return $string;
}

My answer is an update of @dynamic solution since Romanian or perhaps other language diacritics weren't converted. I wrote the minimum functions and works like a charm.

print_r(remove_accents('Iași, Iași County, Romania'));

Comments

0
<?php
/* 
 * Thanks:
 *   - The idea of extracting accents equiv chars with the help of the HTMLSpecialChars convertion was taking from ICanBoogie Package of 'Olivier Laviale' {@link http://www.weirdog.com/blog/php/supprimer-les-accents-des-caracteres-accentues.html}
*/
function accentCharsModifier($str){
    if(($length=mb_strlen($str,"UTF-8"))<strlen($str)){
        $i=$count=0;
        while($i<$length){
            if(strlen($c=mb_substr($str,$i,1,"UTF-8"))>1){
                $he=htmlentities($c); 
                if(($nC=preg_replace("#&([A-Za-z])(?:acute|cedil|caron|circ|grave|orn|ring|slash|th|tilde|uml);#", "\\1", $he))!=$he ||
                    ($nC=preg_replace("#&([A-Za-z]{2})(?:lig);#", "\\1", $he))!=$he ||
                    ($nC=preg_replace("#&[^;]+;#", "", $he))!=$he){
                    $str=str_replace($c,$nC,$str,$count);if($nC==""){$length=$length-$count;$i--;}
                }
            }
            $i++;
        }
    }
    return $str;
}
echo accentCharsModifier("&éôpkAÈû");
?>

Comments

0

Based on @Mimouni answer I made this function to transliterate Accented strings to Non Accented strings.

/**
 * @param $str Convert string to lowercase and replace special chars to equivalents ou remove its
 * @return string
 */
function _slugify(string $string): string
{
    $str = $string; // for comparisons
    $str = _toUtf8($str); // Force to work with string in UTF-8
    $str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);

    if ($str != htmlentities($string, ENT_QUOTES, 'UTF-8')) { // iconv fails
        $str = _toUtf8($string);
        $str = htmlentities($str, ENT_QUOTES, 'UTF-8');
        $str = preg_replace('#&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);#i', '$1', $str);
        // Need to strip non ASCII chars or any other than a-z, A-Z, 0-9...
        $str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
        $str = preg_replace(array('#[^0-9a-z]#i', '#[ -]+#'), ' ', $str);
        $str = trim($str, ' -');
    }

    // lowercase
    $string = strtolower($str);

    return $string;
}

To convert strings to UTF-8, here I use the Multi Byte String extension. Note that I break string in pieces to avoid trouble with mixed content (I have such situation) and convert word by word.

/**
 * @param $str string String in any encoding
 * @return string
 */
function _toUtf8(string $str_in): ?string
{
    if (!function_exists('mb_detect_encoding')) {
        throw new \Exception('The Multi Byte String extension is absent!');
    }
    $str_out = [];
    $words = explode(" ", $str_in);
    foreach ($words as $word) {
        $current_encoding = mb_detect_encoding($word, 'UTF-8, ASCII, ISO-8859-1');
        $str_out[] = mb_convert_encoding($word, 'UTF-8', $current_encoding);
    }
    return implode(" ", $str_out);
}

Footer Notes: Was the only solution that pass in PHPUnit UnitTests in Windows command Line (locale issues) The @gabo solution should work but unfortunately not for me

Comments

0

Old topics but add what I found working very weel for me. You can customise the transliterator for your needs.

 $transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Upper(); :: NFC;', Transliterator::FORWARD);

return $transliterator->transliterate('çÇæ λώπηξ-- É&é-è_çà=@46/,*')

Output: "CCAE LOPEX-- E&E-E_CA=@46/,*";

Doc: https://www.php.net/manual/en/class.transliterator.php

Comments

-1

One of the tricks I stumbled upon on the web was using htmlentities then stripping the encoded character :

$stripped = preg_replace('`&[^;]+;`','',htmlentities($string));

Not perfect but it does work well in some case.

But, you're writing about creating an URL string, so urlencode and its counterpart urldecode may be better. Or, if you are creating a query string, use this last function : http_build_query.

Comments

-1

WordPress' implementation is definitly the safest for UTF8 strings. For Latin1 strings, a simple strtr does the job, but ensure you're saving your script in LATIN1 format, not UTF-8.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.