How to remove accents and turn letters into "simple" ASCII characters?

What is the most efficient way to remove accents from a string, e.g. so that ÈâuÑ becomes Eaun?

Is there a simple, built-in way that I'm missing, or a regular expression that does this?

+42
string php regex ascii
Aug 22 '10 at 18:21
5 answers

If you have iconv installed, try this (the example assumes your input string is in UTF-8):

 echo iconv('UTF-8', 'ASCII//TRANSLIT', $string); 

(iconv is a library for converting between all kinds of encodings; it is efficient and is included in many PHP distributions by default. Above all, it is definitely simpler and less error-prone than trying to roll your own solution. Did you know that there is a "Latin letter N with curl"? Me neither.)
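A minimal sketch of how that call might be wrapped, assuming a glibc-based system where the result of //TRANSLIT depends on the LC_CTYPE locale. The helper name iconv_to_ascii and the locale en_US.UTF-8 are illustrative assumptions, not part of the answer, and the exact output differs between iconv implementations:

```php
<?php
// Hypothetical helper: transliterate UTF-8 text to ASCII with iconv.
// Returns null when iconv cannot perform the conversion.
function iconv_to_ascii(string $text): ?string
{
    // Remember the current LC_CTYPE setting ('0' only queries it).
    $previous = setlocale(LC_CTYPE, '0');

    // Assumption: this locale is installed. In the plain "C" locale,
    // glibc's //TRANSLIT tends to emit "?" for accented letters.
    setlocale(LC_CTYPE, 'en_US.UTF-8');

    // iconv() returns false (and raises a notice or warning) on failure,
    // e.g. for invalid UTF-8 or when //TRANSLIT is unsupported.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    // Restore the caller's locale.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $ascii === false ? null : $ascii;
}

var_dump(iconv_to_ascii('ÈâuÑ'));
```

Returning null instead of an empty string lets the caller tell "conversion failed" apart from "input was empty".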

+49
Aug 22 '10 at 18:27

I found a solution that worked in all my test cases (copied from http://php.net/manual/en/transliterator.transliterate.php ):

 var_dump(transliterator_transliterate(
     'Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
     "A æ Übérmensch på høyeste nivå!    PHP! . fi ¦"
 ));
 // string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "

See also http://www.php.net/normalizer

EDIT: This solution does not depend on the locale set with setlocale(). Another advantage over iconv() is that even non-Latin characters are not ignored.

EDIT2: I found that some characters are not covered by the transliteration I originally posted. Any-Latin translates a Cyrillic character to a character that does not fit into a Latin character set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I added [\u0100-\u7fff] remove to strip all these non-Latin characters. I also added a test to the text ;)

I suppose that by "Latin" they mean the Latin alphabet here, and not one of the Latin character sets. Either way, in my opinion Latin-ASCII should then transliterate it to something in ASCII ...

EDIT3: Sorry for yet another change. I had to remove characters starting from \u0080 instead of \u0100 to get only ASCII characters as output. The test above has been updated.
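The rule chain above can also be used through the object-oriented API, which makes failures explicit. This is only a sketch: the function name translit_to_ascii is hypothetical, and it assumes the intl extension is loaded:

```php
<?php
// Hypothetical wrapper around the intl Transliterator class.
function translit_to_ascii(string $text): string
{
    if (!class_exists('Transliterator')) {
        throw new RuntimeException('The intl extension is required.');
    }

    // Same rules as in the answer: convert any script to Latin, fold Latin
    // to ASCII, then remove whatever is still outside the ASCII range.
    $rules = 'Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove';

    $transliterator = Transliterator::create($rules);
    if ($transliterator === null) {
        throw new RuntimeException(intl_get_error_message());
    }

    $result = $transliterator->transliterate($text);
    if ($result === false) {
        throw new RuntimeException($transliterator->getErrorMessage());
    }

    return $result;
}

if (class_exists('Transliterator')) {
    echo translit_to_ascii('Übérmensch på høyeste nivå!'), PHP_EOL;
    // Ubermensch pa hoyeste niva!
}
```

When many strings are converted, creating the Transliterator object once and reusing it avoids re-parsing the rules on every call.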

+38
Apr 15 '13 at 18:40

Posting this at @palantir's request ...

I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I have found) is ...

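 // Requires mbstring; this file must be saved as UTF-8.
 function toASCII($str) {
     // strtr() maps byte for byte, so both strings go to a single-byte encoding.
     // Windows-1252 replaces utf8_decode() (ISO-8859-1, deprecated since PHP 8.2),
     // because Š, Œ, Ž, š, œ, ž and Ÿ do not exist in ISO-8859-1.
     $from = mb_convert_encoding(
         'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
         'Windows-1252', 'UTF-8');
     $to = 'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy';
     return strtr(mb_convert_encoding($str, 'Windows-1252', 'UTF-8'), $from, $to);
 }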
+22
Jul 28 '11 at 10:51

You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove the non-alphabetic characters:

 preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))

Another way would be to use Normalizer to normalize to Normalization Form KD (NFKD) and then remove the combining mark characters:

 preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD)) 
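
As a rough illustration of the NFKD approach, the sketch below wraps the two steps in a helper and shows its main limitation: letters that have no decomposition, such as ø, æ or ß, pass through unchanged. The name strip_marks is hypothetical, and the intl extension is assumed to be available:

```php
<?php
// Hypothetical helper: decompose to NFKD, then drop non-spacing marks.
function strip_marks(string $text): string
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is required.');
    }

    // NFKD splits "È" into "E" followed by U+0300 (combining grave accent).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
    if ($decomposed === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // \p{Mn} matches non-spacing marks such as the combining accents.
    return (string) preg_replace('/\p{Mn}/u', '', $decomposed);
}

if (class_exists('Normalizer')) {
    echo strip_marks('ÈâuÑ'), PHP_EOL;    // EauN
    echo strip_marks('høyeste'), PHP_EOL; // høyeste (ø has no decomposition)
}
```

Unlike the iconv variant, this keeps case and any non-Latin text intact, so a further step is still needed if the result must be pure ASCII.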
+13
Aug 22 '10 at 18:28

Note: I am reposting this from another, similar question in the hope that it will be useful to others.

I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:

https://github.com/jbroadway/urlify

It handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.

+12
May 01 '12 at 22:47


