Strange PHP UTF-8 Behavior

I have the following PHP test code:

    header('Content-type: text/html; charset=utf-8');
    $text = 'Développeur Web';
    var_dump($text);
    $text = preg_replace('#[^\\pL\d]+#u', '-', $text);
    var_dump($text);
    $text = trim($text, '-');
    var_dump($text);
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    var_dump($text);
    $text = strtolower($text);
    var_dump($text);
    $text = preg_replace('#[^-\w]+#', '', $text);
    var_dump($text);

On my local machine, it works as expected:

    string(16) "Développeur Web"
    string(16) "Développeur-Web"
    string(16) "Développeur-Web"
    string(16) "D'eveloppeur-Web"
    string(16) "d'eveloppeur-web"
    string(15) "developpeur-web"

but on my real server it behaves strangely:

    string 'Développeur Web' (length=16)
    string '-pp-' (length=4)
    string 'pp' (length=2)
    string 'pp' (length=2)
    string 'pp' (length=2)
    string 'pp' (length=2)

The local machine runs Windows with PHP 5.2.4, and the live server runs CentOS with PHP 5.2.10, so the two environments are far from identical, which I know is not ideal.

Has anyone experienced something like this and can point me in the right direction? I assume it is some kind of server or PHP configuration issue related to UTF-8 or the locale.

Thank you very much in advance

linux php apache preg-replace utf-8
1 answer

Shouldn't it be

    $text = preg_replace('#[^\pL\d]+#u', '-', $text);

in the first preg_replace call? If you escape the \ , you get a literal \ in your character class. The regular expression [^\\pL\d]+ therefore matches one or more characters that are not a \ , p , L or digit. This explains why "Développeur Web" is reduced to "-pp-": everything up to the first p matches and is replaced with - , and the same goes for everything after the second p .
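As a rough illustration of that explanation, the sketch below builds a pattern whose character class really does contain a literal \ , p , L and \d (the PHP string needs four backslashes for that), and applies it without the u modifier. The variable names are chosen here for illustration only.

```php
<?php
// '\\\\' in a single-quoted PHP string is two backslashes, which the regex
// engine reads as one literal backslash inside the character class.
// The class therefore excludes: backslash, "p", "L" and digits.
$pattern = '#[^\\\\pL\d]+#';

$result = preg_replace($pattern, '-', 'Développeur Web');
var_dump($result); // string(4) "-pp-"
```

Everything before the first p and everything after the second p is collapsed into a single - , which matches the output seen on the live server.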

Perhaps the two machines differ in how the escaped \ is handled.

EDIT after OP comment:

In fact, escaping is not the problem here: in a single-quoted PHP string, '\\p' and '\p' yield the same pattern, so both versions behave identically. The real problem is that the PCRE library used on the server does not support Unicode properties, because it was not compiled with --enable-unicode-properties .
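A minimal sketch of how both points could be checked on a given machine. The helper name pcre_has_unicode_properties() is hypothetical, not a built-in; it assumes that a PCRE build without Unicode property support will fail to match \pL against a non-ASCII letter.

```php
<?php
// Both spellings produce the same pattern: in single-quoted strings "\\"
// collapses to one backslash and "\p" is left untouched.
$escaped   = '#[^\\pL\d]+#u';
$unescaped = '#[^\pL\d]+#u';
var_dump($escaped === $unescaped); // bool(true)

// Hypothetical helper: true only if \pL matches "é" (UTF-8 bytes C3 A9).
// preg_match() returns 1 on a match, 0 on no match and false on error;
// @ hides the warning a build without Unicode properties may emit.
function pcre_has_unicode_properties()
{
    return @preg_match('/^\pL$/u', "\xC3\xA9") === 1;
}

if (pcre_has_unicode_properties()) {
    echo "PCRE supports Unicode properties\n";
} else {
    echo "PCRE lacks Unicode property support\n";
}
```

If the check fails on the server, the options are rebuilding PCRE with that flag or avoiding \p classes in the pattern.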
