Multibyte strings and weird error

Why does the following code behave differently for different multibyte strings?

    echo preg_replace('@(?=\pL)@u', '*', 'م'); // prints: '*م' ✓
    echo preg_replace('@(?=\pL)@u', '*', 'آ'); // prints: '*آ' ✓
    echo preg_replace('@(?=\pL)@u', '*', 'غ'); // prints: '* * ' ✗
    echo preg_replace('@(?=\pL)@u', '*', 'ء'); // prints: '* * ' ✗

See: http://3v4l.org/fvab1

1 answer

You also need to include modifier letters (Lm). The following script iterates over the entire Arabic Unicode block:

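    <?php
    // Encode a code point in the range U+0080..U+07FF as two-byte UTF-8.
    function uchar_2($dec)
    {
        $utf  = chr(192 + (($dec - ($dec % 64)) / 64)); // lead byte: 110xxxxx
        $utf .= chr(128 + ($dec % 64));                 // continuation byte: 10xxxxxx
        return $utf;
    }

    $issues = 0;
    $count  = 0;

    // Walk the Arabic block, U+0600 (1536) through U+06FF (1791).
    for ($dec = 1536; $dec <= 1791; $dec++) {
        $char = uchar_2($dec);
        // \p{Lm} needs braces: \pLm would be read as \pL followed by a literal "m".
        if (preg_replace('@^(?=\p{Lm})$@u', '*', $char) !== $char) {
            printf("Issue with %s (%s)\n", $dec, $char);
            $issues++;
        }
        $count++;
    }

    printf("Found %d issues in %d rows\n", $issues, $count);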

Without Lm, the script reports an issue for about half of the characters in the block.
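
One detail that is easy to get wrong when adding Lm to a pattern is the property syntax: PCRE reads the braceless form `\pLm` as `\pL` followed by a literal `m`, so the two-letter category has to be written `\p{Lm}`. The snippet below is a minimal sketch of that difference, not part of the original answer. The two sample characters (`$meem`, `$tatweel`) are my own picks for illustration, and the `"\u{...}"` escapes assume PHP 7 or later.

```php
<?php
// Illustrative sample characters (not taken from the original post).
$meem    = "\u{0645}"; // ARABIC LETTER MEEM, general category Lo
$tatweel = "\u{0640}"; // ARABIC TATWEEL, general category Lm

// \pLm is "\pL" followed by a literal "m": any letter, then the ASCII letter m.
var_dump(preg_match('/^\pLm$/u', $meem . 'm')); // int(1)
var_dump(preg_match('/^\pLm$/u', $tatweel));    // int(0)

// \p{Lm} is the modifier-letter category itself.
var_dump(preg_match('/^\p{Lm}$/u', $tatweel));  // int(1)
var_dump(preg_match('/^\p{Lm}$/u', $meem));     // int(0)
```

If a pattern is meant to test for modifier letters specifically, the braced form is the one to use.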
