Why does this str_ireplace() work with a non-ASCII string?

Note: I suspect that what I know is wrong, so please correct me :)


I just answered a question about UTF-8 and PHP.

I suggested using str_ireplace('', '', $a).

I did not expect this to work, but it did.

I always thought that PHP treats one byte as one character, so to get accurate results with characters outside the ASCII range you need to use the mb_* functions.

I assumed that Russian characters would take more than 1 byte each.

I thought str_replace() would work, because the bytes can be matched regardless of whether they belong to multi-byte characters or not, as long as they appear in the same order.
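The byte-matching idea can be sketched as follows. This is only an illustration, assuming the subject is UTF-8; the word is written as raw bytes so the example does not depend on the encoding of the source file.

```php
<?php
// Assumption: UTF-8 input. These bytes spell "ВОЛГОГРАД", two bytes per letter.
$subject = hex2bin('d092d09ed09bd093d09ed093d0a0d090d094');

// The needle is the two-byte sequence for "Г" (d0 93). str_replace() compares
// raw bytes, so it finds both occurrences without knowing the encoding.
$replaced = str_replace("\xD0\x93", 'G', $subject, $count);

echo $count, "\n";             // 2
echo bin2hex($replaced), "\n"; // d092d09ed09b47d09e47d0a0d090d094
```

Because the needle is itself a complete UTF-8 sequence, a byte-wise match cannot start in the middle of another character.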

I thought str_ireplace() would not work, because PHP would not know how to map non-ASCII characters to their counterparts in the other case. But it did work.


Where and how am I mistaken? Please give me as much information as you can :)

+7
3 answers

It works by lowercasing the text, passing it through libc functions that depend on the locale settings; with matching settings, the text is case-folded correctly, provided the bytes are in the encoding the locale expects.
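Since this behaviour depends on the locale (and, as of PHP 8.2, str_ireplace() only folds ASCII case regardless of the locale), a replacement that does not depend on the locale can be sketched with PCRE's Unicode mode. The helper name `utf8_ireplace` is hypothetical, not a PHP built-in, and the sketch assumes both strings are valid UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): case-insensitive replace that relies
// on PCRE's Unicode mode (/u) instead of the process locale.
// Assumption: $search and $subject are valid UTF-8.
function utf8_ireplace(string $search, string $replace, string $subject): string
{
    if ($search === '') {
        return $subject; // nothing to search for
    }
    $pattern = '/' . preg_quote($search, '/') . '/iu';
    // A callback is used so that "$1" or "\1" inside $replace is taken literally.
    $result = preg_replace_callback(
        $pattern,
        function (array $match) use ($replace) {
            return $replace;
        },
        $subject
    );
    // preg_replace_callback() returns null on error (e.g. invalid UTF-8).
    return $result ?? $subject;
}

$haystack = hex2bin('d092d09ed09bd093d09ed093d0a0d090d094'); // "ВОЛГОГРАД"
$needle   = hex2bin('d0b2d0bed0bbd0b3d0bed0b3d180d0b0d0b4'); // "волгоград"

echo utf8_ireplace($needle, 'Volgograd', $haystack), "\n"; // Volgograd
```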

+6

Another possible explanation: the Unicode planes have properties similar to those of the ISO-8859-1 range.

Converting a capital letter to lowercase simply requires the addition of 0x20 for the ASCII range:

 0x41 A  ->  0x61 a

And (I didn't look it up) I think this is the same for the Latin-1 range at 0xC0-0xDF. The same coincidence could apply to Russian letters in the Unicode range:

 d092d09ed09bd093d09ed093d0a0d090d094
 d0b2d0bed0bbd0b3d0bed0b3d180d0b0d0b4

The only difference is 0x20 added to the bytes, as if they had been treated as Latin-1 characters. So maybe it just comes down to the locale setting.
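A quick check of the two dumps above, as a sketch that assumes both are UTF-8 with two bytes per letter. For eight of the nine pairs the last byte goes up by 0x20. For d0a0 and d180 the lead byte changes and the last byte goes down by 0x20, while the decoded code point still goes up by 0x20 in every pair. The function name is illustrative only.

```php
<?php
// Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into its code point.
function codePointOfTwoByteSequence(string $bytes): int
{
    return ((ord($bytes[0]) & 0x1F) << 6) | (ord($bytes[1]) & 0x3F);
}

// The two dumps, split into two-byte letters.
$upper = str_split(hex2bin('d092d09ed09bd093d09ed093d0a0d090d094'), 2);
$lower = str_split(hex2bin('d0b2d0bed0bbd0b3d0bed0b3d180d0b0d0b4'), 2);

$byteDiffs = [];
$codePointDiffs = [];
foreach ($upper as $i => $u) {
    $l = $lower[$i];
    $byteDiffs[] = ord($l[1]) - ord($u[1]);
    $codePointDiffs[] = codePointOfTwoByteSequence($l) - codePointOfTwoByteSequence($u);
    printf(
        "%s -> %s  last byte %+d  code point %+d\n",
        bin2hex($u),
        bin2hex($l),
        $byteDiffs[$i],
        $codePointDiffs[$i]
    );
}
```

So adding 0x20 to individual bytes, as a single-byte locale would, does not lowercase every letter here; the regularity holds for code points rather than for bytes.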

+3

It is the other way around: PHP does not treat every character as a byte, it treats every byte as a character. Thus, a multi-byte character is treated as several characters (and maybe not the ones you expect).
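To illustrate the byte-as-character view, here is a small sketch, assuming the letter is UTF-8 encoded:

```php
<?php
// Assumption: UTF-8. "Д" is a single letter stored as two bytes (d0 94).
$letter = "\xD0\x94";

$length   = strlen($letter);    // counts bytes: 2
$first    = $letter[0];         // one byte (d0), not a letter
$parts    = str_split($letter); // two one-byte strings
$reversed = strrev($letter);    // bytes swapped: 94 d0, no longer valid UTF-8

echo $length, "\n";             // 2
echo bin2hex($first), "\n";     // d0
echo count($parts), "\n";       // 2
echo bin2hex($reversed), "\n";  // 94d0
```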

0
