PHP function mb_detect_encoding strict mode

The mb_detect_encoding function has a parameter for strict mode.

In the first, most upvoted comment there is this example:

<?php
$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false

That is indeed what happens. But can anyone explain why?

+6
3 answers

Everything in this answer is based on my reading of the mbstring source code.

I did not write that code and I have not stepped through it with a debugger; this is only my interpretation.


It seems that the intention was for strict mode to check whether the string as a whole is valid in the encoding, while lax (non-strict) mode would also accept a subsequence that could be part of a valid string. For example, if a string ends with what should be the first byte of a multibyte character, it will not match UTF-8 in strict mode, but it will still match UTF-8 in lax mode.
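A minimal sketch of the truncated-character case described above. The test string is my own choice, not taken from the mbstring source: "foo" followed by 0xC3, which is the lead byte of a two-byte UTF-8 sequence with its continuation byte missing. Only the strict result and the mb_check_encoding() result are annotated; the lax result is left unannotated because it may differ between PHP versions.

```php
<?php
// "foo" followed by 0xC3, the lead byte of a two-byte UTF-8 sequence
// whose continuation byte is missing (illustrative input).
$truncated = "foo\xc3";

// Whole-string validation: the dangling lead byte makes this invalid UTF-8.
$isValid = mb_check_encoding($truncated, 'UTF-8');       // false

// Strict detection rejects the string for the same reason.
$strict = mb_detect_encoding($truncated, 'UTF-8', true); // false

// Lax detection: result intentionally not annotated, since it may vary
// between PHP versions.
$lax = mb_detect_encoding($truncated, 'UTF-8');

var_dump($isValid, $strict, $lax);
```

If the goal is simply "is this string valid UTF-8?", mb_check_encoding() answers that question directly and does not involve a strict flag at all.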

However, there seems to be a bug*: in lax mode, in some cases, only the first byte of the string is checked.

Example:

The byte 0xf8 is not allowed anywhere in UTF-8. When it is placed at the beginning of a string, mb_detect_encoding() correctly returns false for it regardless of which mode is used.

 $str = "\xf8foo";
 var_dump(
     mb_detect_encoding($str, 'UTF-8'),      // bool(false)
     mb_detect_encoding($str, 'UTF-8', true) // bool(false)
 );

But as long as the leading byte is one that can occur in a UTF-8 sequence, lax mode returns UTF-8, even if an invalid byte appears later in the string.

 $str = "foo\xf8";
 var_dump(
     mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
     mb_detect_encoding($str, 'UTF-8', true) // bool(false)
 );

So, although your ISO-8859-1 'áéóú' is not valid UTF-8, its first byte "\xe1" can occur in UTF-8, and mb_detect_encoding() erroneously reports the string as UTF-8.


* I opened a report for this at https://bugs.php.net/bug.php?id=72933

+4

áéóú in ISO-8859-1 is encoded as:

 e1 e9 f3 fa 

If you misinterpret it as UTF-8, you just get four invalid byte sequences. The mbstring (Multibyte String) extension is mostly designed to ignore errors. For example, mb_convert_encoding() will replace these sequences with question marks or whatever you set with mb_substitute_character().
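The byte dump above can be reproduced with a short sketch. The variable names are illustrative, and the string is written with code point escapes (PHP 7 or later) so the result does not depend on how the source file itself is saved.

```php
<?php
// 'áéóú' written with code point escapes, so this literal is UTF-8
// regardless of the encoding of the source file.
$utf8 = "\u{e1}\u{e9}\u{f3}\u{fa}";

// Re-encode as ISO-8859-1: one byte per character.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($latin1), "\n";                        // e1e9f3fa

// The same four bytes are not a valid UTF-8 string ...
var_dump(mb_check_encoding($latin1, 'UTF-8'));      // bool(false)

// ... but they are valid ISO-8859-1.
var_dump(mb_check_encoding($latin1, 'ISO-8859-1')); // bool(true)
```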

My educated guess is that the strict parameter determines what should be done with invalid byte sequences:

  • false means delete them
  • true means keep them

If you ignore these invalid sequences, you obviously discard extremely valuable information, and you only get reasonable results in very limited circumstances, for example:

 $str = chr(81);
 var_dump( mb_detect_encoding($str, ['ISO-8859-1', 'Windows-1252']) );
 var_dump( mb_detect_encoding($str, ['Windows-1252', 'ISO-8859-1']) );

To summarize, mb_detect_encoding() is usually not as useful as you might think, and it is pretty much useless with default options.
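One way to see why candidate order matters for input like the chr(81) example is to check each candidate explicitly. The sketch below uses a hypothetical helper, firstValidEncoding(), which is not part of mbstring; it only wraps mb_check_encoding() and is an illustration, not a description of what mb_detect_encoding() does internally.

```php
<?php
/**
 * Hypothetical helper (not part of mbstring): return the first candidate
 * encoding in which $str is valid, or false if none matches.
 */
function firstValidEncoding(string $str, array $candidates)
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($str, $encoding)) {
            return $encoding;
        }
    }
    return false;
}

$str = chr(81); // "Q", plain ASCII, valid in both candidate encodings

var_dump(firstValidEncoding($str, ['ISO-8859-1', 'Windows-1252'])); // string(10) "ISO-8859-1"
var_dump(firstValidEncoding($str, ['Windows-1252', 'ISO-8859-1'])); // string(12) "Windows-1252"
```

Because the byte is valid in both single-byte encodings, the answer is decided purely by the order of the list, which is the kind of limited usefulness described above.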

+2
source

Because $str is not actually UTF-8 but ISO-8859-1. Without strict mode, the ISO-8859-1 bytes can still be interpreted as UTF-8, but in strict mode only actual, valid UTF-8 is detected as UTF-8.

-2