PHP function mb_detect_encoding strict mode

Question

The mb_detect_encoding function has a parameter for strict mode.

The first, most upvoted comment on the manual page gives this example:

<?php
$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false

That is indeed what happens. But can anyone explain why?

+6

php character-encoding

vaso123 Aug 24 '16 at 7:39

3 answers

áéóú in ISO-8859-1 is encoded as:

 e1 e9 f3 fa

If you misinterpret these bytes as UTF-8, you get nothing but four invalid byte sequences. The Multi-Byte extension is mostly designed to ignore such errors. For example, mb_convert_encoding() will replace these sequences with question marks, or with whatever you set through mb_substitute_character().
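A minimal sketch of that behaviour, assuming the mbstring extension is loaded. The choice of U+FFFD as the substitute character is only for illustration; the number of substitutes produced is not annotated because it may depend on the PHP version.

```php
<?php
// Build the ISO-8859-1 bytes explicitly so the result does not depend on
// the encoding this source file happens to be saved in.
$latin1 = "\xe1\xe9\xf3\xfa"; // 'áéóú' in ISO-8859-1

echo bin2hex($latin1), "\n"; // e1e9f3fa

// The same bytes are not a well-formed UTF-8 string.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Treat them as UTF-8 anyway: mbstring substitutes instead of failing.
$previous = mb_substitute_character();   // remember the current setting
mb_substitute_character(0xFFFD);         // assumption: mark errors with U+FFFD
$lossy = mb_convert_encoding($latin1, 'UTF-8', 'UTF-8');
mb_substitute_character($previous);      // restore the previous setting

// The output is well-formed UTF-8, but the original bytes are gone.
var_dump(mb_check_encoding($lossy, 'UTF-8')); // bool(true)
var_dump($lossy === $latin1);                 // bool(false)
```

The conversion succeeds silently, which is the sense in which the extension "ignores" errors.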

My educated guess is that strict mode determines what is done with invalid byte sequences:

false means delete them
true means keep them

If you ignore these invalid sequences, you obviously discard extremely valuable information, and you only get reasonable results in very limited circumstances. For example:

 $str = chr(81);
 var_dump( mb_detect_encoding($str, ['ISO-8859-1', 'Windows-1252']) );
 var_dump( mb_detect_encoding($str, ['Windows-1252', 'ISO-8859-1']) );

To summarize, mb_detect_encoding() is usually not as useful as you might expect, and it performs rather poorly with its default options.
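One way to avoid relying on detection heuristics is to validate the whole string against an explicit, ordered list of candidates. The sketch below assumes mbstring is loaded; first_valid_encoding() is a hypothetical helper written for this example, not an mbstring function.

```php
<?php
/**
 * Hypothetical helper (not part of mbstring): return the first candidate
 * encoding in which $bytes is well-formed, or null if none matches.
 */
function first_valid_encoding(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        // mb_check_encoding() validates the entire string.
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

$latin1 = "\xe1\xe9\xf3\xfa";                 // 'áéóú' as ISO-8859-1 bytes
$utf8   = "\xc3\xa1\xc3\xa9\xc3\xb3\xc3\xba"; // 'áéóú' as UTF-8 bytes

var_dump(first_valid_encoding($latin1, ['UTF-8', 'ISO-8859-1'])); // string(10) "ISO-8859-1"
var_dump(first_valid_encoding($utf8, ['UTF-8', 'ISO-8859-1']));   // string(5) "UTF-8"
```

The order of the candidates matters: single-byte encodings such as ISO-8859-1 accept any byte sequence, so they only make sense as a fallback after stricter encodings like UTF-8.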

+2

Álvaro González Aug 24 '16 at 10:54

Because $str is not actually UTF-8 but ISO-8859-1. Without strict comparison, the ISO-8859-1 bytes can be interpreted as if they were UTF-8, but in strict mode only actual UTF-8 matches UTF-8.

-2

Justinas Aug 24 '16 at 7:47

Paul crovella · Accepted Answer · 2016-08-24T11:30:21+0000

Everything in this answer is based on my reading of the source code.

I did not write that code and I have not stepped through it with a debugger; this is only my interpretation.

The intention seems to have been for strict mode to check whether the string as a whole is valid in the encoding, while lax mode accepts any byte sequence that could be part of a valid string. For example, a string that ends with what should be the first byte of a multibyte character will not match in strict mode, but it will still match UTF-8 in lax mode.
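A short sketch of the truncated-string case, assuming mbstring is loaded. The lax result is deliberately left without an expected-output comment, since it has not been consistent across PHP versions.

```php
<?php
// "fo" followed by 0xC3, the first byte of a two-byte UTF-8 sequence,
// with the continuation byte missing: a truncated string.
$truncated = "fo\xc3";
$complete  = "fo\xc3\xb3"; // "foó" as well-formed UTF-8

// Strict detection rejects the truncated string and accepts the complete one.
var_dump(mb_detect_encoding($truncated, 'UTF-8', true)); // bool(false)
var_dump(mb_detect_encoding($complete, 'UTF-8', true));  // string(5) "UTF-8"

// mb_check_encoding() asks the whole-string question directly.
var_dump(mb_check_encoding($truncated, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($complete, 'UTF-8'));  // bool(true)

// Lax mode: run it on your own PHP version to see what it reports.
var_dump(mb_detect_encoding($truncated, 'UTF-8'));
```

If the only question is whether a string is valid in one known encoding, mb_check_encoding() expresses that more directly than mb_detect_encoding() with a single candidate.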

However, there seems to be a bug*: in lax mode, in some cases, only the first byte of the string is checked.

Example:

The byte 0xf8 is not allowed anywhere in UTF-8. When it is placed at the beginning of a string, mb_detect_encoding() correctly returns false for it regardless of which mode is used.

 $str = "\xf8foo";
 var_dump(
     mb_detect_encoding($str, 'UTF-8'),       // bool(false)
     mb_detect_encoding($str, 'UTF-8', true)  // bool(false)
 );

But as long as the leading byte of the string is one that can occur somewhere in a UTF-8 sequence, lax mode returns UTF-8.

 $str = "foo\xf8";
 var_dump(
     mb_detect_encoding($str, 'UTF-8'),       // string(5) "UTF-8"
     mb_detect_encoding($str, 'UTF-8', true)  // bool(false)
 );

So, although your ISO-8859-1 'áéóú' is not valid UTF-8, its first byte "\xe1" can occur in UTF-8, and mb_detect_encoding() erroneously reports the string as UTF-8.

* I opened a report for this at https://bugs.php.net/bug.php?id=72933
