Strange mb_detect_order() behavior in PHP

I would like to determine the encoding of some text using PHP. For this I use the mb_detect_encoding() function.

The problem is that the function returns different results when I change the order of the candidate encodings with the mb_detect_order() function.

Consider the following example.

$html = <<<STR
ちょっとのアクセスで落ちてしまったり、サーバー障害が多いレンタルサーバーを選ぶとあなたのビジネス等にかなりの影響がでてしまう可能性があります。特に商売をされている個人の方、法人の方は気をつけるようにしてください
STR;

mb_detect_order(array('UTF-8', 'EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP', 'ISO-8859-1', 'ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html);
die($originalEncoding); // $originalEncoding = 'UTF-8'

However, if you change the order of the encodings in mb_detect_order(), the result is different:

mb_detect_order(array('EUC-JP', 'UTF-8', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP', 'ISO-8859-1', 'ISO-8859-2'));
$originalEncoding = mb_detect_encoding($html);
die($originalEncoding); // $originalEncoding = 'EUC-JP'



So my questions are:
Why is this happening?
Is there a way in PHP to correctly and unequivocally detect text encoding?

+7
php encoding
4 answers

That is what I would expect.

The detection algorithm probably just tries, in order, the encodings you specified in mb_detect_order(), and returns the first one under which the byte stream is valid.
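The behavior guessed at above can be sketched in a few lines. This is only an illustration of "first valid candidate wins", not the actual mbstring implementation; firstValidEncoding() is a hypothetical helper name, and mb_check_encoding() is used as the validity test.

```php
<?php
// Hypothetical helper: return the first candidate encoding under which
// the byte string is valid, or null if none of them accepts it.
function firstValidEncoding(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

// Pure ASCII is valid in both encodings, so the list order decides.
echo firstValidEncoding('hello', ['UTF-8', 'ISO-8859-1']), "\n"; // UTF-8
echo firstValidEncoding('hello', ['ISO-8859-1', 'UTF-8']), "\n"; // ISO-8859-1

// A lone 0xE9 byte is not well-formed UTF-8, so the second candidate wins.
echo firstValidEncoding("\xE9", ['UTF-8', 'ISO-8859-1']), "\n";  // ISO-8859-1
```

With a rule like this, any input that is valid under several candidates is reported as whichever candidate comes first, which matches the order dependence described in the question.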

Something more intelligent requires statistical methods (I think machine learning is commonly used).

EDIT: see this article for more intelligent methods.

Due to its importance, automatic charset detection is already implemented in major Internet applications such as Mozilla or Internet Explorer. They are very accurate and fast, but each implementation applies a great deal of domain-specific knowledge case by case. Unlike those methods, we aimed for a simple algorithm that can be applied uniformly to every charset, based on well-established, standard machine learning techniques. We also studied the relationship between language detection and charset detection, and compared byte-based and character-based algorithms. We used Naive Bayes (NB) and Support Vector Machine (SVM).

+5

Not really. Different encodings often have large areas of overlap, and if the string you are testing falls inside that overlap, then both encodings are acceptable.

For example, UTF-8 and ISO-8859-1 are the same for the letters a-z. The string "hello" has an identical sequence of bytes in both encodings.

That is why there is an mb_detect_order() function in the first place: it lets you say which encoding you would prefer when these clashes occur. Do you want "hello" to be UTF-8 or ISO-8859-1?
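A quick way to see the overlap is to compare the raw bytes. The sketch below assumes the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// ASCII-only text: converting from UTF-8 to ISO-8859-1 changes nothing.
$ascii = 'hello';
$asLatin1 = mb_convert_encoding($ascii, 'ISO-8859-1', 'UTF-8');
var_dump($ascii === $asLatin1); // bool(true)

// Outside the shared range the byte sequences diverge.
$accented = "caf\xC3\xA9"; // "café" written as UTF-8 bytes
$accentedLatin1 = mb_convert_encoding($accented, 'ISO-8859-1', 'UTF-8');
echo bin2hex($accented), "\n";       // 636166c3a9
echo bin2hex($accentedLatin1), "\n"; // 636166e9
```

Only the second string carries any evidence that could separate the two encodings; for the first one, nothing in the bytes can settle the question.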

+5

Keep in mind that mb_detect_encoding() does not know what encoding the data is in. You may see a string, but the function itself only sees a stream of bytes. From that, it has to guess the encoding: for example, it would be ASCII if the bytes are all in the range 0-127, and UTF-8 if there are ASCII bytes plus bytes of 128 and above that only occur in groups of two or more, and so on.

As you can imagine, given this context, it is quite difficult to detect an encoding reliably.

As rihk said, this is what the mb_detect_order() function is for: you basically supply your best guess as to what the data will be. Do you often work with UTF-8 files? Then your data is unlikely to be UTF-16, even if mb_detect_encoding() might guess it as such.

You can also check Artefacto's link for a more in-depth view.
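The byte-range reasoning above can be written down as a rough heuristic. This is a sketch only: guessFromBytes() is a hypothetical name, it knows about nothing beyond ASCII and UTF-8, and it relies on the fact that PCRE's /u modifier rejects subjects that are not well-formed UTF-8.

```php
<?php
// Hypothetical helper: classify a byte string as 'ASCII', 'UTF-8' or 'unknown'.
function guessFromBytes(string $bytes): string
{
    // No byte above 0x7F: consistent with ASCII
    // (and therefore with every ASCII-compatible encoding).
    if (preg_match('/[^\x00-\x7F]/', $bytes) === 0) {
        return 'ASCII';
    }

    // With /u, preg_match() returns false for malformed UTF-8
    // and 1 for a well-formed subject (the empty pattern always matches).
    if (preg_match('//u', $bytes) === 1) {
        return 'UTF-8';
    }

    return 'unknown';
}

echo guessFromBytes('hello'), "\n";    // ASCII
echo guessFromBytes("\xC3\xA9"), "\n"; // UTF-8
echo guessFromBytes("\xE9"), "\n";     // unknown
```

Note that 'UTF-8' here only means "well-formed as UTF-8"; the same bytes may also be valid in other multi-byte encodings, which is exactly why the guess is unreliable.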
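One way to supply such a best guess is to pass the expected encodings directly to mb_detect_encoding() instead of changing the global detection order. The sketch below is an assumption about how one might wrap that call: detectExpected() and its default list are hypothetical, and the third argument turns on strict checking so that an input valid in none of the candidates yields false rather than a closest match.

```php
<?php
// Hypothetical wrapper: only consider the encodings we actually expect,
// most likely first, and report null when none of them fits.
function detectExpected(string $bytes, array $expected = ['UTF-8', 'ISO-8859-1']): ?string
{
    // Second argument: candidate list for this call only.
    // Third argument: strict mode.
    $result = mb_detect_encoding($bytes, $expected, true);

    return $result === false ? null : $result;
}

// A lone 0xE9 byte cannot be UTF-8, so only ISO-8859-1 remains.
var_dump(detectExpected("\xE9"));

// With UTF-8 as the only candidate, nothing fits.
var_dump(detectExpected("\xE9", ['UTF-8']));
```

How ties between several valid candidates are resolved can differ between PHP versions, so the example deliberately sticks to inputs where only one candidate, or none, is valid.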

Case in point: Internet Explorer makes some interesting encoding guesses if nothing is specified (@link, section: “Automatically detect website language”), which in the past caused strange behavior on sites that took the encoding for granted. You can probably find some amusing stories if you google around. It makes a good demonstration of how even statistical methods can go badly wrong, and of why guessing encodings is problematic in general.

+2

mb_detect_encoding takes the first encoding in your mb_detect_order() list and then walks through your $html input character by character, checking whether each character is valid in that encoding. If every character matches, it returns that encoding; if any character fails, it moves on to the next encoding in mb_detect_order() and tries again.

The Wikipedia list of encodings is a good place to see which characters make up each encoding.

Since these encodings overlap (char x8fA1EF exists in both "UTF-8" and "EUC-JP"), this is treated as a match, even if it is a completely different character in each character set. So unless some value in the input exists in one encoding but not in the other, mb_detect_encoding cannot determine which of the encodings is invalid, and it will return the first encoding in your list that could be valid.

As far as I know, there is no sure-fire way to identify an encoding. PHP's best-guess approach can help if you have a reasonable idea of which encodings you are likely to encounter, and order your list accordingly, based on the gaps (invalid characters) in each encoding. The best solution is to "know" the encoding. If you are scraping your HTML from another page, look for the encoding declaration in that page's header.

If you really want to be smart, you can try to determine the language the HTML is written in, perhaps using trigrams or n-grams or the like, as described in this PHP/ir article.
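The overlap can be checked directly with a short byte sequence. The sketch below assumes mbstring is available and uses the two bytes 0xC3 0xA9 purely as an illustrative input: they form one character in UTF-8 and also fall inside the two-byte range of EUC-JP.

```php
<?php
// Two bytes that are well-formed in more than one encoding.
$bytes = "\xC3\xA9";

var_dump(mb_check_encoding($bytes, 'UTF-8'));  // valid: U+00E9 in UTF-8
var_dump(mb_check_encoding($bytes, 'EUC-JP')); // also accepted as a two-byte EUC-JP character

// Decoding the same bytes as EUC-JP gives a different character,
// so its UTF-8 representation differs from the original bytes.
$decodedAsEucJp = mb_convert_encoding($bytes, 'UTF-8', 'EUC-JP');
var_dump($decodedAsEucJp === $bytes);
```

A validity check alone cannot tell these two readings apart, so the detector has to fall back on the order of the list.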
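Reading the declared encoding could look roughly like this. It is a minimal sketch, not a full HTML parser: declaredCharset() is a hypothetical helper, it assumes you already hold the Content-Type header value (if any) and the page source as strings, and it only handles the common "charset=..." forms.

```php
<?php
// Hypothetical helper: return the charset a page declares, or null.
function declaredCharset(string $html, ?string $contentType = null): ?string
{
    // 1. HTTP header value such as "text/html; charset=EUC-JP".
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m) === 1) {
        return $m[1];
    }

    // 2. <meta charset="..."> or
    //    <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m) === 1) {
        return $m[1];
    }

    return null;
}

var_dump(declaredCharset('<meta charset="EUC-JP">'));
var_dump(declaredCharset('', 'text/html; charset=Shift_JIS'));
var_dump(declaredCharset('<p>no declaration</p>'));
```

The declared value can then be placed first in the candidate list, with detection kept only as a fallback for pages that declare nothing.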
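To give a rough idea of what a trigram comparison involves, here is a small sketch. It does not reproduce the method from the linked article: trigramProfile() and sharedTrigrams() are hypothetical helpers, the text is assumed to be already decoded to UTF-8, and a real detector would compare against profiles built from large reference corpora for each language.

```php
<?php
// Count overlapping three-character sequences in a UTF-8 string.
function trigramProfile(string $text): array
{
    $text = mb_strtolower($text, 'UTF-8');
    $length = mb_strlen($text, 'UTF-8');
    $profile = [];

    for ($i = 0; $i + 3 <= $length; $i++) {
        $gram = mb_substr($text, $i, 3, 'UTF-8');
        $profile[$gram] = ($profile[$gram] ?? 0) + 1;
    }

    return $profile;
}

// Crude similarity score: how many trigram occurrences two profiles share.
function sharedTrigrams(array $sample, array $reference): int
{
    $shared = 0;
    foreach ($sample as $gram => $count) {
        if (isset($reference[$gram])) {
            $shared += min($count, $reference[$gram]);
        }
    }
    return $shared;
}

// "the" and "he " are common to both sample strings.
echo sharedTrigrams(trigramProfile('the cat'), trigramProfile('the dog')), "\n"; // 2
```

Decoding the input under each candidate encoding and keeping the one whose trigram profile best matches a known language is one way such a score could feed back into encoding detection.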

+1