Q: another inconvenient truth
It is impossible to detect the encoding of unknown text with 100% accuracy and/or certainty.
In practice, there will be cases across the entire spectrum of possible results: you can be quite sure that multilingual text in UTF-8 will be correctly detected as such, while it is impossible to determine which of the ISO-8859 family of encodings some text corresponds to. And if you don't want to do statistical analysis, it's even impossible to make an educated guess!
What do we have to work with?
With that in mind, let's see what you can do. First of all, unless you bring specialized tools into the fight, you are limited to what mb_detect_encoding can do for you. Unfortunately, that is not much. The documentation for its sister function mb_detect_order says:
mbstring currently implements the following encoding detection filters. If there is an invalid byte sequence for the following encodings, encoding detection will fail.
UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP.
For ISO-8859-X, mbstring always detects as ISO-8859-X.
For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will always fail.
So, discounting the Japanese encodings, you can mainly distinguish between UTF-8, UTF-7 and ASCII. You cannot detect ISO-8859-X, because any text will be "recognized" as whichever of these encodings you include in the list (i.e. you will have a 100% false positive rate, which is not good), and the group that includes UTF-16 is simply not supported.
Unfortunately, the bad news doesn't end there. The detection order also matters! Since text encoded in UTF-7 or ASCII is also valid UTF-8, placing UTF-8 at the top of the candidate list ensures that UTF-8 is the only result you are ever going to get, so doing that is to be avoided at all costs.
Since the default detection order depends on a php.ini setting, you should definitely not rely on it; instead, switch to a known state by setting your own detection order:
mb_detect_order('ASCII, UTF-8');
So, you can at least tell whether your text is ASCII or UTF-8, right? Oh no. Not unless you specifically request that when you say "UTF-8" you really mean it:
$valid_utf8 = "\xC2\xA2";
$invalid_utf8 = "\xC2\x00";
mb_detect_order('UTF-8');
echo mb_detect_encoding($valid_utf8); // "UTF-8": correct
echo mb_detect_encoding($invalid_utf8); // "UTF-8": WTF?!?!?!
The problem is that if you don't pass true for the $strict parameter, detecting UTF-8 will be ... a little more optimistic.
What can you actually do with this thing?
Here is as good as it gets, the correct way to detect encodings (the plural hardly applies here):
$valid_utf8 = "\xC2\xA2";
$invalid_utf8 = "\xC2\x00";
$ascii = "hello world";
mb_detect_order('ASCII, UTF-8');
echo mb_detect_encoding($valid_utf8, mb_detect_order(), true); // OK: "UTF-8"
echo mb_detect_encoding($invalid_utf8, mb_detect_order(), true); // OK: false
echo mb_detect_encoding($ascii, mb_detect_order(), true); // OK: "ASCII"
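If all you need is the three-way answer "ASCII, valid UTF-8, or neither", one way to sidestep the detection order entirely is to validate instead of detect. The following is only a sketch of that idea: it swaps mb_detect_encoding for mb_check_encoding plus a byte-range test, and classify_encoding is a hypothetical helper name, not something mbstring provides.

```php
<?php
// Hypothetical helper (not part of mbstring): classifies a byte string as
// 'ASCII', 'UTF-8', or null when it is neither.
function classify_encoding(string $bytes): ?string
{
    // Pure ASCII means no byte has its high bit set.
    // Without the /u modifier, PCRE matches individual bytes.
    if (preg_match('/[\x80-\xFF]/', $bytes) === 0) {
        return 'ASCII';
    }

    // mb_check_encoding validates the whole byte sequence against UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }

    // High bytes present, but not well-formed UTF-8.
    return null;
}
```

With the three sample strings used above, this should classify "\xC2\xA2" as UTF-8, "hello world" as ASCII, and "\xC2\x00" as neither.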
What can be done with invalid UTF-8 text?
If you have no out-of-band information about this text, unfortunately nothing.
OK, this is not entirely true. There are several things you can do in practice:
- See if there is a BOM (byte order mark) at the beginning of the text. There probably won't be one, and even if there is, you might in principle be mistaking text in a single-byte encoding for Unicode, but it's worth a shot.
- See if it looks like UTF-16. Counting bytes from one, if most of the even-numbered bytes have the same value, then you are most likely looking at UTF-16 LE, because text in a single script tends to share the same high byte in every 16-bit unit. If this happens for most of the odd-numbered bytes, you are probably looking at UTF-16 BE. Unfortunately, in both cases you can never be sure.
- Assume the text is in ISO-8859-X and perform a statistical analysis, based on known properties of the script that corresponds to that encoding, to see if the result is close to what is expected. If it is close enough for some encoding in this class and not for the others, you can make an educated guess.
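The first two checks in the list above can be sketched in plain PHP. This is an illustration under assumptions rather than a robust detector: detect_bom and guess_utf16 are hypothetical helper names, and the 0.8 threshold for "most bytes" is an arbitrary choice that would need tuning on real data.

```php
<?php
// Hypothetical helper: returns the encoding named by a leading BOM, or null.
function detect_bom(string $bytes): ?string
{
    // Longer signatures come first, because the UTF-32 LE BOM begins
    // with the same two bytes as the UTF-16 LE BOM.
    $signatures = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($signatures as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

// Hypothetical helper: guesses UTF-16 byte order from how uniform the
// first and second byte of each 16-bit unit are. Returns null if unsure.
function guess_utf16(string $bytes, float $threshold = 0.8): ?string
{
    $length = strlen($bytes);
    if ($length < 4 || $length % 2 !== 0) {
        return null; // too short, or not a whole number of 16-bit units
    }

    $first = [];  // value counts for the 1st, 3rd, 5th, ... byte
    $second = []; // value counts for the 2nd, 4th, 6th, ... byte
    for ($i = 0; $i < $length; $i += 2) {
        $a = ord($bytes[$i]);
        $b = ord($bytes[$i + 1]);
        $first[$a] = ($first[$a] ?? 0) + 1;
        $second[$b] = ($second[$b] ?? 0) + 1;
    }

    // Share of units whose byte equals the most common value in that position.
    $units = $length / 2;
    $firstShare = max($first) / $units;
    $secondShare = max($second) / $units;

    if ($secondShare >= $threshold && $secondShare > $firstShare) {
        return 'UTF-16LE'; // high byte comes second in each unit
    }
    if ($firstShare >= $threshold && $firstShare > $secondShare) {
        return 'UTF-16BE'; // high byte comes first in each unit
    }
    return null;
}
```

Both functions only produce hints. A BOM match can still be a coincidence in a single-byte encoding, and the UTF-16 guess fails on short strings or on text that mixes scripts heavily.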