Make sure the string is UTF-8 encoded

In my application, I read a CSV file and display its contents to the user, but there is a problem with the encoding.

I have two CSV files, example1.csv and example2.csv. I opened both in Notepad++, which reports ANSI encoding for example1 and UTF-8 without BOM for example2.

First, I tried the mb_detect_encoding function, but it reports UTF-8 in both cases, which is wrong.

Second, I tried converting the file contents to UTF-8 with utf8_encode. This works for the ANSI file. But for the UTF-8 without BOM file, it looks as if it were encoded back to ANSI: it displays ÃŸ instead of the German ß. The same goes for other special characters.

I want to make sure that the content is always in UTF-8 format before being displayed or processed. So what am I doing wrong?


This is how I use the mb_detect_encoding function:

    $file_content = file_get_contents($_FILES['file']['tmp_name']);
    die(var_dump(mb_detect_encoding($file_content)));

and it prints UTF-8 for both examples.

+6
2 answers

Q: another inconvenient truth

It is impossible to detect the encoding of unknown text with 100% accuracy and/or certainty.

In practice there will be cases across the whole spectrum of possible outcomes: you can be almost certain that multilingual text in UTF-8 will be correctly detected as such, while it is impossible to tell which of the ISO-8859 family of encodings some text corresponds to. Unless you are willing to do statistical analysis, you cannot even make an educated guess!

What do we need to work with?

With that in mind, let's see what you can do. First of all, unless you bring specialized tools to bear, you are limited to what mb_detect_encoding can do for you. Unfortunately, that is not much. The documentation for its sister function mb_detect_order says:

mbstring currently implements the following encoding detection filters. If there is an invalid byte sequence for the following encodings, encoding detection will fail.

UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP.

For ISO-8859-*, mbstring always detects as ISO-8859-*.

For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always.

So, discounting the Japanese encodings, you can mainly distinguish between UTF-8, UTF-7 and ASCII. You cannot detect ISO-8859-X, because any text will be "recognized" as any of these encodings if you include them in the candidate list (i.e. you will have a 100% false positive rate, which is not good), and the group that includes UTF-16 is simply not supported.

Unfortunately, the bad news doesn't end there. The order of the encodings also matters! Since text encoded in UTF-7 or ASCII is also valid UTF-8, placing UTF-8 at the top of the candidate list guarantees that UTF-8 is the only result you will ever get, and is therefore to be avoided at all costs.
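The effect of the candidate order can be reproduced with a short sketch. It passes the candidate list directly as the second argument instead of relying on mb_detect_order, and the variable names are made up for illustration. The result noted in the comments is what the reasoning above predicts; exact behavior may differ between PHP versions.

```php
<?php
// Plain ASCII text is valid in both encodings, so the first match wins.
$ascii = 'hello world';

// UTF-8 listed first: the ASCII candidate is never reached.
$utf8First = mb_detect_encoding($ascii, ['UTF-8', 'ASCII'], true);

// ASCII listed first: pure ASCII text is reported as ASCII.
$asciiFirst = mb_detect_encoding($ascii, ['ASCII', 'UTF-8'], true);

var_dump($utf8First, $asciiFirst);
```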

Since the default detection order depends on a php.ini setting (mbstring.detect_order), you should definitely not rely on it; switch to a known state by setting your own detection order:

 mb_detect_order('ASCII, UTF-8'); // I left UTF-7 out, but who cares? 

So, you can at least tell whether your text is ASCII or UTF-8, right? Oh no. Not unless you explicitly specify that when you say "UTF-8" you really mean it:

    $valid_utf8   = "\xC2\xA2";
    $invalid_utf8 = "\xC2\x00";

    mb_detect_order('UTF-8');
    echo mb_detect_encoding($valid_utf8);   // "UTF-8": correct
    echo mb_detect_encoding($invalid_utf8); // "UTF-8": WTF?!?!?!

The problem is that if you don't pass true for the $strict parameter, detecting UTF-8 will be ... a little more optimistic.

What can you actually do with this thing?

This is as good as it gets, the correct way to detect encodings (the plural is barely justified here):

    $valid_utf8   = "\xC2\xA2";
    $invalid_utf8 = "\xC2\x00";
    $ascii        = "hello world";

    mb_detect_order('ASCII, UTF-8');
    echo mb_detect_encoding($valid_utf8, mb_detect_order(), true);   // OK: "UTF-8"
    echo mb_detect_encoding($invalid_utf8, mb_detect_order(), true); // OK: false
    echo mb_detect_encoding($ascii, mb_detect_order(), true);        // OK: "ASCII"

What can be done with invalid UTF-8 text?

If you have no out-of-band information about this text, unfortunately nothing.

OK, this is not entirely true. There are several things you can do in practice:

  • See if there is a BOM (byte order mark) at the beginning of the text. There probably won't be one, and strictly speaking you could still mistake text in a single-byte encoding for Unicode, but it's worth a try.
  • See if it looks like UTF-16. If most of the even-numbered bytes (the 2nd, 4th, ...) have the same value, you are most likely looking at UTF-16 LE. If this happens for most odd-numbered bytes, you are probably looking at UTF-16 BE. Unfortunately, in both cases you can never be sure.
  • Assume the text is in some ISO-8859-X encoding and perform a statistical analysis, based on known properties of the scripts that encoding covers, to see if the result is close to what would be expected. If it is close enough for some encodings in this class and not for others, you can make an educated guess.
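The first two checks in the list can be sketched roughly as follows. The function names sniff_bom and guess_utf16 and the 0.7 threshold are hypothetical choices made for this illustration, not part of mbstring, and the UTF-16 check is only a heuristic that can misfire on short or unusual input.

```php
<?php
// Hypothetical helper: returns the encoding named by a leading BOM, or null.
function sniff_bom(string $bytes): ?string
{
    // Longer signatures first: the UTF-32LE BOM starts with the UTF-16LE BOM.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

// Hypothetical heuristic: if one half of each byte pair is dominated by a
// single value, guess UTF-16. The threshold is an arbitrary assumption.
function guess_utf16(string $bytes, float $threshold = 0.7): ?string
{
    $length = strlen($bytes);
    if ($length < 2) {
        return null;
    }
    $first = '';
    $second = '';
    for ($i = 0; $i + 1 < $length; $i += 2) {
        $first  .= $bytes[$i];     // 1st, 3rd, 5th ... byte
        $second .= $bytes[$i + 1]; // 2nd, 4th, 6th ... byte
    }
    $pairs = strlen($first);
    // Share of the most frequent byte value in each half.
    $firstShare  = max(count_chars($first, 1)) / $pairs;
    $secondShare = max(count_chars($second, 1)) / $pairs;

    if ($secondShare >= $threshold && $secondShare > $firstShare) {
        return 'UTF-16LE'; // even-numbered bytes mostly identical
    }
    if ($firstShare >= $threshold && $firstShare > $secondShare) {
        return 'UTF-16BE'; // odd-numbered bytes mostly identical
    }
    return null;
}
```

A caller would typically try sniff_bom first and fall back to guess_utf16 only when no BOM is present, treating either result as a hint rather than a fact.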
+9

To check for UTF-8, do something like this:

    if (mb_check_encoding(file_get_contents($file), 'UTF-8')) {
        // yup, all UTF-8
    }
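Building on that check, here is a minimal sketch of how the "always UTF-8" requirement from the question might be met. The function name ensure_utf8 is made up for this example, and the Windows-1252 fallback is an assumption about what "ANSI" means for the uploaded files; it is not something that can be detected reliably. Converting only when the bytes are not already valid UTF-8 avoids the double encoding that an unconditional utf8_encode call produces (that function is also deprecated as of PHP 8.2).

```php
<?php
// Hypothetical helper: return $bytes unchanged if they are valid UTF-8,
// otherwise convert them from an ASSUMED single-byte fallback encoding.
function ensure_utf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8 (this includes pure ASCII)
    }
    // Not valid UTF-8: treat the input as the fallback encoding.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}
```

The weak point is the fallback: a file in some other single-byte encoding will be converted without error but with the wrong characters.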
-1
