UTF-8 encoding: is there a solution that works for everyone?

I have searched the web, Stack Overflow, the PHP documentation, and more.

It seems ridiculous that such a common problem has no standard solution. If you receive text in an unknown character set and it contains strange characters (such as curly quotation marks), is there a standard way to convert it to UTF-8?

I have seen lots of messy solutions chaining together many functions and checks, and none of them seemed guaranteed to work.

Has anyone come up with their own function or solution that always works?


EDIT

Many people have replied "this is not solvable" or something along those lines. I understand that now, but no one has offered a solution that works besides utf8_encode, which is very limited. What methods exist for handling this, and which is the best one?

+4
4 answers

The reason you have seen so many complex solutions to this problem is that, by definition, it is not solvable. The mapping from a byte stream back to text is not unique: different combinations of text and encoding can produce the same stream of bytes. Therefore, strictly logically, it is impossible to determine the encoding, character set, and text from the byte stream alone.

In fact, it is possible to achieve results that are quite "close" using heuristic methods, since there is a finite set of encodings that you will encounter in the wild, and with a sufficiently large sample a program can determine the encoding with high probability. Whether the results are good enough depends on the application.
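As a rough illustration of such a heuristic, here is a sketch using PHP's mbstring extension; the candidate list and its order are my own assumptions, not part of the original answer:

```php
<?php
// mb_detect_encoding() tries each candidate encoding in order and returns
// the first one the byte stream is valid in -- a heuristic, not a guarantee.
$bytes = "\xC3\x84"; // could be UTF-8 "Ä", Windows-1252 "Ã„", EBCDIC "Cd", ...

$guess = mb_detect_encoding($bytes, ['UTF-8', 'Windows-1252', 'ISO-8859-1'], true);
echo $guess; // "UTF-8" -- the bytes are valid UTF-8 and UTF-8 is listed first
```

Note that the order of the candidate list decides ties: almost any byte stream is "valid" ISO-8859-1, so putting it first would make the heuristic useless.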

I also want to comment on the question of user-submitted data. All data posted from a web page has a known encoding (POST data arrives in the encoding the developer declared for that page). If the user pastes text into a form field, the browser interprets it based on the encoding of the source data (as known to the operating system) and the encoding of the page, and transcodes it if necessary. It is too late to detect the encoding on the server, because the browser may already have changed the byte stream to match the declared encoding.

For example, if I type the letter Ä on my German keyboard and submit it from a UTF-8 encoded page, two bytes (xC3 x84) are sent to the server. That is a valid EBCDIC string representing the letters C and d. It is also a valid ANSI string representing the two characters Ã and „. However, no matter what I do, I cannot paste an ANSI-encoded string into a browser form and expect those raw bytes to reach the server: because the operating system knows I am pasting ANSI text (I copied it from Textpad, where I had saved the file with ANSI encoding), it transcodes it to UTF-8, and the resulting byte stream is xC3 x83 xE2 x80 x9E.
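That transcoding step can be reproduced directly. A sketch, where 'Windows-1252' stands in for what the answer calls "ANSI":

```php
<?php
$utf8 = "\xC3\x84"; // the letter Ä encoded as UTF-8

// Pretend the two bytes were Windows-1252 ("ANSI"): they read as Ã and „.
// Re-encoding that misreading as UTF-8 yields exactly the byte stream
// described in the answer.
$transcoded = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
echo bin2hex($transcoded); // "c383e2809e", i.e. xC3 x83 xE2 x80 x9E
```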

My point is: if a user manages to deliver garbage, it is most likely because the text was already garbage at the moment it was pasted into the browser form, for example because the client lacked proper character set support. Since character decoding is not deterministic, you cannot expect a trivial method to detect and repair this situation.

Unfortunately, the problem remains for uploaded files. The only truly reliable solution I can see is to show the user a section of the file and ask whether it is displayed correctly, cycling through different encodings until it is.

Or we could develop a heuristic that takes into account how often certain characters occur in different languages. Say I upload a text file containing the two bytes xC3 x84, and there is no other information, just those two bytes. Such a heuristic could notice that the letter Ä is quite common in German text, while the characters Ã and „ together are unusual in any language, and conclude that my file is most likely UTF-8. That is roughly the level of complexity such a heuristic has to handle, and the more statistical and linguistic facts it can draw on, the more reliable its results will be.

+9

No. You always need to know which character set a string is in. Guessing the character set with a sniffing function is unreliable (although in most situations, in the Western world, it comes down to telling ISO-8859-1 and UTF-8 apart).

But why do you have to deal with unknown character sets in the first place? There is no general solution because the general problem should not exist. Every web page and data source can, and should, declare its character set; if one does not, ask the administrator of that resource to add the declaration.
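Declaring the character set is cheap. For a PHP page it is one response header plus a meta tag; a minimal sketch (a config fragment, not a complete application):

```php
<?php
// Tell the client explicitly what encoding the response body uses.
header('Content-Type: text/html; charset=utf-8');
?>
<!DOCTYPE html>
<html>
<head><meta charset="utf-8"></head>
<body>...</body>
</html>
```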

(Not to sound like a smartass, but that really is the only way to handle this.)

+11

Pekka is right about the unreliability, but if you need a solution, are willing to take a chance, and have the mbstring extension available, this snippet should work:

 function forceToUtf8($string) {
     // Reject byte streams that are invalid in the internal encoding.
     if (!mb_check_encoding($string)) {
         return false;
     }
     // Guess the source encoding and transcode the string to UTF-8.
     return mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string));
 }
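Hypothetical usage, under the assumption that the detection guesses right for the given input; passing an explicit candidate list to mb_detect_encoding() (my addition, not part of the answer's snippet) makes the guess less of a coin toss:

```php
<?php
$latin1 = "\xC4"; // "Ä" in ISO-8859-1 -- not valid UTF-8 on its own

// Restrict detection to the encodings we actually expect (strict mode).
$from = mb_detect_encoding($latin1, ['UTF-8', 'ISO-8859-1'], true);
$utf8 = mb_convert_encoding($latin1, 'UTF-8', $from);
echo bin2hex($utf8); // "c384" -- the UTF-8 bytes for Ä
```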
+1

If I'm not mistaken, there is a function called utf8_encode ... it works well, EXCEPT when the string is already UTF-8, in which case it mangles it (the function assumes ISO-8859-1 input).

http://php.net/manual/en/function.utf8-encode.php
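That limitation is easy to demonstrate: utf8_encode() unconditionally treats its input as ISO-8859-1, so feeding it text that is already UTF-8 double-encodes it. (Note that utf8_encode() was deprecated in PHP 8.2.)

```php
<?php
$latin1 = "\xC4";          // Ä in ISO-8859-1
echo bin2hex(utf8_encode($latin1)), "\n";      // "c384" -- correct

$alreadyUtf8 = "\xC3\x84"; // Ä already in UTF-8
echo bin2hex(utf8_encode($alreadyUtf8)), "\n"; // "c383c284" -- double-encoded garbage
```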

0

Source: https://habr.com/ru/post/1312604/

