Your first question: how can two seemingly identical words, both supposedly in the same encoding (UTF-8), compare as different?
In this case, the encoding is actually not UTF-8 in both cases. The first variable is in "real" UTF-8, while in the second the Greek characters are not UTF-8 at all: the string is plain ASCII, with the non-ASCII (Greek) characters encoded as CERs (Character Entity References).
A web browser, and some overly helpful WYSIWYG editors, will display these strings as identical, but the binary representations of the actual strings (which is what the computer compares) are different. That is why an equality test fails even though the strings look the same to a human viewing them in a browser or editor.
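To make the byte-level difference concrete, here is a minimal sketch comparing the Greek letter alpha in real UTF-8 against the same character written as a CER (the choice of alpha is my example, not necessarily the OP's exact data):

```php
<?php
// "α" as real UTF-8 vs. as an HTML character entity reference.
$utf8   = "\xCE\xB1";   // Greek small alpha in UTF-8: 2 bytes
$entity = "&alpha;";    // same visible character as a CER: 7 ASCII bytes

var_dump($utf8 === $entity);   // bool(false) – the byte sequences differ

echo bin2hex($utf8), "\n";     // ceb1
echo bin2hex($entity), "\n";   // 26616c7068613b
```

A browser renders both as "α", but `strcmp`/`===` see two completely different byte strings.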
I don't think you can rely on mb_detect_encoding to detect the encoding in such cases, because there is no way to tell UTF-8 apart from plain ASCII that uses CERs to represent its non-ASCII characters: the CER string is itself perfectly valid UTF-8.
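You can see the problem directly: in strict mode mb_detect_encoding reports both strings as valid UTF-8, so it gives you no way to tell them apart.

```php
<?php
$utf8   = "\xCE\xB1";   // real UTF-8 alpha
$entity = "&alpha;";    // ASCII bytes with a CER

// Both pass as UTF-8, because ASCII is a strict subset of UTF-8.
var_dump(mb_detect_encoding($utf8,   ['UTF-8'], true));   // string(5) "UTF-8"
var_dump(mb_detect_encoding($entity, ['UTF-8'], true));   // string(5) "UTF-8"
```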
Your second question: how can I fix this problem?
Before you can compare strings that may be encoded in different ways, you need to convert them to a canonical form (see Wikipedia: Canonicalization) so that their binary representations are identical.
Here is how I solved it: I implemented a helper function called utf8_normalize, which converts pretty much any common character representation (in my case: CERs, NERs (numeric entity references), ISO-8859-1, and CP-1252) to canonical UTF-8 before comparing strings. Exactly what you throw in there should be driven by which character representations are common in the environment your software will run in, but as long as you make sure your strings are in canonical form before comparing them, it will work.
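The answer does not show the actual implementation, but a minimal sketch of such a utf8_normalize helper could look like this. The function name matches the one mentioned above; the detection heuristic (treating invalid UTF-8 as Windows-1252, a superset of ISO-8859-1) is my assumption, not the author's exact code:

```php
<?php
// Hypothetical sketch of the utf8_normalize() helper described above.
// Decodes entity references (CER/NER) and converts common legacy
// encodings to UTF-8 so that strings can be compared byte-for-byte.
function utf8_normalize(string $s): string
{
    // If the bytes are not valid UTF-8, assume Windows-1252
    // (a superset of ISO-8859-1) and convert.
    if (!mb_check_encoding($s, 'UTF-8')) {
        $s = mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
    }
    // Decode character and numeric entity references: &alpha;, &#945;, &#x3B1;
    return html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Usage: both representations now compare equal.
var_dump(utf8_normalize("\xCE\xB1") === utf8_normalize("&alpha;")); // bool(true)
```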
As noted in a comment below from the OP (phpheini), there is also PHP's Normalizer class (from the intl extension), which may do a better job of the normalization than such a homegrown function.
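The Normalizer class handles a related but distinct canonicalization problem: the same character can be one code point or a base letter plus a combining mark. A short example of what it does (the "é" example is mine):

```php
<?php
// Requires the intl extension.
// "é" can be a single code point (U+00E9) or "e" + combining acute (U+0301).
$composed   = "\u{00E9}";
$decomposed = "e\u{0301}";

var_dump($composed === $decomposed);   // bool(false) – different byte sequences

// Normalizing both to NFC makes them binary-identical.
var_dump(Normalizer::normalize($composed,   Normalizer::FORM_C)
     === Normalizer::normalize($decomposed, Normalizer::FORM_C)); // bool(true)
```

Note that Normalizer does not decode HTML entities, so for the CER problem above you would still need html_entity_decode first.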