Your first question: how can two seemingly identical words, both supposedly in the same encoding (UTF-8), compare as different?
In this case, the encoding is actually not UTF-8 in both cases. The first variable is in "real" UTF-8, while in the second the Greek characters are not UTF-8 at all: the string is plain ASCII, with the non-ASCII (Greek) characters encoded as CERs (Character Entity References).
A web browser, and some overly helpful WYSIWYG editors, will display these strings as identical, but the binary representations of the actual strings (which is what the computer compares) are different. That is why an equality test fails even though the strings look the same to a human viewing them in a browser or editor.
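To make the byte-level difference concrete, here is a minimal sketch comparing the Greek letter alpha in real UTF-8 against the same character written as a CER (the choice of alpha is my example, not necessarily the OP's exact data):

```php
<?php
// "α" as real UTF-8 vs. as an HTML character entity reference.
$utf8   = "\xCE\xB1";   // Greek small alpha in UTF-8: 2 bytes
$entity = "&alpha;";    // same visible character as a CER: 7 ASCII bytes

var_dump($utf8 === $entity);   // bool(false) – the byte sequences differ

echo bin2hex($utf8), "\n";     // ceb1
echo bin2hex($entity), "\n";   // 26616c7068613b
```

A browser renders both as "α", but `strcmp`/`===` see two completely different byte strings.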
I don't think you can rely on mb_detect_encoding to detect the encoding in such cases, because there is no way to tell UTF-8 apart from plain ASCII that uses CERs to represent its non-ASCII characters: the CER string is itself perfectly valid UTF-8.
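You can see the problem directly: in strict mode mb_detect_encoding reports both strings as valid UTF-8, so it gives you no way to tell them apart.

```php
<?php
$utf8   = "\xCE\xB1";   // real UTF-8 alpha
$entity = "&alpha;";    // ASCII bytes with a CER

// Both pass as UTF-8, because ASCII is a strict subset of UTF-8.
var_dump(mb_detect_encoding($utf8,   ['UTF-8'], true));   // string(5) "UTF-8"
var_dump(mb_detect_encoding($entity, ['UTF-8'], true));   // string(5) "UTF-8"
```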
Your second question: how can I fix this problem?
Before you can compare strings that may be encoded in different ways, you need to convert them to a canonical form (see Wikipedia: Canonicalization) so that their binary representations are identical.
Here is how I solved it: I implemented a helper function called utf8_normalize, which converts pretty much any common character representation (in my case: CERs, NERs (numeric entity references), ISO-8859-1, and CP-1252) to canonical UTF-8 before comparing strings. Exactly what you throw in there should be driven by which character representations are common in the environment your software will run in, but as long as you make sure your strings are in canonical form before comparing them, it will work.
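The answer does not show the actual implementation, but a minimal sketch of such a utf8_normalize helper could look like this. The function name matches the one mentioned above; the detection heuristic (treating invalid UTF-8 as Windows-1252, a superset of ISO-8859-1) is my assumption, not the author's exact code:

```php
<?php
// Hypothetical sketch of the utf8_normalize() helper described above.
// Decodes entity references (CER/NER) and converts common legacy
// encodings to UTF-8 so that strings can be compared byte-for-byte.
function utf8_normalize(string $s): string
{
    // If the bytes are not valid UTF-8, assume Windows-1252
    // (a superset of ISO-8859-1) and convert.
    if (!mb_check_encoding($s, 'UTF-8')) {
        $s = mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
    }
    // Decode character and numeric entity references: &alpha;, &#945;, &#x3B1;
    return html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Usage: both representations now compare equal.
var_dump(utf8_normalize("\xCE\xB1") === utf8_normalize("&alpha;")); // bool(true)
```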
As noted in a comment below from the OP (phpheini), there is also PHP's Normalizer class (from the intl extension), which may do a better job of the normalization than such a homegrown function.
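The Normalizer class handles a related but distinct canonicalization problem: the same character can be one code point or a base letter plus a combining mark. A short example of what it does (the "é" example is mine):

```php
<?php
// Requires the intl extension.
// "é" can be a single code point (U+00E9) or "e" + combining acute (U+0301).
$composed   = "\u{00E9}";
$decomposed = "e\u{0301}";

var_dump($composed === $decomposed);   // bool(false) – different byte sequences

// Normalizing both to NFC makes them binary-identical.
var_dump(Normalizer::normalize($composed,   Normalizer::FORM_C)
     === Normalizer::normalize($decomposed, Normalizer::FORM_C)); // bool(true)
```

Note that Normalizer does not decode HTML entities, so for the CER problem above you would still need html_entity_decode first.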