Are UTF-8 uppercase characters the same number of bytes as their lowercase counterparts?

Obviously, this is true for the basic Latin alphabet. But I am asking in a conceptual sense, across languages and in the Unicode specification.

In practice, this came up when comparing two strings. If you already know that they do not have the same number of bytes, can you assume, in every language, that this is enough of a guarantee that they are not "cased" versions of the same string?

+4
case-insensitive unicode utf-8
2 answers

No.

Consider U+0069 "i", which is the single octet 69 in UTF-8. Its capitalized form in the Turkish locale, U+0130 "İ", is encoded as the UTF-8 sequence C4 B0.

Obligatory note: case mapping is locale-sensitive.
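
As a rough illustration of the point, here is a minimal Python sketch that prints the UTF-8 byte counts for a few lowercase/uppercase pairs. The helper `utf8_len` and the `PAIRS` list are names made up for this example. The first pair is spelled out by hand because Python's `str.upper()` is not locale-aware and maps "i" to "I", not to "İ".

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# (lowercase, uppercase) pairs whose UTF-8 lengths differ.
PAIRS = [
    ("i", "\u0130"),       # i / İ: a pair only under Turkish or Azerbaijani casing
    ("\u0131", "I"),       # ı (dotless i) / I
    ("\u017f", "S"),       # ſ (long s) / S
    ("\u2c65", "\u023a"),  # ⱥ / Ⱥ
]

for lower, upper in PAIRS:
    print(f"{lower!r}: {utf8_len(lower)} byte(s)   {upper!r}: {utf8_len(upper)} byte(s)")

# The byte sequence quoted in the answer:
print("\u0130".encode("utf-8").hex(" "))  # c4 b0
```

Note that the difference goes in both directions: sometimes the uppercase form is longer, sometimes the lowercase one.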

+7

There is no principle or invariant in the Unicode standard that guarantees this. I would be particularly concerned about accented capitals, where a precomposed character may exist for one case but not for the other. However, I cannot give you an example of a problem offhand.
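
For what it is worth, one case that seems to fit this concern is U+01F0 "ǰ" (small j with caron), which has no precomposed capital, so its uppercase form is "J" followed by a combining caron. The Python sketch below shows that pair and also a precomposed/decomposed "é"/"É" pair. In both, the byte counts differ even though the strings match under a caseless comparison. The helpers `utf8_len` and `same_ignoring_case` are hypothetical names for this example, and the comparison is only an approximation of Unicode canonical caseless matching (NFD, case fold, NFD again), not a locale-aware one.

```python
import unicodedata


def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after NFD normalization and full case folding."""
    def key(s: str) -> str:
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)
    return key(a) == key(b)


# Lowercase is precomposed; the uppercase form has to be spelled with a combining mark.
lower = "\u01f0"       # ǰ, 2 bytes in UTF-8
upper = lower.upper()  # "J" + U+030C COMBINING CARON, 3 bytes in UTF-8
print(utf8_len(lower), utf8_len(upper), same_ignoring_case(lower, upper))  # 2 3 True

# Same letter, different case, different composition form.
composed = "\u00e9"      # é as a single code point, 2 bytes
decomposed = "E\u0301"   # E + COMBINING ACUTE ACCENT, 3 bytes
print(utf8_len(composed), utf8_len(decomposed), same_ignoring_case(composed, decomposed))  # 2 3 True
```

So a difference in byte length does not, by itself, rule out that two strings are cased versions of each other; the comparison has to be done on normalized, case-folded text.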

+5
