UTF-8 case folding without language knowledge

Question


I am trying to evaluate different strategies for case-insensitive comparison of UTF-8 strings.

I read some material from the Unicode Consortium, experimented with ICU, and looked at various alternative implementations to judge their quality.

On several occasions, I saw that the texts distinguish between Simple Case Mapping and Full Case Mapping, and I wanted to make sure that I fully understood the difference.

As I read, Simple Case Mapping is “context-free”, that is, you don’t need to know which language this payload is in. This will give approximate results due to the Turkic "I / ı / İ / i" fiasco.

Full Case Mapping, on the other hand, requires you to know the language of the payload in order to perform the mapping. With this additional information, it can take special measures to cover cases where “Kim” as a Turkic string should become “KİM” in upper case, but “Kim” as an English string should become “KIM” in upper case.
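To make the “Kim” example concrete, here is a minimal C sketch of what I mean. `upper_sketch` is a hypothetical helper of my own, not an ICU or Unicode API. It assumes valid UTF-8 input, uppercases only ASCII letters, and, when the `turkic` flag is set, maps `i` to İ (U+0130, UTF-8 bytes C4 B0). Every other byte is copied unchanged, so this is an illustration of the language dependence, not a usable case mapper.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only, not a library API).
 * Uppercases the ASCII letters of the NUL-terminated UTF-8 string `src`
 * into `dst`, which has room for `cap` bytes including the terminator.
 * If `turkic` is nonzero, 'i' becomes U+0130 (bytes 0xC4 0xB0) instead of 'I'.
 * Non-ASCII bytes are copied through unchanged.
 * Returns the number of bytes written, not counting the terminator. */
static size_t upper_sketch(const char *src, char *dst, size_t cap, int turkic)
{
    size_t n = 0;

    if (cap == 0)
        return 0;

    for (; *src != '\0'; ++src) {
        unsigned char c = (unsigned char)*src;

        if (turkic && c == 'i') {
            /* Need two bytes for U+0130 plus one for the terminator. */
            if (n + 2 >= cap)
                break;
            dst[n++] = (char)0xC4;
            dst[n++] = (char)0xB0;
        } else {
            /* Need one byte plus one for the terminator. */
            if (n + 1 >= cap)
                break;
            dst[n++] = (c >= 'a' && c <= 'z') ? (char)(c - 'a' + 'A') : (char)c;
        }
    }

    dst[n] = '\0';
    return n;
}
```

Calling `upper_sketch("Kim", buf, sizeof buf, 0)` gives the English result, while passing `1` as the last argument gives the Turkic one with the dotted capital İ. The point is only that the same input bytes need an extra piece of information (the language) to pick the right output.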

Do I have this right?

Are there other examples of such “ambiguous” code points that map differently for different languages?

Thanks!

UPDATE: One source that describes Simple Case Mapping as language-independent is the ICU documentation. I interpreted this as Unicode truth, but maybe it is just a statement about the implementation?

+5

c case-insensitive utf-8

Kim Gräsman 25 . '09 8:48

2

... "" "ss" , - "ß". "", , .

, ( , , ), , , .

+2

unwind 25 . '09 8:52

Hans Passant · Accepted Answer · 2009-11-25T17:19:21+0000

, " " , . .

, Unicode CaseFolding.txt , . "T", , .
