Detect if character is simplified or traditional Chinese character

I found this question, which shows how to check whether a string contains a Chinese character. I'm not sure the Unicode ranges are correct, but they seem to return false for Japanese and Korean and true for Chinese.

What it does not do is tell whether the character is traditional or simplified Chinese. How would you find that out?


Update

Q: How can I find out from a 32-bit Unicode character value if it is a Chinese, Korean, or Japanese character?

http://unicode.org/faq/han_cjk.html

Their argument is that the characters, regardless of their shape, have the same meaning and should therefore be represented by the same code point. Well, the shape is not meaningless to me, because I am analyzing individual characters, which does not work with their solution:

The best approach is to look at the text as a whole: if there is a fair amount of kana, it is probably Japanese, and if there is a fair amount of Hangul, it is probably Korean.
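
For reference, the whole-text heuristic quoted above can be sketched roughly as follows. This is only an illustration in Python: the function names, the 10% threshold, and the choice of code point ranges (the Hiragana and Katakana blocks, Hangul syllables and jamo, and the basic CJK Unified Ideographs block) are my own assumptions, not something the FAQ prescribes.

```python
def count_scripts(text):
    """Count kana, Hangul and Han characters; everything else is ignored."""
    counts = {"kana": 0, "hangul": 0, "han": 0}
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:
            # Hiragana and Katakana blocks
            counts["kana"] += 1
        elif 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
            # Hangul syllables, or Hangul jamo
            counts["hangul"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:
            # CJK Unified Ideographs, basic block only (extensions omitted)
            counts["han"] += 1
    return counts


def guess_language(text, threshold=0.1):
    """Guess the language of a text sample; threshold is an arbitrary assumption."""
    counts = count_scripts(text)
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    if counts["kana"] / total >= threshold:
        return "japanese"
    if counts["hangul"] / total >= threshold:
        return "korean"
    # Only Han characters remain, so Chinese is the best guess
    return "chinese"
```

Note that this says nothing about a single Han character in isolation, which is exactly my problem.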

+7
3 answers

As I think you have found, you cannot. Simplified and traditional are just two styles of writing the same characters. It is like the difference between roman and gothic script for European languages.

+3

As already mentioned, you cannot reliably determine the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does this job, and "Simplified Chinese Unicode table" for a general discussion.
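
To illustrate the idea of deciding from a longer sample, here is a minimal Python sketch that lets unambiguous characters "vote". It is not how the script_detector gem is implemented (I have not reproduced its logic). The two character sets are tiny hand-picked samples for demonstration, and `min_votes` is an arbitrary assumption; a real tool would need far larger tables.

```python
# Tiny illustrative samples, NOT complete tables (assumption for this sketch).
SIMPLIFIED_ONLY = set("韩国这说语汉书们")
TRADITIONAL_ONLY = set("韓國這說語漢書們")


def guess_chinese_script(text, min_votes=2):
    """Guess the script of a text sample by counting unambiguous characters."""
    simp = sum(ch in SIMPLIFIED_ONLY for ch in text)
    trad = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simp + trad < min_votes:
        # Too few unambiguous characters to decide anything
        return "undetermined"
    if simp > trad:
        return "simplified"
    if trad > simp:
        return "traditional"
    return "mixed"
```

With these sample sets, `guess_chinese_script("这是汉语书")` gives "simplified" and `guess_chinese_script("這是漢語書")` gives "traditional", while a single shared character such as 面 stays "undetermined".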

+3

Maybe for some characters. The traditional and simplified character sets overlap, so you basically have three sets of characters:

  • Characters that are traditional only;
  • Characters that are simplified only;
  • Characters that were left untouched and exist in both.

Take the character 面, for example. It belongs to both the second and the third set. As a simplified character, it stands for both 面 and 麵, face and noodles, whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. From that you can deduce that it is a traditional character only.

But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduce that 面 is a simplified character only, you are mistaken.

On the other hand, 韩 has a kTraditionalVariant pointing to 韓, and these two are a "real" simplified/traditional pair. But nothing in the Unihan database distinguishes cases like 韓/韩 from cases like 麵/面.
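
The problem can be shown with a small Python sketch. The two dictionaries below are transcribed by hand from the relationships described in this answer; they are not loaded from the Unihan files, and the actual data may be encoded differently depending on the Unihan release. The function name `naive_classify` is hypothetical.

```python
# Variant links as described in this answer (assumption: hand-copied, not real Unihan input).
SIMPLIFIED_VARIANT = {"麵": {"面"}, "韓": {"韩"}}    # kSimplifiedVariant
TRADITIONAL_VARIANT = {"面": {"麵"}, "韩": {"韓"}}   # kTraditionalVariant


def naive_classify(ch):
    """Classify a character using only the presence of variant links."""
    has_simplified = ch in SIMPLIFIED_VARIANT
    has_traditional = ch in TRADITIONAL_VARIANT
    if has_simplified and not has_traditional:
        return "traditional only"
    if has_traditional and not has_simplified:
        return "simplified only"
    return "both or unknown"


for ch in "麵面韓韩":
    print(ch, naive_classify(ch))
```

Here 面 and 韩 both come out as "simplified only". That is right for 韩 but wrong for 面, which is also the traditional character for "face", and the variant links alone give the code no way to tell the two cases apart.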

0