A UTF-8 string contains mixed Japanese and English characters. How can I determine which characters are Japanese and which are English?

I have a UTF-8 encoded string that contains Japanese and Roman characters. I want to determine which characters are Japanese and which are Roman. How can I do this?

+7
4 answers

You are looking for the Unicode property "Script". I recommend the ICU library.

From: http://icu-project.org/apiref/icu4c/uscript_8h.html

UScriptCode uscript_getScript(UChar32 codepoint, UErrorCode *err)
Gets the script code associated with the given codepoint.

The return value is the script code for that code point. Here are some of the constants it can return:

  • USCRIPT_JAPANESE (Not sure if in this category ...)
  • USCRIPT_HIRAGANA (Japanese kana)
  • USCRIPT_KATAKANA (Japanese kana)
  • USCRIPT_HAN (Japanese Kanji)
  • USCRIPT_LATIN
  • USCRIPT_COMMON (spaces and punctuation marks that are common to all scripts)

ICU is available for Java, C and C++. You will need to decode the UTF-8 string into Unicode code points before you can use this function.
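
Getting code points out of the UTF-8 bytes can be sketched without any library. The following is a minimal sketch, not ICU's own decoder: decodeUtf8 is a hypothetical helper name, it assumes mostly well-formed input, and it does not reject overlong encodings or surrogates. Each resulting code point could then be passed to uscript_getScript.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of ICU): decode UTF-8 bytes into code points.
// Invalid lead bytes, bad continuation bytes and truncated sequences are
// replaced with U+FFFD. Overlong forms and surrogates are NOT rejected.
std::vector<char32_t> decodeUtf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        if (i + extra >= s.size()) {  // sequence runs past the end
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For example, decoding the bytes of "A" followed by the hiragana letter A (E3 81 82) should give the two code points U+0041 and U+3042.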

Alternative: You can also use a Unicode-aware regex, although not all engines support this syntax (Perl does). This pattern will match runs of text that are definitely Japanese, but it will not catch everything.

 /[\p{Katakana}\p{Hiragana}\p{Han}]+/ 

You have to be careful when you parse this kind of text, because Japanese text often includes romaji or numbers. A look at ja.wikipedia.org will quickly confirm this.

+7

You can determine the Unicode general category in Java with Character.getType(). For Japanese characters it will be Lo; for Latin characters, Ll or Lu.

+6

In Unicode, Japanese characters can be Hiragana, Katakana, or ideographs. Each of these sets has a defined start and end code point, so you can write a function that checks whether a character falls within those limits.

 bool isJapanese(wchar_t w) {
     // Hiragana...
     if (w >= 0x3041 && w <= 0x309F) return true;
     // Do the same for the other sets ...
     return false;
 }

Similarly, you can implement the isRoman function ...
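
A fuller version of the range-check idea might look like the sketch below. The block boundaries are the standard Unicode ones, but which blocks to count as "Japanese" or "Roman" is my assumption, and inRange is a hypothetical helper. Note that the ideograph ranges are shared with Chinese, so this only says "could be Japanese".

```cpp
// Hypothetical helper: inclusive range check on a code point.
static bool inRange(char32_t c, char32_t lo, char32_t hi) {
    return c >= lo && c <= hi;
}

// Assumption: "Japanese" means kana plus the main CJK ideograph blocks.
bool isJapanese(char32_t c) {
    return inRange(c, 0x3040, 0x309F)    // Hiragana
        || inRange(c, 0x30A0, 0x30FF)    // Katakana
        || inRange(c, 0x31F0, 0x31FF)    // Katakana Phonetic Extensions
        || inRange(c, 0xFF66, 0xFF9F)    // Halfwidth Katakana
        || inRange(c, 0x3400, 0x4DBF)    // CJK Unified Ideographs Extension A
        || inRange(c, 0x4E00, 0x9FFF);   // CJK Unified Ideographs
}

// Assumption: "Roman" means ASCII letters plus accented Latin letters
// up to Latin Extended-B, excluding the multiplication and division signs.
bool isRoman(char32_t c) {
    return inRange(c, U'A', U'Z')
        || inRange(c, U'a', U'z')
        || (inRange(c, 0x00C0, 0x024F) && c != 0x00D7 && c != 0x00F7);
}
```

Using char32_t instead of wchar_t avoids depending on the platform's wchar_t width. Digits, spaces and punctuation match neither function, which fits the earlier remark about characters that are common to all scripts.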

+2

If you don't need precision, just check the first byte of each UTF-8 sequence: if the sequence is <= 2 bytes long (i.e. the first byte is <= 0xDF), assume that the character is Roman, otherwise Japanese.
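
As a rough sketch of this heuristic (guessByLeadByte is a hypothetical name, and the 'R'/'J' output format is just for illustration), one could walk the bytes, skip continuation bytes, and classify each sequence by its lead byte:

```cpp
#include <string>

// Hypothetical helper: returns one letter per UTF-8 sequence,
// 'R' (assumed Roman) for 1- or 2-byte sequences, 'J' (assumed Japanese)
// for longer ones. This is a guess based on length only, not a script test.
std::string guessByLeadByte(const std::string& utf8) {
    std::string out;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) == 0x80) continue;  // continuation byte, not a sequence start
        out += (b <= 0xDF) ? 'R' : 'J';
    }
    return out;
}
```

Keep in mind that this labels anything encoded in three or more bytes as Japanese, including Chinese, Korean and many symbols, and labels Greek or Cyrillic as Roman, which is the imprecision the answer accepts.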

Personally, I would probably just use Perl.

+1
