A UTF-8 string contains mixed Japanese and English characters. How do I determine which characters are Japanese and which are English?

Question

I have a UTF-8 encoded string that contains both Japanese and Roman characters. How can I determine which characters are Japanese and which are Roman?

+7

java c++ c

Hospeti Nov 17 '11 at 11:09

4 answers

Dietrich Epp · Answer 1 · 2011-11-17T11:22:07+0000

You are looking for the Unicode property "Script". I recommend the ICU library.

From: http://icu-project.org/apiref/icu4c/uscript_8h.html

 UScriptCode uscript_getScript(UChar32 codepoint, UErrorCode *err)

Gets the script code associated with the given codepoint.

The function returns the script code of the character. Here are some of the constants it can return:

USCRIPT_JAPANESE (Not sure if in this category ...)
USCRIPT_HIRAGANA (Japanese kana)
USCRIPT_KATAKANA (Japanese kana)
USCRIPT_HAN (Japanese Kanji)
USCRIPT_LATIN
USCRIPT_COMMON (spaces and punctuation marks that are common to all scripts)

ICU is available for Java, C and C++. You will need to decode the UTF-8 string into Unicode code points before you can call this function.
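As a rough illustration of the script-property approach, the sketch below uses the JDK's built-in Character.UnicodeScript (Java 7 and later) instead of ICU, so it is a stand-in for uscript_getScript and not the ICU API itself. The class name ScriptClassifier and its helper methods are hypothetical, and treating Han as Japanese is an assumption, since Han characters are shared with Chinese.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for this sketch; not part of ICU or the JDK.
class ScriptClassifier {

    // Returns one script name per code point, e.g. "HAN", "HIRAGANA", "LATIN", "COMMON".
    static List<String> scriptsOf(String text) {
        List<String> result = new ArrayList<>();
        // Java strings are UTF-16, so iterate by code point rather than by char
        // to handle characters outside the Basic Multilingual Plane.
        text.codePoints().forEach(cp ->
                result.add(Character.UnicodeScript.of(cp).name()));
        return result;
    }

    // Assumption: Hiragana, Katakana and Han all count as "Japanese".
    static boolean isJapanese(int codePoint) {
        Character.UnicodeScript s = Character.UnicodeScript.of(codePoint);
        return s == Character.UnicodeScript.HIRAGANA
                || s == Character.UnicodeScript.KATAKANA
                || s == Character.UnicodeScript.HAN;
    }

    static boolean isLatin(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.LATIN;
    }

    public static void main(String[] args) {
        // Three kanji (U+65E5 U+672C U+8A9E) followed by Latin text.
        String text = "\u65E5\u672C\u8A9E and English";
        System.out.println(scriptsOf(text));
    }
}
```

Spaces and punctuation come back as COMMON, so a caller has to decide separately how to treat them.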

Alternative: You can also use a Unicode-aware regex, although few engines support this syntax (Perl does). This PCRE will match runs of text that are definitely Japanese, but it will not catch everything.

 /[\p{Katakana}\p{Hiragana}\p{Han}]+/

Be careful when parsing such text, because Japanese text often includes romaji or numerals. A look at ja.wikipedia.org will quickly confirm this.
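The regex idea can also be tried in Java, whose java.util.regex engine accepts script names with an Is prefix. The following is a minimal sketch under that assumption; the class name JapaneseRuns and the method find are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for this sketch.
class JapaneseRuns {
    // One or more consecutive code points whose script is Hiragana, Katakana or Han.
    private static final Pattern JAPANESE_RUN =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+");

    // Returns every maximal run of Hiragana/Katakana/Han characters in the text.
    static List<String> find(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = JAPANESE_RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Latin text, one hiragana (U+3067), three kanji, then digits.
        System.out.println(find("Java\u3067\u65E5\u672C\u8A9E2011"));
    }
}
```

Japanese punctuation such as the ideographic full stop (U+3002) belongs to the Common script, so it ends a run here; this is one way the pattern misses text that a reader would call Japanese.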

mrembisz · Answer 2 · 2011-11-17T11:22:48+0000

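You can determine the Unicode general category of a character in Java with Character.getType(). For Japanese characters it will be Lo (Character.OTHER_LETTER); for Latin letters it will be Ll or Lu (Character.LOWERCASE_LETTER or Character.UPPERCASE_LETTER).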
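A small sketch of the general-category check might look like this. The class name CategoryCheck and the method category are hypothetical; only Character.getType and the Character constants come from the JDK.

```java
// Hypothetical helper class for this sketch.
class CategoryCheck {
    // Maps a code point to the abbreviation of its general category for the
    // three letter categories of interest, or "other" for everything else.
    static String category(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.OTHER_LETTER:     return "Lo";
            case Character.LOWERCASE_LETTER: return "Ll";
            case Character.UPPERCASE_LETTER: return "Lu";
            default:                         return "other";
        }
    }

    public static void main(String[] args) {
        // 'a', 'A', one hiragana (U+3042), one kanji (U+65E5), one digit.
        int[] samples = { 'a', 'A', 0x3042, 0x65E5, '1' };
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp) + " -> " + category(cp));
        }
    }
}
```

Note that Lo is not specific to Japanese: letters of other uncased scripts, such as Hebrew, are also Lo, so this check only separates cased Latin letters from everything else.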

pnezis · Answer 3 · 2011-11-17T11:22:58+0000

In terms of Unicode code points, Japanese characters can be Hiragana, Katakana or ideographs (Kanji). Each of these sets occupies a block with a defined start and end position, so you can write a function that checks whether a character falls within those limits.

 bool isJapanese(wchar_t w)
 {
     // Hiragana...
     if (w >= 0x3041 && w <= 0x309F)
         return true;
     // Do the same for the other sets ...
     return false;
 }

Similarly, you can implement the isRoman function ...
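For comparison, here is a sketch of the same range test in Java with the other blocks filled in. The block boundaries used are the Unicode Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF) and CJK Unified Ideographs (U+4E00 to U+9FFF) blocks; other CJK blocks such as the extensions or half-width Katakana are deliberately left out. The class name RangeCheck is hypothetical, and restricting isRoman to ASCII letters is an assumption.

```java
// Hypothetical helper class for this sketch.
class RangeCheck {
    // True if the code point lies in one of three common Japanese blocks.
    static boolean isJapanese(int cp) {
        return (cp >= 0x3040 && cp <= 0x309F)   // Hiragana block
            || (cp >= 0x30A0 && cp <= 0x30FF)   // Katakana block
            || (cp >= 0x4E00 && cp <= 0x9FFF);  // CJK Unified Ideographs block
    }

    // Assumption: "Roman" means only the ASCII letters A-Z and a-z.
    static boolean isRoman(int cp) {
        return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
    }

    public static void main(String[] args) {
        // One kanji (U+65E5), one hiragana (U+3042), then Latin letters.
        "\u65E5\u3042ab".codePoints().forEach(cp ->
                System.out.println(Integer.toHexString(cp)
                        + " japanese=" + isJapanese(cp) + " roman=" + isRoman(cp)));
    }
}
```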

Christoph · Answer 4 · 2011-11-17T11:18:44+0000

If you don't need precision, just check the first byte of each UTF-8 sequence: if the sequence length is <= 2 bytes (i.e. the first byte is <= 0xDF), assume the character is Roman, otherwise assume it is Japanese.
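The lead-byte heuristic could be sketched as follows. This assumes the input is valid UTF-8 and makes no attempt to detect malformed sequences; the class name LeadByteHeuristic and the 'R'/'J' output format are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for this sketch.
class LeadByteHeuristic {
    // Returns one letter per UTF-8 sequence: 'R' (assumed Roman) for sequences
    // of 1 or 2 bytes, 'J' (assumed Japanese) for longer ones.
    // Assumption: the byte array holds valid UTF-8.
    static String classify(byte[] utf8) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < utf8.length) {
            int lead = utf8[i] & 0xFF;          // bytes are signed in Java
            int length;
            if (lead < 0x80)      length = 1;   // 0xxxxxxx: ASCII
            else if (lead < 0xE0) length = 2;   // 110xxxxx: two-byte sequence
            else if (lead < 0xF0) length = 3;   // 1110xxxx: three-byte sequence
            else                  length = 4;   // 11110xxx: four-byte sequence
            out.append(lead <= 0xDF ? 'R' : 'J');
            i += length;                        // skip the continuation bytes
        }
        return out.toString();
    }

    static String classify(String text) {
        return classify(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // 'a', one hiragana (U+3042), 'b'.
        System.out.println(classify("a\u3042b"));
    }
}
```

Because the test only looks at sequence length, any character that needs three or more bytes, including Chinese, Korean or Devanagari text, is reported as Japanese, which is the imprecision the answer accepts.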

Personally, I would probably just use Perl.
