You are looking for the Unicode property "Script". I recommend the ICU library.
From: http://icu-project.org/apiref/icu4c/uscript_8h.html
UScriptCode uscript_getScript (UChar32 codepoint, UErrorCode *err)
Gets the script code associated with the given codepoint.
The return value is a constant identifying the script. Here are some of the constants:
- USCRIPT_JAPANESE (Not sure if in this category ...)
- USCRIPT_HIRAGANA (Japanese kana)
- USCRIPT_KATAKANA (Japanese kana)
- USCRIPT_HAN (Japanese Kanji)
- USCRIPT_LATIN
- USCRIPT_COMMON (spaces and punctuation marks that are common to all scripts)
ICU is available for Java, C and C++. You will need to decode your text into Unicode code points to use this function.
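As a rough illustration of the per-code-point lookup, here is a minimal Java sketch. Note that it does not call ICU: it uses the JDK's built-in Character.UnicodeScript, which reports the same Unicode Script property, as a stand-in for uscript_getScript. The class and method names (ScriptLookup, scriptsOf) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class ScriptLookup {
    // Returns the Unicode script of every code point in the text, in order.
    // Iterating code points (not chars) keeps supplementary characters intact.
    static List<Character.UnicodeScript> scriptsOf(String text) {
        List<Character.UnicodeScript> scripts = new ArrayList<>();
        text.codePoints().forEach(cp -> scripts.add(Character.UnicodeScript.of(cp)));
        return scripts;
    }

    public static void main(String[] args) {
        // U+65E5 (kanji), U+3042 (hiragana), U+30AB (katakana), then "a", " ", "1"
        String sample = "\u65E5\u3042\u30ABa 1";
        sample.codePoints().forEach(cp ->
            System.out.printf("U+%04X -> %s%n", cp, Character.UnicodeScript.of(cp)));
    }
}
```

With this approach, romaji comes back as LATIN, while spaces and digits come back as COMMON, so they have to be handled separately from the kana and kanji.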
Alternative: You can also use a Unicode-aware regex, although few engines support this syntax (Perl does). This Perl-style regex will match runs of text that are definitely Japanese, but it will not catch everything.
/[\p{Katakana}\p{Hiragana}\p{Han}]+/
You have to be careful when you parse these things, because Japanese text often includes romaji or numbers. A look at ja.wikipedia.org will quickly confirm this.
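To make the regex approach and its caveat concrete, here is a small Java sketch. It assumes Java's own spelling of script properties (\p{IsHan}, \p{IsHiragana}, \p{IsKatakana}) rather than the Perl syntax, and the class and method names (JapaneseRuns, runs) are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JapaneseRuns {
    // One or more consecutive kanji, hiragana, or katakana characters.
    static final Pattern KANA_OR_HAN =
        Pattern.compile("[\\p{IsHan}\\p{IsHiragana}\\p{IsKatakana}]+");

    // Collects every run of kana/kanji found in the text.
    static List<String> runs(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = KANA_OR_HAN.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "Java 8 " followed by U+306E (hiragana) and three kanji.
        // Only the kana/kanji tail is matched; "Java 8" is skipped.
        System.out.println(runs("Java 8 \u306E\u65B0\u6A5F\u80FD"));
    }
}
```

In the sample string, the Latin letters and the digit fall outside the match even though they are part of the same Japanese sentence, which is exactly the kind of gap to watch for.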
Dietrich Epp