There are two ways to do this: by block (\p{Block=...}) and by script (\p{Script=...}). The latter is probably more natural.
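To make the difference concrete, here is a minimal sketch (the helper names in_han_script and in_cjk_block are mine, purely for illustration). It checks two code points against a script property and a block property. U+2E80 is a CJK radical: it belongs to the Han script, but it sits outside the CJK Unified Ideographs block, so a block-based test misses it.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the single character is assigned to the Han script.
sub in_han_script {
    my ($ch) = @_;
    return $ch =~ /\A\p{Script=Han}\z/ ? 1 : 0;
}

# Hypothetical helper: true if the single character lies in the
# CJK Unified Ideographs block.
sub in_cjk_block {
    my ($ch) = @_;
    return $ch =~ /\A\p{Block=CJK_Unified_Ideographs}\z/ ? 1 : 0;
}

# U+4E2D is an ordinary ideograph; U+2E80 is CJK RADICAL REPEAT.
for my $cp (0x4E2D, 0x2E80) {
    my $ch = chr $cp;
    printf "U+%04X script=%d block=%d\n",
        $cp, in_han_script($ch), in_cjk_block($ch);
}
```

A block is just a contiguous range of code points, while a script groups characters by the writing system they belong to, which is usually what you mean when you ask "is this Chinese or Japanese text?".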
I don't know much about Chinese, but I think you want \p{Script=Han} (aka \p{Han}) for Chinese.
Japanese uses three scripts:
- Kanji: \p{Script=Han} (aka \p{Han})
- Hiragana: \p{Script=Hiragana} (aka \p{Hiragana}, aka \p{Hira})
- Katakana: \p{Script=Katakana} (aka \p{Katakana}, aka \p{Kana})
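As a rough illustration of how these properties could be used in a regex, the sketch below classifies single characters by script. The function name script_of is hypothetical, and I use the explicit Script= form throughout so the result does not depend on how a given Perl version resolves the short forms.

```perl
use strict;
use warnings;
use utf8;                                # this source contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');     # so the printed characters are encoded

# Hypothetical helper: return the name of the first listed script that
# matches a single character, or 'Other' if none of them does.
sub script_of {
    my ($ch) = @_;
    return 'Han'      if $ch =~ /\A\p{Script=Han}\z/;
    return 'Hiragana' if $ch =~ /\A\p{Script=Hiragana}\z/;
    return 'Katakana' if $ch =~ /\A\p{Script=Katakana}\z/;
    return 'Other';
}

# One character from each script, plus a Latin letter for contrast.
for my $ch (split //, '中ひカa') {
    printf "%s => %s\n", $ch, script_of($ch);
}
```

To match a whole run of Japanese text, you could combine the three in a bracketed character class such as [\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]+. Note that this would not cover punctuation or the prolonged sound mark, which are not assigned to any of these three scripts.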
You can look at perluniprops to find the one you are looking for, or you can use uniprops* to find which properties match a particular character.
$ uniprops 4E2D
U+4E2D ‹中› \N{CJK UNIFIED IDEOGRAPH-4E2D}
    \w \pL \p{L_} \p{Lo}
    All Any Alnum Alpha Alphabetic Assigned InCJK_UnifiedIdeographs CJK_Unified_Ideographs L Lo Gr_Base Grapheme_Base Graph GrBase Han Hani ID_Continue IDC ID_Start IDS Ideo Ideographic Letter L_ Other_Letter Print UIdeo Unified_Ideograph Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
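If you would rather do a similar lookup from inside a program, the core Unicode::UCD module should be enough for the script and block. This is only a sketch of an alternative to uniprops, not a replacement: it reports far fewer properties.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charblock);

# Look up the script and block of a code point (here U+4E2D).
my $cp = 0x4E2D;
printf "U+%04X script=%s block=%s\n", $cp, charscript($cp), charblock($cp);
```

For U+4E2D this should report the Han script and a CJK block, in line with the Han and CJK_Unified_Ideographs entries in the uniprops output.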
To find out which characters are in a given property, you can use unichars*. (This has limited usefulness, as most CJK characters are not named.)
$ unichars -au '\p{Han}'
⺀ U+2E80 CJK RADICAL REPEAT
⺁ U+2E81 CJK RADICAL CLIFF
⺂ U+2E82 CJK RADICAL SECOND ONE
⺃ U+2E83 CJK RADICAL SECOND TWO
⺄ U+2E84 CJK RADICAL SECOND THREE
⺅ U+2E85 CJK RADICAL PERSON
⺆ U+2E86 CJK RADICAL BOX
⺇ U+2E87 CJK RADICAL TABLE
⺈ U+2E88 CJK RADICAL KNIFE ONE
...
* uniprops and unichars are available from Unicode::Tussle.
ikegami