How can I run a regex that checks text for characters in a separate alphabet or script?

I would like to write a regular expression in Perl that checks a string for characters in a specific script. It would look something like this:

$text =~ .*P{'Chinese'}.* 

Is there an easy way to do this? For English it is quite easy, just test for [a-zA-Z], but for a script like Chinese or one of the Japanese scripts, I can't figure out any way to do this without writing out each character explicitly, which would make for very ugly code. Ideas? I can't be the first/only person who has wanted to do this.

+7
2 answers

See perldoc perluniprops for an exhaustive list of the properties you can use with \p. You will be interested in \p{CJK_Unified_Ideographs} and related block properties such as \p{CJK_Symbols_And_Punctuation}. \p{Hiragana} and \p{Katakana} give you the kana. There is also the \p{Script=...} property for various scripts: \p{Han} and \p{Script=Han} match Han characters (Chinese), but there is no corresponding \p{Script=Japanese}, simply because Japanese is written with several scripts.
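As a minimal sketch of how these properties might be used in a match, the following assumes the text has already been decoded into Perl characters (here via `use utf8` for literals in the source file). The helper names `contains_han` and `contains_kana` are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                              # the literals below are UTF-8 in the source
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: true if $text contains at least one Han character.
sub contains_han {
    my ($text) = @_;
    return $text =~ /\p{Han}/ ? 1 : 0;
}

# Hypothetical helper: true if $text contains Hiragana or Katakana.
sub contains_kana {
    my ($text) = @_;
    return $text =~ /[\p{Hiragana}\p{Katakana}]/ ? 1 : 0;
}

for my $sample ('hello', '中文', 'ひらがな', 'カタカナ') {
    printf "%s: han=%d kana=%d\n",
        $sample, contains_han($sample), contains_kana($sample);
}
```

If the text comes from a file or socket instead of a literal, it would need to be decoded first (for example with an `:encoding(UTF-8)` layer), otherwise the properties are tested against raw bytes.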

+9

There are two ways to do this: by block (\p{Block=...}) and by script (\p{Script=...}). The latter is probably more natural.
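To illustrate the difference between the two, here is a small sketch comparing one ideograph with one CJK radical. Both belong to the Han script, but only the first sits in the CJK Unified Ideographs block. The variable names are just for illustration.

```perl
use strict;
use warnings;

my $ideograph = "\x{4E2D}";   # CJK UNIFIED IDEOGRAPH-4E2D
my $radical   = "\x{2E80}";   # CJK RADICAL REPEAT

for my $char ($ideograph, $radical) {
    # Test the same character against a script property and a block property.
    printf "U+%04X script=Han: %s, block=CJK_Unified_Ideographs: %s\n",
        ord $char,
        ($char =~ /\p{Script=Han}/                   ? 'yes' : 'no'),
        ($char =~ /\p{Block=CJK_Unified_Ideographs}/ ? 'yes' : 'no');
}
```

A script groups characters by the writing system they belong to, while a block is just a contiguous range of code points, which is why the script test tends to be the better fit here.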

I don't know much about Chinese, but I think you want \p{Script=Han}, aka \p{Han}, for Chinese.

Japanese uses three scripts:

  • Kanji: \p{Script=Han} aka \p{Han}
  • Hiragana: \p{Script=Hiragana} aka \p{Hiragana} aka \p{Hira}
  • Katakana: \p{Script=Katakana} aka \p{Katakana} aka \p{Kana}
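As a rough sketch of how the three properties above could be combined, the following counts the characters of a string per script. The function name `count_japanese_scripts` and the `Other` bucket are assumptions made for this example.

```perl
use strict;
use warnings;
use utf8;                              # the sample literal below is UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: count the characters of $text per script.
sub count_japanese_scripts {
    my ($text) = @_;
    my %count = (Han => 0, Hiragana => 0, Katakana => 0, Other => 0);
    for my $char (split //, $text) {
        if    ($char =~ /\p{Han}/)      { $count{Han}++ }
        elsif ($char =~ /\p{Hiragana}/) { $count{Hiragana}++ }
        elsif ($char =~ /\p{Katakana}/) { $count{Katakana}++ }
        else                            { $count{Other}++ }
    }
    return \%count;
}

my $count = count_japanese_scripts('日本語のテスト');
printf "%s=%d\n", $_, $count->{$_} for sort keys %$count;
```

Note that some characters commonly seen in Japanese text, such as punctuation, are not assigned to any of these three scripts and would land in the `Other` bucket.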

You can look at perluniprops to find the one you are looking for, or you can use uniprops* to find which properties match a particular character.

 $ uniprops 4E2D
 U+4E2D ‹中› \N{CJK UNIFIED IDEOGRAPH-4E2D}
     \w \pL \p{L_} \p{Lo}
     All Any Alnum Alpha Alphabetic Assigned InCJK_UnifiedIdeographs
     CJK_Unified_Ideographs L Lo Gr_Base Grapheme_Base Graph GrBase Han Hani
     ID_Continue IDC ID_Start IDS Ideo Ideographic Letter L_ Other_Letter Print
     UIdeo Unified_Ideograph Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum
     X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word

To find out which characters are in a given property, you can use unichars*. (This has limited usefulness, as most CJK characters are not named.)

 $ unichars -au '\p{Han}'
  ⺀ U+2E80 CJK RADICAL REPEAT
  ⺁ U+2E81 CJK RADICAL CLIFF
  ⺂ U+2E82 CJK RADICAL SECOND ONE
  ⺃ U+2E83 CJK RADICAL SECOND TWO
  ⺄ U+2E84 CJK RADICAL SECOND THREE
  ⺅ U+2E85 CJK RADICAL PERSON
  ⺆ U+2E86 CJK RADICAL BOX
  ⺇ U+2E87 CJK RADICAL TABLE
  ⺈ U+2E88 CJK RADICAL KNIFE ONE
  ...

* - uniprops and unichars are available from Unicode::Tussle.
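If installing Unicode::Tussle is not an option, a much smaller subset of this information can be looked up with the core Unicode::UCD module. This is a different tool from uniprops, not a replacement for it: the sketch below only reports the script and block of a code point. The helper name `describe_codepoint` is made up for illustration.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charblock);

# Hypothetical helper: describe a code point given as a hex string such as '4E2D'.
sub describe_codepoint {
    my ($hex) = @_;
    my $code = hex $hex;               # convert the hex string to a number
    return sprintf 'U+%04X script=%s block=%s',
        $code,
        charscript($code) // 'unknown',
        charblock($code)  // 'unknown';
}

print describe_codepoint($_), "\n" for qw(4E2D 3042 30AB);
```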

+4
