I can offer some insight, but it's hard to say whether my answer will be "helpful." First off, I only speak and read English, so I obviously do not speak or read Chinese. I do happen to be the author of RegexKitLite, which is an Objective-C wrapper around the ICU regex engine. This is obviously not perl
:).
That said, the ICU regex engine has a feature that sounds remarkably like what you are trying to do. Specifically, it provides the modifier option UREGEX_UWORD
, which can be turned on dynamically with the usual syntax (?w:...)
. The ICU documentation describes this modifier as follows:
Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either "word" or "non-word", which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.
You can use this in a regular expression, for example (?w:\b(.*?)\b)
, to "extract" words from a string. The ICU regex engine has a fairly powerful word-breaking engine designed specifically to find word breaks in written languages that, unlike English, do not separate words with an explicit space character. Again, I do not read or write these languages, but my understanding is that text in them looks roughly like "itisroughlysomethinglikethis". The ICU word-breaking engine uses heuristics, and occasionally dictionaries, to find the word breaks. As far as I understand, Thai is a particularly difficult case. In fact, I use ฉันกินข้าว
(Thai for "I eat rice," or so I was told) with the regular expression (?w)\b\s*
to perform a split
operation on the string and extract the words. Without (?w)
, the string cannot be split at word breaks. With (?w)
, the split yields the words ฉัน
, กิน
and ข้าว
.
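For anyone who wants to experiment with this kind of word breaking without an ICU regex binding, here is a rough TypeScript sketch. To be clear, it does not use the (?w) regex flag at all: it swaps in Intl.Segmenter, which, as far as I know, is backed by ICU's word-breaking engine in runtimes such as recent Node.js builds. That backing is an assumption about your runtime, and splitWords is a made-up helper name, so treat the result as something to check rather than a guarantee that you will get the same three Thai words as above.

```typescript
// Sketch only: Intl.Segmenter stands in for ICU's (?w)\b; it is not the regex flag.
// Assumption: the runtime's Intl is backed by ICU with full locale data.

// Minimal shape of the records yielded by Intl.Segmenter#segment().
interface WordSegment {
  segment: string;
  index: number;
  isWordLike?: boolean;
}

// Hypothetical helper: split `text` at word boundaries for `locale`
// and keep only word-like segments (drops spaces and punctuation).
function splitWords(text: string, locale: string): string[] {
  // The cast avoids depending on the es2022.intl typings being enabled.
  const segmenter = new (Intl as any).Segmenter(locale, { granularity: "word" });
  const words: string[] = [];
  for (const part of segmenter.segment(text) as Iterable<WordSegment>) {
    if (part.isWordLike) {
      words.push(part.segment);
    }
  }
  return words;
}

// English has explicit spaces, so this is the easy case.
console.log(splitWords("I eat rice", "en"));

// Thai has no spaces; the breaks found here depend on the runtime's ICU data.
console.log(splitWords("ฉันกินข้าว", "th"));
```

If the Thai call comes back as a single segment, the runtime was probably built without the dictionary data that the word breaker relies on.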
If the above sounds like the problem you are facing, then this may be the reason. If so, I am not aware of any way to do this in perl
, but I would not consider that an authoritative answer, since I use the ICU regex engine more often than the perl
one and am clearly not properly motivated to find a working perl
solution when I already have one :). Hope this helps.