Regular expression to match borders between different Unicode scripts

Question

Regular expression engines have the concept of zero-width matches, some of which are useful for finding word boundaries:

\b - present in most engines; matches any boundary between a word character and a non-word character
\< and \> - present in Vim; match only the boundary at the start of a word and at the end of a word, respectively.

A newer concept in some regex engines is Unicode character classes. One such class is the script property, which can distinguish Latin, Greek, Cyrillic, etc. These examples are equivalent and match any character of the Greek writing system:

\p{greek}
\p{script=greek}
\p{script:greek}
[:script=greek:]
[:script:greek:]

But while reading through sources on regular expressions and Unicode, I could not determine whether there is any standard or non-standard way to get a zero-width match where one script ends and another begins.

In the string παν語 there should be a match between the characters ν and 語, just as \b and \< would match immediately before the character π.

Now for this example, I could hack something together that searches for \p{greek} followed by \p{Han}, and I could even hack something together based on all possible combinations of two Unicode scripts.
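
For this one pair of scripts, the hack could be sketched in Perl with a lookbehind and a lookahead. This is only an illustration of the pairwise idea, assuming Perl's \p{Script=...} syntax; the variable names are made up for the example, and it handles only the Greek-to-Han direction.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

my $text = "παν語";

# Zero-width assertion: the previous character is Greek and the next is Han.
# This covers only the Greek-to-Han direction of this one script pair.
my $border = qr/(?<=\p{Script=Greek})(?=\p{Script=Han})/;

# Mark every border with a visible bar so the match position can be seen.
(my $marked = $text) =~ s/$border/|/g;
print "$marked\n";    # prints: παν|語
```

Every additional script pair, in both directions, would need its own alternative in the pattern.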

But this would not be a robust solution, since new scripts are still being added to Unicode with each version. Is there any reliable, future-proof way to express this? Or is there a proposal to add it?

regex unicode character-properties word-boundary

hippietrail May 11 '13 at 1:39

1 answer

tchrist · Accepted Answer · 2013-05-14T00:14:40+0000

EDIT: I just noticed that you didn't actually specify which pattern-matching language you are using. Well, I hope a Perl solution will work for you, because the necessary mechanics are likely to be really tough in any other language. Plus, if you are doing pattern matching with Unicode, Perl really is the best choice for this particular kind of work.

When the $rx variable below is set to the appropriate pattern, this little snippet of Perl code:

    my $data = "foo1 and Πππ 語語語 done";

    while ($data =~ /($rx)/g) {
        print "Got string: '$1'\n";
    }

Generates this output:

    Got string: 'foo1 and '
    Got string: 'Πππ '
    Got string: '語語語 '
    Got string: 'done'

That is, it pulls out a Latin string, a Greek string, a Han string, and another Latin string. That is pretty damn close to what I think you actually need.

The reason I did not post this yesterday is that I was getting weird core dumps. Now I know why.

My solution uses lexical variables inside a (??{...}) construct. It turns out that this is unstable before v5.17.1, and at best works only by accident. It fails on v5.17.0, but succeeds on v5.18.0 RC0 and RC2. So I've added a use v5.17.1 to make sure you are running something recent enough to trust with this approach.
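
For readers who have not met (??{...}) before, here is a minimal, unrelated sketch of what the construct does. The pattern and the test strings are invented for illustration; the block deliberately reads only the capture variable $1 and no lexical variables, so it stays clear of the instability described above.

```perl
use strict;
use warnings;

# A digit, then exactly that many "a" characters. The (??{ ... }) block
# runs during matching; the string it returns is compiled and matched
# as a pattern at that position.
my $counted = qr/^(\d)(??{ 'a{' . $1 . '}' })$/;

for my $s ("3aaa", "2aaa") {
    printf "%-5s %s\n", $s, ($s =~ $counted ? "matches" : "does not match");
}
```

The first string matches because the block builds a{3}; the second does not, because a{2} leaves one "a" before the end of the string.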

First, I decided that you did not really want runs of all the same script type; you wanted runs of all the same script type plus Common and Inherited. Otherwise you will get tripped up by punctuation, whitespace, and digits for Common, and by combining characters for Inherited. I really don't think you want those to interrupt your run of "all the same script", but if you do, it is easy to stop considering them.
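
To see why Common and Inherited need this special treatment, it can help to ask Unicode::UCD for the script of a few sample characters. The code points below are chosen only for illustration, and the exact set of names returned depends on the Unicode version bundled with your Perl.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Sample code points: space, digit one, comma, combining acute accent,
# Greek small letter pi, and the Han character from the example string.
my @samples = (0x0020, 0x0031, 0x002C, 0x0301, 0x03C0, 0x8A9E);

for my $cp (@samples) {
    # charscript() takes a code point and returns the script name.
    printf "U+%04X  %s\n", $cp, charscript($cp);
}
```

The space, digit, and comma report Common, and the combining accent reports Inherited, so none of them says anything about which script the surrounding text is in.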

So what we do is look ahead for the first character whose script type is something other than Common or Inherited. More than that, we extract from it what that script type actually is, and use this information to construct a new pattern: any number of characters whose script type is either Common, Inherited, or whatever script type we just found and saved. Then we evaluate that new pattern and continue.

Hey, I said it was hairy, didn't I?
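
If a zero-width regex match is not strictly required, the same grouping rule can be sketched as a plain loop over the characters. This is an alternative illustration, not the regex solution described here: script_runs is a hypothetical helper name, and the only library call it relies on is charscript from Unicode::UCD.

```perl
use strict;
use warnings;
use utf8;
use Unicode::UCD qw(charscript);

# Hypothetical helper: split a string into runs of a single script.
# Common and Inherited characters never start a new run; they simply
# join whichever run is currently open.
sub script_runs {
    my ($text) = @_;
    my @runs;
    my ($current, $script) = ('', undef);
    for my $ch (split //, $text) {
        my $s = charscript(ord $ch) // 'Unknown';
        my $neutral = $s eq 'Common' || $s eq 'Inherited';
        # A real script that differs from the open run's script closes it.
        if (!$neutral && defined $script && $s ne $script) {
            push @runs, $current;
            $current = '';
        }
        $script = $s unless $neutral;
        $current .= $ch;
    }
    push @runs, $current if length $current;
    return @runs;
}

binmode STDOUT, ':encoding(UTF-8)';
print "Got string: '$_'\n" for script_runs("foo1 and Πππ 語語語 done");
```

The positions where one returned run ends and the next begins are the script borders the question asks about.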

In the program I'm about to show, I've left in some commented-out debug statements that show exactly what it is doing. If you uncomment them, you get this output for the last run, which should help in understanding the approach:

    DEBUG: Got peekahead character f, U+0066
    DEBUG: Scriptname is Latin
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
    Got string: 'foo1 and '
    DEBUG: Got peekahead character Π, U+03a0
    DEBUG: Scriptname is Greek
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Greek}]*}
    Got string: 'Πππ '
    DEBUG: Got peekahead character 語, U+8a9e
    DEBUG: Scriptname is Han
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Han}]*}
    Got string: '語語語 '
    DEBUG: Got peekahead character d, U+0064
    DEBUG: Scriptname is Latin
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
    Got string: 'done'

And finally, here is the big hairy deal:

    use v5.17.1;
    use strict;
    use warnings;
    use warnings FATAL => "utf8";
    use open qw(:std :utf8);
    use utf8;

    use Unicode::UCD qw(charscript);

    # regex to match a string that is all of the
    # same Script=XXX type
    #
    my $rx = qr{
        (?=
            [\p{Script=Common}\p{Script=Inherited}] *
            (?<CAPTURE>
                [^\p{Script=Common}\p{Script=Inherited}]
            )
        )
        (??{
            my $capture = $+{CAPTURE};
       #####printf "DEBUG: Got peekahead character %s, U+%04x\n", $capture, ord $capture;
            my $scriptname = charscript(ord $capture);
       #####print "DEBUG: Scriptname is $scriptname\n";
            my $run = q([\p{Script=Common}\p{Script=Inherited}\p{Script=)
                    . $scriptname
                    . q(}]*);
       #####print "DEBUG: string to re-interpolate as regex is q{$run}\n";
            $run;
        })
    }x;

    my $data = "foo1 and Πππ 語語語 done";

    $| = 1;

    while ($data =~ /($rx)/g) {
        print "Got string: '$1'\n";
    }

Yes, there ought to be a better way. I just don't think there is one yet.

So enjoy it.
