EDIT: I just noticed that you didn’t actually specify which pattern-matching language you were using. Well, I hope a Perl solution will work for you, because the mechanics needed are likely to be really tough in any other language. Plus, if you are doing pattern matching with Unicode, Perl really is the best choice for that particular kind of work.
Given a suitable pattern in the $rx variable, this little snippet of Perl code:
    my $data = "foo1 and Πππ 語語語 done";

    while ($data =~ /($rx)/g) {
        print "Got string: '$1'\n";
    }
Generates this output:
    Got string: 'foo1 and '
    Got string: 'Πππ '
    Got string: '語語語 '
    Got string: 'done'
That is, it pulls out a Latin string, a Greek string, a Han string, and another Latin string. That is pretty darned close to what I think you actually need.
The reason I didn’t post this yesterday is that I was getting weird core dumps. Now I know why.
My solution uses lexical variables inside a (??{...}) construct. It turns out that this is unstable before v5.17.1, and at best worked only by accident. It fails on v5.17.0, but succeeds on v5.18.0 RC0 and RC2. So I’ve added a use v5.17.1 to make sure you’re running something recent enough to trust with this approach.
First, I decided that you don’t actually want a run of all the same script type; you want a run of all the same script type plus Common and Inherited. Otherwise you will get tripped up by punctuation, whitespace, and digits for Common, and by combining characters for Inherited. I really don’t think you want those to interrupt your run of “all the same script”, but if you do, it’s easy to stop considering them.
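To see why Common and Inherited have to ride along, it helps to ask Unicode::UCD what script a few everyday characters belong to. This is only a small illustrative sketch; the sample characters are arbitrary picks, not taken from the program in this answer.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Arbitrary sample code points, chosen for illustration only:
# a letter, a digit, a space, a comma, and COMBINING ACUTE ACCENT.
for my $cp (ord("f"), ord("1"), ord(" "), ord(","), 0x0301) {
    # charscript() takes a code point number and returns its script name.
    printf "U+%04X is %s\n", $cp, charscript($cp);
}
```

The letter reports Latin, the digit, space, and comma report Common, and the combining accent reports Inherited, so a run defined strictly as "Latin only" would break at every space or digit.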
So what we do is look ahead for the first character whose script type is something other than Common or Inherited. We then find out what that script type actually is, and use that information to construct a new pattern: any number of characters whose script type is Common, Inherited, or whatever script type we just found and saved. Then we evaluate that new pattern and continue.
Hey, I said it was hairy, didn’t I?
In the program I’m about to show, I’ve left in some commented-out debug statements that show just what it’s doing. If you uncomment them, you get this output for the last run, which should help in understanding the approach:
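If you just want to follow the logic without the regex trickery, the same steps can be approximated with an ordinary loop. The following is a minimal sketch for illustration, not the program shown below: it deliberately swaps the (??{...}) construct for a plain peek-then-match loop, and script_runs is a made-up helper name. It assumes charscript returns a name that \p{Script=...} accepts on your perl.

```perl
use strict;
use warnings;
use utf8;
use Unicode::UCD qw(charscript);

# Hypothetical helper (not from the original program): split a string into
# runs of one script, letting Common and Inherited characters ride along.
sub script_runs {
    my ($text) = @_;
    my @runs;
    while (length $text) {
        # Start with the two "neutral" scripts that never break a run.
        my $class = '\p{Script=Common}\p{Script=Inherited}';

        # Peek ahead for the first character with any other script.
        if ($text =~ /^[\p{Script=Common}\p{Script=Inherited}]*([^\p{Script=Common}\p{Script=Inherited}])/) {
            # Assumption: fall back to "Unknown" if no script name comes back.
            my $script = charscript(ord $1) // 'Unknown';
            $class .= "\\p{Script=$script}";
        }

        # Consume the longest run matching the class we just built.
        $text =~ s/^([$class]+)// or last;
        push @runs, $1;
    }
    return @runs;
}

binmode STDOUT, ':encoding(UTF-8)';
print "Got string: '$_'\n" for script_runs("foo1 and Πππ 語語語 done");
```

On the sample string this should produce the same four runs as the output shown earlier. Unlike the lookahead-based pattern, this loop also returns a trailing run made up only of Common or Inherited characters, which may or may not be what you want.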
    DEBUG: Got peekahead character f, U+0066
    DEBUG: Scriptname is Latin
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
    Got string: 'foo1 and '
    DEBUG: Got peekahead character Π, U+03a0
    DEBUG: Scriptname is Greek
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Greek}]*}
    Got string: 'Πππ '
    DEBUG: Got peekahead character 語, U+8a9e
    DEBUG: Scriptname is Han
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Han}]*}
    Got string: '語語語 '
    DEBUG: Got peekahead character d, U+0064
    DEBUG: Scriptname is Latin
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
    Got string: 'done'
And finally, here is the big hairy deal:
    use v5.17.1;
    use strict;
    use warnings;
    use warnings FATAL => "utf8";
    use open qw(:std :utf8);
    use utf8;

    use Unicode::UCD qw(charscript);
Yes, there ought to be a better way. I don’t think there is one yet.
So enjoy it.