Perl-compatible regex choking on multiple instances of character classes

Question

I started with some crazy crashes using preg_replace in PHP, and boiled it down to the problem case: having more than one character class that uses both the dotted "i" and the undotted "ı" together. Here is a simple test case in PHP:

    <?php
    echo 'match single normal i: ';
    $str = 'mi';
    echo (preg_match('!m[ıi]!', $str)) ? "ok\n" : "fail\n";

    echo 'match single undotted ı: ';
    $str = 'mı';
    echo (preg_match('!m[ıi]!', $str)) ? "ok\n" : "fail\n";

    echo 'match double normal i: ';
    $str = 'misir';
    echo (preg_match('!m[ıi]s[ıi]r!', $str)) ? "ok\n" : "fail\n";

    echo 'match double undotted ı: ';
    $str = 'mısır';
    echo (preg_match('!m[ıi]s[ıi]r!', $str)) ? "ok\n" : "fail\n";
    ?>

And the same test again in Perl:

    #!/usr/bin/perl
    $str = 'mi';
    $str =~ m/m[ıi]/ && print "match single normal i\n";
    $str = 'mı';
    $str =~ m/m[ıi]/ && print "match single undotted ı\n";
    $str = 'misir';
    $str =~ m/m[ıi]s[ıi]r/ && print "match double normal i\n";
    $str = 'mısır';
    $str =~ m/m[ıi]s[ıi]r/ && print "match double undotted ı\n";

The first three tests work fine. The last one does not match.

Why does this work fine with one character class, but not with a second one in the same expression? How can I write an expression that matches a word like this regardless of which combination of the two letters it is written with?

Edit: Background on the language problem I'm trying to program for.

Edit 2: Adding the use utf8; directive fixes the Perl version. Since my original problem was in a PHP program, and I only switched to Perl to check whether this was a bug in PHP, this does not help me much. Does anyone know a directive that keeps PHP from choking on it?

+4

php regex perl unicode turkish

Caleb Nov 22 '10 at 20:49


2 answers

You may need to tell Perl that your source file contains utf8 characters. Try:

    #!/usr/bin/perl
    use utf8;    # **** Add this line
    $str = 'mısır';
    $str =~ m/m[ıi]s[ıi]r/ && print "match double undotted ı\n";

That does not help you with PHP, but there may be a similar directive in PHP. Otherwise, try using some form of escape sequence to avoid placing a literal character in the source code. I don't know anything about PHP, so I can't help with that.

Edit:
I have read that PHP does not support Unicode. So the Unicode input you pass in is most likely treated as the string of bytes that the Unicode text was encoded as.
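To see the bytes PHP is actually working with, a quick check like the following may help. This is only a sketch: the variable names are made up for illustration, and the word is spelled with explicit byte escapes (assuming UTF-8) so the result does not depend on how the file is saved.

```php
<?php
// Assumption: the text is UTF-8, where ı (U+0131) is the two bytes C4 B1.
$dotless = "\xc4\xb1";
$word    = 'm' . $dotless . 's' . $dotless . 'r';   // "mısır"

// strlen() counts bytes, not characters: 5 letters, 7 bytes.
$byteLength = strlen($word);

// bin2hex() shows the raw bytes the regex engine sees by default.
$hex = bin2hex($word);

echo $byteLength, "\n";   // 7
echo $hex, "\n";          // 6dc4b173c4b172
```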

If you can be sure that your input comes in as UTF-8, then you can match the UTF-8 byte sequence for ı, which is \xc4\xb1, as in:

    $str = 'mısır'; # Make sure this source file is encoded as UTF-8 or this match will fail
    echo (preg_match('!m(i|\xc4\xb1)s(i|\xc4\xb1)r!', $str)) ? "ok\n" : "fail\n";

Does that work?

Edit again:
I can explain why the first three tests pass. Suppose that in your encoding ı is encoded as ABCDE. Then PHP sees the following:

    echo 'match single normal i: ';
    $str = 'mi';
    echo (preg_match('!m[ABCDEi]!', $str)) ? "ok\n" : "fail\n";

    echo 'match single undotted ABCDE: ';
    $str = 'mABCDE';
    echo (preg_match('!m[ABCDEi]!', $str)) ? "ok\n" : "fail\n";

    echo 'match double normal i: ';
    $str = 'misir';
    echo (preg_match('!m[ABCDEi]s[ABCDEi]r!', $str)) ? "ok\n" : "fail\n";

    echo 'match double undotted ABCDE: ';
    $str = 'mABCDEsABCDEr';
    echo (preg_match('!m[ABCDEi]s[ABCDEi]r!', $str)) ? "ok\n" : "fail\n";

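which makes it obvious why the first three tests pass and the last one fails: the second test only matches because the class consumes the first byte of ı and nothing has to follow it. If you use the start/end anchors ^...$, I think you will find that only the first and third tests pass, since those strings contain no multibyte character.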
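The same effect can be reproduced with explicit bytes. PCRE's u pattern modifier, which is not mentioned above, looks like the switch the question is asking for: it makes preg_match treat both the pattern and the subject as UTF-8. A minimal sketch, assuming UTF-8 data throughout; the variable names are hypothetical.

```php
<?php
// Assumption: everything is UTF-8, where ı (U+0131) is the bytes C4 B1.
$dotless = "\xc4\xb1";
$class   = '[' . $dotless . 'i]';
$single  = 'm' . $dotless;
$double  = 'm' . $dotless . 's' . $dotless . 'r';

// Byte mode (no modifier): the class holds three bytes, C4, B1 and "i".
$singleBytes   = preg_match('!m' . $class . '!', $single);                  // 1: "m" then byte C4
$anchoredBytes = preg_match('!^m' . $class . '$!', $single);                // 0: byte B1 is left over
$doubleBytes   = preg_match('!m' . $class . 's' . $class . 'r!', $double);  // 0: B1 is not "s"

// UTF-8 mode (u modifier): the class holds two characters, ı and i.
$doubleUtf8 = preg_match('!m' . $class . 's' . $class . 'r!u', $double);    // 1

var_dump($singleBytes, $anchoredBytes, $doubleBytes, $doubleUtf8);
```

Note that with the u modifier both the pattern and the subject have to be valid UTF-8; on invalid input preg_match returns false rather than 0.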

+8

Adrian Pronk Nov 22 '10 at 20:55


tchrist · Accepted Answer · 2010-11-22T21:54:37+0000

Multibyte sequences will not do what you want in bracketed character classes if the UTF-8 is incorrectly interpreted as a sequence of 8-bit bytes. Think about it. If [nñm] is misinterpreted not as three logical characters but as four physical bytes, you will only match a character whose code point is 6E or C3 or B1 or 6D.

For some purposes, you may need to rewrite [nñm] as (?:n|ñ|m). It just depends on what you are doing. Case handling still will not work.
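Both points can be checked at the byte level. A sketch, assuming UTF-8 and no u modifier, with hypothetical variable names: the class [nñm] will match half of a character, while the alternation only matches complete byte sequences.

```php
<?php
// Assumption: UTF-8 data, where ñ (U+00F1) is C3 B1 and ı (U+0131) is C4 B1.
$enye    = "\xc3\xb1";
$dotless = "\xc4\xb1";

// Byte mode: [nñm] is really the four-byte set 6E, C3, B1, 6D,
// so a lone C3 byte (not a character at all) matches.
$classMatchesHalf = preg_match('![n' . $enye . 'm]!', "\xc3");

// The alternation compares whole byte sequences, so a lone C3 does not match...
$altMatchesHalf = preg_match('!(?:n|' . $enye . '|m)!', "\xc3");

// ...and the same rewrite handles the word from the question without any modifier.
$word    = 'm' . $dotless . 's' . $dotless . 'r';
$altWord = preg_match('!m(?:' . $dotless . '|i)s(?:' . $dotless . '|i)r!', $word);

var_dump($classMatchesHalf, $altMatchesHalf, $altWord);   // int(1), int(0), int(1)
```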

In addition, Unicode has special casing rules for the Turkish dotless i.

It looks like PHP is just not modern enough. Sigh.
