Perl - regular expression - position of the first mismatched character

I want to find the position in the line where the regular expression ceases to match.

A simple example:

    my $x = 'abcdefghijklmnopqrstuvwxyz';
    $x =~ /gho/;

In this example, I need to get the position of the character "h", because "gh" matches and "o" is the first character that does not match.

I was thinking about using pos or $-, but these are not set on a failed match. Another solution would be to iteratively shorten the regex pattern until it matches, but that is very ugly and does not work on complex patterns.

EDIT:

For the linguists: I'm sorry for my terrible explanation.

To clarify my situation: if you think of a regular expression as a state machine, there is a point where matching stops because a character does not fit. That point is what I am looking for.

Using iterative parenthesis nesting (as mentioned by eugene y) is a good idea, but it does not work with quantifiers, and I would have to edit the pattern.

Are there any other ideas?

+4
5 answers

What you propose is difficult but doable.

If I can paraphrase what I understand, you want to find out how far a failed match got before it failed. To do this, you need to be able to parse the regular expression.

The best regex parser is probably Perl itself, run with the -Mre=debug command-line switch:

    $ perl -Mre=debug -e'"abcdefghijklmnopqr"=~/gh[ijkl]{5}/'
    Compiling REx "gh[ijkl]{5}"
    Final program:
       1: EXACT <gh> (3)
       3: CURLY {5,5} (16)
       5:   ANYOF[il][] (0)
      16: END (0)
    anchored "gh" at 0 (checking anchored) minlen 7
    Guessing start of match in sv for REx "gh[ijkl]{5}" against "abcdefghijklmnopqr"
    Found anchored substr "gh" at offset 6...
    Starting position does not contradict /^/m...
    Guessed: match at offset 6
    Matching REx "gh[ijkl]{5}" against "ghijklmnopqr"
       6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
       8 <defgh> <ijklmnopqr>    |  3:CURLY {5,5}(16)
                                      ANYOF[il][] can match 4 times out of 5...
                                      failed...
    Match failed
    Freeing REx: "gh[ijkl]{5}"

You can shell out to this Perl command line with your regex and parse its output (note that the debug trace is written to STDERR, not STDOUT). Find `

Here is a regex that matches:

    $ perl -Mre=debug -e'"abcdefghijklmnopqr"=~/gh[ijkl]{3}/'
    Compiling REx "gh[ijkl]{3}"
    Final program:
       1: EXACT <gh> (3)
       3: CURLY {3,3} (16)
       5:   ANYOF[il][] (0)
      16: END (0)
    anchored "gh" at 0 (checking anchored) minlen 5
    Guessing start of match in sv for REx "gh[ijkl]{3}" against "abcdefghijklmnopqr"
    Found anchored substr "gh" at offset 6...
    Starting position does not contradict /^/m...
    Guessed: match at offset 6
    Matching REx "gh[ijkl]{3}" against "ghijklmnopqr"
       6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
       8 <defgh> <ijklmnopqr>    |  3:CURLY {3,3}(16)
                                      ANYOF[il][] can match 3 times out of 3...
      11 <ghijk> <lmnopqr>       | 16: END(0)
    Match successful!
    Freeing REx: "gh[ijkl]{3}"

You will need to build a parser that can handle the output of the Perl re debugger. The left and right angle brackets show how far into the string the regex engine is at each step of the match attempt.

This is not an easy project, by the way...
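As a rough starting point for such a parser, here is a minimal sketch. It does not run perl itself; it only scans trace text that you have already captured, and it assumes the execution lines look like the ones shown above (an offset, two angle-bracketed chunks, then a `|`). The helper name `furthest_position` is made up for this illustration, and newer Perl versions may format the trace slightly differently.

```perl
use strict;
use warnings;

# Hypothetical helper: return the largest string offset seen in the
# execution part of a "perl -Mre=debug" trace, or undef if none is found.
sub furthest_position {
    my ($trace) = @_;
    my $max;
    for my $line (split /\n/, $trace) {
        # Execution lines look like:  8 <defgh> <ijklmnopqr> | 3:CURLY {5,5}(16)
        # Assumption: the bracketed chunks contain no '>' characters.
        next unless $line =~ /^\s*(\d+)\s+<[^>]*>\s+<[^>]*>\s*\|/;
        $max = $1 if !defined $max || $1 > $max;
    }
    return $max;
}

# Execution part of the failing trace from this answer.
my $trace = <<'END_TRACE';
Matching REx "gh[ijkl]{5}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {5,5}(16)
                                  ANYOF[il][] can match 4 times out of 5...
                                  failed...
Match failed
END_TRACE

my $pos = furthest_position($trace);
print "engine got as far as offset ", $pos // 'unknown', "\n";
```

For this trace the sketch reports offset 8, which is where the CURLY node started. It does not account for the four characters the class consumed before failing; to get that you would also have to parse the "can match 4 times out of 5" line.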

+4

You can capture the part that matched and use the index function to find its position:

    my $x = 'abcdefghijklmnopqrstuvwxyz';
    $x =~ /(g(h(o)?)?)/;
    print index($x, $1) + length($1), "\n"; #8
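For purely literal patterns the nesting can be generated instead of written by hand. The following is only a sketch under that assumption (no quantifiers or character classes in the input); the helper name `partial_literal_end` is invented for the example. It reads the end offset from `$+[1]` rather than calling index, because index returns the first occurrence of the captured text, which is not necessarily where the regex matched.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a literal such as 'gho' into /(g(?:h(?:o)?)?)/
# and return the end offset of whatever prefix of it matched, or undef.
sub partial_literal_end {
    my ($string, $literal) = @_;
    my @chars = map { quotemeta($_) } split //, $literal;
    return unless @chars;

    # Build the nesting from the inside out: o, then h(?:o)?, then g(?:h(?:o)?)?
    my $pattern = pop @chars;
    $pattern = "$_(?:$pattern)?" for reverse @chars;

    return unless $string =~ /($pattern)/;
    return $+[1];    # end offset of group 1, taken directly from the engine
}

print partial_literal_end('abcdefghijklmnopqrstuvwxyz', 'gho'), "\n";
```

Note that this reports the prefix found at the leftmost occurrence of the first character, which may not be the longest partial match anywhere in the string.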
+4

This seems to work. Basically, the idea is to split the regular expression into its component parts and try them one after another, returning the last position that matched. Fixed strings need to be split up, but character classes and quantifiers can be kept together.

In theory, this should work, but tuning may be required.

    use v5.10;
    use strict;
    use warnings;

    my $string = 'abcdefghijklmnopqrstuvwxyz';
    my $match  = partial_match($string, qw(gh (?=i) [ijkx]+ [lmn]+ z));
    say "match ended at pos $match, character ", substr($string, $match, 1);

    # Try the first piece, then keep appending the next piece to it for as
    # long as the combined pattern still matches; return the last end position.
    sub partial_match {
        my $string = shift;
        my @rx     = @_;
        my $pos;
        if ($string =~ /$rx[0]/g) {
            $pos = pos $string;
            if (defined $rx[1]) {
                # Merge the first two pieces and recurse on the longer pattern.
                splice @rx, 0, 2, $rx[0] . $rx[1];
                $pos = partial_match($string, @rx) // $pos;
            }
            return $pos;
        }
        else {
            say "Didn't match $rx[0]";
            return;
        }
    }
+1

What about:

    #!/usr/bin/perl
    use Modern::Perl;

    my $x = 'abcdefghijklmnopqrstuvwxyz';
    my $s = 'gho';
    do {
        if ($x =~ /$s/) {
            say "$s matches from $-[0] to $+[0]";
        }
        else {
            say "$s doesn't match";
        }
    } while chop $s;

output:

    gho doesn't match
    gh matches from 6 to 8
    g matches from 6 to 7
     matches from 0 to 0
0

I think this is exactly what the pos function is for. Note that pos is only set if you use the /g flag.

    my $x   = 'abcdefghijklmnopqrstuvwxyz';
    my $end = 0;
    if ( $x =~ /$ARGV[0]/g ) {
        $end = pos($x);
    }
    print "End of match is: $end\n";

This gives the following output:

    [@centos5 ~]$ perl x.pl
    End of match is: 0
    [@centos5 ~]$ perl x.pl def
    End of match is: 6
    [@centos5 ~]$ perl x.pl xyz
    End of match is: 26
    [@centos5 ~]$ perl x.pl aaa
    End of match is: 0
    [@centos5 ~]$ perl x.pl ghi
    End of match is: 9
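One caveat, which the question already hints at: pos only tells you where a successful match ended. A small sketch of what happens on failure, using the strings from the question (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $x = 'abcdefghijklmnopqrstuvwxyz';

# A successful /g match leaves pos($x) at the end of the match.
$x =~ /ghi/g;
my $after_success = pos($x);

# A failed /g match resets pos($x) to undef (unless /c is also given),
# so it carries no information about how far the engine got.
$x =~ /gho/g;
my $after_failure = pos($x);

print "after success: $after_success\n";
print "after failure: ", defined $after_failure ? $after_failure : 'undef', "\n";
```

So for a pattern like gho this approach reports 0, the same as for a pattern that shares nothing with the string.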
0
