Strange regexp problem in perl, alternate attempts match

Question

Consider the following perl script:

    #!/usr/bin/perl
    my $str = 'not-found=1,total-found=63,ignored=2';
    print "1. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);
    print "2. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);
    print "3. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);
    print "4. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);
    print "Bye!\n";

Result after launch:

    1. matched using regex
    3. matched using regex
    Bye!

The same regular expression matches once and then fails on the very next attempt. Any idea why only alternate attempts to match the same string with the same regex succeed in Perl?

Thanks!

+4

regex perl

amitsaurav Apr 4 '13 at 17:45

3 answers

Here is a long explanation of why your code is not working.

The /g modifier changes the behavior of the regular expression to "global matching". This will match all occurrences of the pattern in the string. However, how this is done depends on the context. The two (main) contexts in Perl are the list context (plural) and the scalar context (singular).

In list context, a global match returns a list of all matched substrings or, if the pattern contains capture groups, a flat list of all captures:

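    use feature 'say';          # say() needs this (or "use v5.10")

    local $_ = "foobaa";        # lexical "my $_" was removed in Perl 5.24
    my $regex = qr/[aeiou]/;
    my @matches = /$regex/g;    # list context: match all vowels
    say "@matches";             # "o o a a" (elements are joined with spaces)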
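The "flat list of captures" case can be sketched with the string from the question. The key/value pattern and the variable names here are illustrative assumptions, not part of the original answer:

```perl
use strict;
use warnings;

# The string from the question; the pattern below is a hypothetical example.
my $str = 'not-found=1,total-found=63,ignored=2';

# Two capture groups per match, so list context yields key, value, key, value, ...
my @flat = $str =~ /([\w-]+)=(\d+)/g;

# A flat list of pairs can be assigned straight to a hash.
my %count = @flat;

print scalar(@flat), " items\n";
print "total-found is $count{'total-found'}\n";
```

Because the match is in list context, it consumes the whole string in one go and does not leave a position behind for a later scalar match to trip over.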

In scalar context, the match appears to simply return a Perl boolean that says whether the regular expression matched:

    my $match = /$regex/g;
    say $match;    # "1" (on failure: the empty string)

However, the regular expression has turned into an iterator. Each time the match is executed, it starts at the current position in the string and tries to match from there. If it matches, it returns true and the position moves to the end of that match. If the match fails, then

the match returns false, and
the current position in the string is reset to the beginning.

Because the position was reset, the next attempt starts over from the beginning of the string and can succeed again.

    my $match;
    say $match while $match = /$regex/g;
    say "The match returned false, or the while loop would have gone on forever";
    say "But we can match again" if /$regex/g;

The second effect, resetting the position, can be suppressed with the additional /c flag.
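As a minimal sketch of the difference, the following compares a failed /g match with a failed /gc match, using Perl's built-in pos function to inspect the position. The sample string and variable names are made up for illustration:

```perl
use strict;
use warnings;

my $str = 'aXbX';    # hypothetical sample string
my $ok;              # scalar assignment forces scalar context

# Without /c: a failed match resets the position.
$ok = $str =~ /X/g;     # succeeds, position is now 2
$ok = $str =~ /Z/g;     # fails, position is reset
my $after_g = defined pos($str) ? pos($str) : 'undef';

# With /c: a failed match leaves the position alone.
$ok = $str =~ /X/gc;    # succeeds from the start again, position is 2
$ok = $str =~ /Z/gc;    # fails, position stays at 2
my $after_gc = defined pos($str) ? pos($str) : 'undef';

print "after failed /g:  $after_g\n";
print "after failed /gc: $after_gc\n";
```

Note that a reset position is reported as undef rather than 0; either way the next match starts at the beginning of the string.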

The position in the string is accessible through the pos function: pos($string) returns the current position, and it can also be assigned to, as in pos($string) = 0 .
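A short sketch of reading and assigning pos, using the string from the question. The shortened pattern /found=(\d+)/ is an assumption chosen so that the string contains more than one match:

```perl
use strict;
use warnings;

my $str = 'not-found=1,total-found=63,ignored=2';
my @seen;

# Each scalar-context /g match continues where the previous one ended.
if ($str =~ /found=(\d+)/g) { push @seen, "value=$1 pos=" . pos($str) }
if ($str =~ /found=(\d+)/g) { push @seen, "value=$1 pos=" . pos($str) }

# Rewind by hand: the next match starts at offset 0 again.
pos($str) = 0;
if ($str =~ /found=(\d+)/g) { push @seen, "value=$1 pos=" . pos($str) }

print "$_\n" for @seen;
```

The first two matches report positions 11 and 26; after the manual rewind the third match finds the first value again.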

A regular expression can also be anchored at the current position with the \G assertion, just as ^ anchors it at the beginning of the string.
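To illustrate what the anchor changes, this sketch runs the same digit pattern with and without \G. The sample string and variable names are hypothetical:

```perl
use strict;
use warnings;

my $str = 'abc123';    # hypothetical sample string

# Unanchored: the engine scans forward from the current position until it finds digits.
pos($str) = 0;
my $unanchored = $str =~ /\d+/g ? 1 : 0;

# Anchored with \G: the match must begin exactly at the current position.
pos($str) = 0;
my $anchored_at_0 = $str =~ /\G\d+/gc ? 1 : 0;    # offset 0 holds "a", so this fails

pos($str) = 3;
my $anchored_at_3 = $str =~ /\G\d+/gc ? 1 : 0;    # offset 3 holds "1", so this succeeds

print "unanchored: $unanchored, \\G at 0: $anchored_at_0, \\G at 3: $anchored_at_3\n";
```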

This m//gc style of matching makes it easy to write a tokenizer:

    my @tokens;
    local $_ = "1, abc, 2 ";    # lexical "my $_" was removed in Perl 5.24
    # pos() is undef before the first match, so treat that as offset 0
    TOKEN: while ((pos($_) // 0) < length($_)) {
        /\G\s+/gc and next;    # skip whitespace
        # if one of the following matches fails, the next token is tried
        if    (/\G(\d+)/gc) { push @tokens, [NUM => $1] }
        elsif (/\G,/gc)     { push @tokens, ['COMMA'] }
        elsif (/\G(\w+)/gc) { push @tokens, [STR => $1] }
        else                { last TOKEN }    # break the loop only if nothing matched at this position
    }
    say "[@$_]" for @tokens;

Output:

    [NUM 1]
    [COMMA]
    [STR abc]
    [COMMA]
    [NUM 2]

+3

amon Apr 4 '13 at 20:37

  my $str = 'not-found=1,total-found=63,ignored=2';
  print "1. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);

matches total-found=63, and pos($str), the starting offset for the next match attempt, is set to 26.

  print "2. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);

matches nothing, because there is no further total-found= after offset 26, so pos($str) is reset and the next attempt starts at offset 0.

That's why

  print "3. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);

matches total-found=63 again, and pos($str) is once more set to offset 26 for the next attempt, so that

  print "4. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);

fails again, just like the second attempt, resetting pos($str) so that matching would start at offset 0.

  print "Bye!\n";
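The walk-through above can be checked with a small trace that records pos($str) after each attempt. This is only a sketch; the loop and the variable names are additions, and the four attempts are run in a loop rather than as four separate statements:

```perl
use strict;
use warnings;

my $str = 'not-found=1,total-found=63,ignored=2';
my @trace;

for my $attempt (1 .. 4) {
    # Same match as in the question, evaluated in scalar (boolean) context.
    my $result = $str =~ m/total-found=(\d+)/g ? 'match' : 'no match';

    # After a failed /g match, pos() returns undef, meaning "start from offset 0".
    my $pos = defined pos($str) ? pos($str) : 'undef';

    push @trace, "$attempt: $result, pos=$pos";
}

print "$_\n" for @trace;
```

Attempts 1 and 3 match and leave the position at 26; attempts 2 and 4 fail and leave it reset.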

+1

Jense Jul 22 '14 at 12:18

Yley · Accepted Answer · 2013-04-04T17:51:33+0000

Get rid of the m and the g on your regex; they don't do what you want.

    print "1. matched using regex\n" if ($str =~ /total-found=(\d+)/);
    print "2. matched using regex\n" if ($str =~ /total-found=(\d+)/);
    print "3. matched using regex\n" if ($str =~ /total-found=(\d+)/);
    print "4. matched using regex\n" if ($str =~ /total-found=(\d+)/);

In particular, m is optional in this context: m/foo/ is exactly the same as /foo/ . The real problem is that g does a bunch of things that you don't want in this context, most importantly remembering a match position on the string between attempts. See perlretut for more details.
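If the goal is also to get at the captured number, one possible approach (a sketch, not part of the original answer; the names $hits and $total are made up) is to drop /g and capture in list context:

```perl
use strict;
use warnings;

my $str = 'not-found=1,total-found=63,ignored=2';

# Without /g no position is remembered, so every attempt starts from offset 0.
my $hits = 0;
for my $attempt (1 .. 4) {
    $hits++ if $str =~ /total-found=(\d+)/;
}

# List assignment puts the first capture group into $total.
my ($total) = $str =~ /total-found=(\d+)/;

print "$hits of 4 attempts matched\n";
print "total-found is $total\n";
```

The parentheses around $total matter: they put the match in list context, so it returns the captured text rather than a boolean.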
