Perl non-greedy regex problem

I have a problem with a non-greedy regex. I have seen other questions about non-greedy regular expressions, but they do not answer mine.

Problem: I am trying to match the href of the "lol" anchor.

Note. I know that this can be done with Perl HTML parsing, and my question is not about parsing HTML in perl. My question is about the regular expression itself, and HTML is just an example.

Test case: I have 4 tests each for .*? and [^"]. The first 2 tests give the expected result. However, the third does not, and the fourth does, but I do not understand why.

Questions:

  • Why does the third test fail for both .*? and [^"]? Shouldn't the non-greedy operator work?
  • Why does the 4th test work for both .*? and [^"]? I don't understand why putting .* in front changes the regular expression. (The 3rd and 4th tests are the same except for the .* in front.)

I probably don't understand how this regex works. A recipe in the Perl Cookbook mentions something, but I don't think it answers my question.

    use strict;

    my $content=<<EOF;
    <a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
    <a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
    <a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
    <a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
    <a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
    EOF

    print "| $1 | \n\nThat ok\n" if $content =~ m~href="(.*?)"~s ;
    print "\n---------------------------------------------------\n";
    print "| $1 | \n\nThat ok\n" if $content =~ m~href="(.*?)".*>lol~s ;
    print "\n---------------------------------------------------\n";
    print "| $1 | \n\nWhy does not the 2nd non-greedy '?' work?\n" if $content =~ m~href="(.*?)".*?>lol~s ;
    print "\n---------------------------------------------------\n";
    print "| $1 | \n\nIt now works if I put the '.*' in the front?\n" if $content =~ m~.*href="(.*?)".*?>lol~s ;

    print "\n###################################################\n";
    print "Let try now with [^]";
    print "\n###################################################\n\n";

    print "| $1 | \n\nThat ok\n" if $content =~ m~href="([^"]+?)"~s ;
    print "\n---------------------------------------------------\n";
    print "| $1 | \n\nThat ok.\n" if $content =~ m~href="([^"]+?)".*>lol~s ;
    print "\n---------------------------------------------------\n";
    print "| $1 | \n\nThe 2nd greedy still doesn't work?\n" if $content =~ m~href="([^"]+?)".*?>lol~s ;
    print "\n---------------------------------------------------\n";
    print "| $1 | \n\nNow with the '.*' in front it does.\n" if $content =~ m~.*href="([^"]+?)".*?>lol~s ;
4 answers

Try printing $& (the text matched by the entire regular expression) as well as $1. This may give you a better idea of what is going on.
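As a rough sketch of that suggestion, here is the third test run against a shortened, made-up version of the input (the three-link $content and the variable names $whole and $group are my own, not from the question). It prints both the whole match and the captured group:

```perl
use strict;
use warnings;

# Shortened, hypothetical stand-in for the question's $content.
my $content = <<'EOF';
<a href="/hoh" class="hoh">hoh</a>
<a href="/lol" class="lol">lol</a>
<a href="/koo" class="koo">koo</a>
EOF

my ($whole, $group) = ('', '');
if ($content =~ m~href="(.*?)".*?>lol~s) {
    # Copy the match variables right away; the next successful match overwrites them.
    ($whole, $group) = ($&, $1);
    print "whole match: [$whole]\n";
    print "group 1:     [$group]\n";
}
```

The whole match begins at the first href= and runs up to >lol, which is why the group holds the first link rather than the "lol" one.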

The problem, I think, is that .*? doesn't mean "Of all possible matches, find the one that uses the fewest characters." It just means "First, try matching 0 characters here and go on to match the rest of the regular expression. If that fails, try matching 1 character. If the rest of the regular expression still doesn't match, try 2 characters here, etc."

Perl will always find the match that starts closest to the beginning of the string. Since most of your patterns start with href=, it will find the first href= in the string and see whether there is any way to expand the quantifiers to get a match starting there. Only if it cannot get a match will it try starting at the next href=, etc.

When you add a greedy .* to the beginning of the regular expression, matching begins with .* grabbing as many characters as possible. Perl then backtracks to find href=. Essentially, this makes it try the last href= in the string first and work back toward the beginning of the string.
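To make the two behaviours concrete, here is a small sketch on a made-up two-link string (the string $s and the variable names are assumptions for illustration). It records the captured group and $-[0], the offset where the overall match starts, with and without the leading .*:

```perl
use strict;
use warnings;

# Hypothetical two-link input, much shorter than the question's.
my $s = '<a href="/hoh">hoh</a> <a href="/lol">lol</a>';

# Without a leading .*, the engine keeps the first start position that can succeed.
my ($first, $first_start);
if ($s =~ m~href="(.*?)".*?>lol~s) {
    ($first, $first_start) = ($1, $-[0]);
}

# With a leading greedy .*, the match starts at offset 0, but .* gives characters
# back from the right, so the last usable href= is the first one tried.
my ($last, $last_start);
if ($s =~ m~.*href="(.*?)".*?>lol~s) {
    ($last, $last_start) = ($1, $-[0]);
}

print "no prefix:   group=$first, match starts at offset $first_start\n";
print "with prefix: group=$last, match starts at offset $last_start\n";
```

In both cases the match starts as far left as it can; the leading .* only changes which href= ends up next to the capture group.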

+6

Only the 4th test case works.

First: m~href="(.*?)"~s This will match the first href in your string and grab what's between the quotes, like this: /hoh/hoh/hoh/hoh/hoh

Second: m~href="(.*?)".*>lol~s This will match the first href in your string and grab whatever is between the quotes, and then match any number of characters until it finds >lol, like this: /hoh/hoh/hoh/hoh/hoh

Try capturing the .* with m~href="(.*?)"(.*)>lol~s

    $1 contains : /hoh/hoh/hoh/hoh/hoh
    $2 contains : class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol"

Third: m~href="(.*?)".*?>lol~s The same result as the previous test.

Fourth: m~.*href="(.*?)".*?>lol~s This will match any number of any characters, then href=", then capture (non-greedily) any number of characters up to a quote, and then match any number of characters until it finds >lol, like this: /lol/lol/lol/lol/lol

Try capturing every .* with m~(.*)href="(.*?)"(.*?)>lol~s

    $1 contains : <a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a
    $2 contains : /lol/lol/lol/lol/lol
    $3 contains : class="lol"

Check out this site for an explanation of what your regular expressions do.

0

The main problem is that you use non-greedy quantifiers when you do not need them. The second problem is using . with *, which may accidentally match more than you intended. The s flag you use makes . match even more, since it then matches newlines too.

Use:

    m~href="([^"]+)"[^>]*>lol~

for your case. As for non-greedy regular expressions, consider this code:

    $_ = "xaaaaab xaaac xbbc";
    m~^x.+?c~;

It will not match xaaac, as you might expect. It will start at the beginning of the string and match xaaaaab xaaac. The greedy version will match the entire string.
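A quick way to check this is to capture what each variant matches. This is only a sketch: the capturing parentheses and the third pattern, which drops the ^ anchor and forbids a second x inside the match, are my additions for comparison.

```perl
use strict;
use warnings;

my $str = "xaaaaab xaaac xbbc";

my ($lazy)   = $str =~ m~^(x.+?c)~;    # non-greedy: stops at the first c it can reach
my ($greedy) = $str =~ m~^(x.+c)~;     # greedy: runs to the last c
my ($short)  = $str =~ m~(x[^x]+?c)~;  # hypothetical alternative: no x allowed inside

print "lazy:   [$lazy]\n";
print "greedy: [$greedy]\n";
print "short:  [$short]\n";
```

Only the third pattern yields xaaac, because the attempt starting at the first x cannot reach a c without crossing another x, so the engine moves on to the next start position.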

The bottom line is that although non-greedy quantifiers do not try to grab as much as possible, they still try to make the match succeed with the same zeal as their greedy brothers. And they will grab any part of the string to do so.

You can also consider a possessive quantifier, which disables backtracking. Cookbooks are good places to start, but if you want to understand how things work, you should read perlre.
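For a feel of what "disables backtracking" means, here is a tiny sketch on a made-up string (the string and variable names are assumptions; possessive quantifiers need Perl 5.10 or later):

```perl
use strict;
use warnings;

my $s = "aaab";

# a+ grabs all three a's, then gives one back so the literal "ab" can match.
my $backtracking = ($s =~ /a+ab/)  ? 1 : 0;

# a++ grabs the a's and never gives any back, so the following "a" cannot match.
my $possessive   = ($s =~ /a++ab/) ? 1 : 0;

print "a+ab  matched: $backtracking\n";
print "a++ab matched: $possessive\n";
```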

0

Let me illustrate what happens here (see the other answers for why it happens):

href="(.*?)"

Match: href="/hoh/hoh/hoh/hoh/hoh"

Group: /hoh/hoh/hoh/hoh/hoh

href="(.*?)".*>lol

Match: href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol

Group: /hoh/hoh/hoh/hoh/hoh

href="([^"]+?)".*?>lol

Match: href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol

Group: /hoh/hoh/hoh/hoh/hoh

.*href="(.*?)".*?>lol

Match: <a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol

Group: /lol/lol/lol/lol/lol

One way to write the regular expression you want: href="[^"]*"[^>]*>lol
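Here is a minimal sketch of that pattern run against the question's input. The capturing parentheses around [^"]* and the variable name $href are my additions, so that the href can be printed:

```perl
use strict;
use warnings;

my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

# [^"]* cannot run past the closing quote and [^>]* cannot run past the end of
# the tag, so an attempt starting at an earlier href= fails instead of stretching.
my ($href) = $content =~ m~href="([^"]*)"[^>]*>lol~;

print defined $href ? "href: $href\n" : "no match\n";
```

Because neither negated class can cross a tag boundary, the earlier anchors are rejected and the engine moves on until it reaches the "lol" link.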

0
