Is it possible to find the smallest match when using greedy quantifiers?

Disclaimer: I am not a regular expression specialist.

I am using the Python re module to match regular expressions in many HTML files. One of the patterns looks something like this:

<bla><blabla>87765.*</blabla><bla> 

The problem I ran into is that instead of finding all (say) five occurrences of the pattern, it finds only one. It merges all occurrences into a single match that runs from the first <bla><blabla>87765 to the last </blabla><bla> on the page.

Is it possible to tell re to find the smallest match?

+4
4 answers

You can use a reluctant (non-greedy) quantifier in your pattern (for more information, see

+13

The Python re module supports non-greedy matching. You just add ? after the quantifier in the pattern, for example .*? . You can find out more in this HOWTO.
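As a minimal sketch of the difference, the snippet below runs the greedy and the non-greedy form of the pattern from the question against a made-up page. The `page` string and its contents are assumptions for illustration only, not data from the question.

```python
import re

# Hypothetical sample page: three occurrences of the pattern on one line.
page = (
    "<bla><blabla>87765 first</blabla><bla>"
    "<bla><blabla>87765 second</blabla><bla>"
    "<bla><blabla>87765 third</blabla><bla>"
)

# Greedy: .* runs to the end of the line, then backtracks to the LAST
# </blabla><bla>, so everything is merged into a single match.
greedy = re.findall(r"<bla><blabla>87765.*</blabla><bla>", page)

# Non-greedy: .*? stops at the FIRST </blabla><bla> it can reach,
# so each occurrence is reported separately.
lazy = re.findall(r"<bla><blabla>87765.*?</blabla><bla>", page)

print(len(greedy))  # number of greedy matches
print(len(lazy))    # number of non-greedy matches
```

If an occurrence can span several lines, passing `re.DOTALL` lets `.` match newlines as well.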

+1

I believe the regex <bla><blabla>87765.*?</blabla><bla> can produce catastrophic backtracking. Instead, use:

 <bla><blabla>87765[^<]*</blabla><bla>

Using atomic grouping (I'm not sure Python supports this), the above regex becomes:

 <bla><blabla>87765(?>(.*?<))/blabla><bla>

Everything between (?> and ) is treated as one single token by the regex engine once the engine leaves the group. Because the entire group is a single token, no backtracking can take place inside it once the regex engine has found a match for the group. If backtracking is required, the engine has to backtrack to the regex token before the group (the caret in our example). If there is no token before the group, the engine must retry the entire regex at the next position in the string. Please note that I needed to include "<" in the group to ensure atomicity. Close enough.
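Here is a hedged sketch of both alternatives in Python. The `page` string is again a made-up sample. The atomic-group pattern is adapted from the one above (with the capturing parentheses dropped) and is only tried on Python 3.11 or later, the release in which the re module gained `(?>...)`. Older versions reject that syntax, so the code checks the version first.

```python
import re
import sys

# Hypothetical sample page: three occurrences of the pattern on one line.
page = (
    "<bla><blabla>87765 first</blabla><bla>"
    "<bla><blabla>87765 second</blabla><bla>"
    "<bla><blabla>87765 third</blabla><bla>"
)

# Negated character class: [^<]* can never run past the next "<",
# so there is nothing for the engine to backtrack over.
negated = re.findall(r"<bla><blabla>87765[^<]*</blabla><bla>", page)

# Atomic group: once (?>.*?<) has matched up to the first "<", the engine
# will not go back into the group to try a longer match.
atomic = None
if sys.version_info >= (3, 11):
    atomic = re.findall(r"<bla><blabla>87765(?>.*?<)/blabla><bla>", page)

print(negated)
print(atomic)
```

Both variants assume the text between the tags contains no "<" of its own. If nested tags can appear there, these patterns will fail to match that occurrence, and the non-greedy .*? form is the more forgiving choice.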

+1

Um... there is a way to tell re to find the smallest match, and that is precisely by using non-greedy quantifiers.

 <bla><blabla>87765.*?</blabla><bla> 

I can't imagine why you would want to do this using greedy quantifiers.

0