Regular expression matching pattern

In the regex engines of all the languages I'm familiar with, the notation .* indicates matching zero or more characters. Consider the following JavaScript code:

    var s = "baaabcccb";
    var pattern = new RegExp("b.*b");
    var match = pattern.exec(s);
    if (match) alert(match);

This displays baaabcccb

The same thing happens with Python:

    >>> import re
    >>> s = "baaabcccb"
    >>> m = re.search("b.*b", s)
    >>> m.group(0)
    'baaabcccb'

Why do both of these languages match "baaabcccb" and not just "baaab"? The way I read the pattern b.*b is "find a substring that starts with b, then has any number of other characters, and then ends with b". Both baaab and baaabcccb satisfy this requirement, but both JavaScript and Python match the latter. I would expect it to match baaab, simply because this substring satisfies the requirement and appears first.

So why does the pattern match baaabcccb in this case? And is there a way to change this behavior (in any language) so that it matches baaab ?

3 answers

You can make the regex non-greedy by adding ? after *, like this: b.*?b . It will then match the shortest possible string. By default, regex quantifiers are greedy and try to find the longest possible match.
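As a minimal sketch of the difference, here is the string from the question matched both ways in Python (the variable names greedy and lazy are just for illustration):

```python
import re

s = "baaabcccb"

# Greedy: .* takes as much as it can, so the match runs to the last "b".
greedy = re.search(r"b.*b", s).group(0)

# Non-greedy: .*? takes as little as it can, so the match stops at the
# first "b" that completes the pattern.
lazy = re.search(r"b.*?b", s).group(0)

print(greedy)  # baaabcccb
print(lazy)    # baaab
```

The same b.*?b pattern should behave the same way in JavaScript, since its regex engine also supports the ? suffix on quantifiers.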


.* is a greedy match. .*? is the non-greedy version.


Because * and + are greedy by default (at least in Python; I'm not sure about JS). They will try to match as much as possible. If you want to avoid this, you can add ? after them.

There is a great tutorial on this, in the section on greedy and non-greedy matching: Google's Python Class
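To illustrate that + follows the same rule as *, here is a small sketch in Python, again using the string from the question (the variable names are only illustrative):

```python
import re

s = "baaabcccb"

# Greedy +: one or more "a", as many as possible.
greedy_plus = re.search(r"a+", s).group(0)

# Non-greedy +?: one or more "a", as few as possible.
lazy_plus = re.search(r"a+?", s).group(0)

print(greedy_plus)  # aaa
print(lazy_plus)    # a
```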
