How to make Python negative lookbehind less greedy?

I have read all the related posts and searched the internet, but this one really has me stumped.

I have a text containing a date.
I would like to fix the date, but not if it is preceded by a certain phrase.

A simple solution is to add a negative lookbehind to my RegEx.

Here are some examples (using findall).
I only want to fix the date, if it is not preceded by the phrase "as of".

19-2-11
something 15-4-11
such and such as of 29-5-11

Here is my regex:

(?<!as of )(\d{1,2}-\d{1,2}-\d{2}) 

Expected results:

['19-2-11']
['15-4-11']
[]

Actual Results:

['19-2-11']
['15-4-11']
['9-5-11']
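
For reference, here is a minimal script that should reproduce these results with re.findall. The variable names are only illustrative and are not taken from my real code.

```python
import re

# The regex from the question, unchanged.
pattern = re.compile(r'(?<!as of )(\d{1,2}-\d{1,2}-\d{2})')

samples = [
    '19-2-11',
    'something 15-4-11',
    'such and such as of 29-5-11',
]

# One findall call per sample line.
results = [pattern.findall(s) for s in samples]
for r in results:
    print(r)
```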

Note that the match starts with 9, not 29. If I change the first \d{1,2} to something fixed, like \d{2}:

 bad regex for testing: (?<!as of )(\d{2}-\d{1,2}-\d{2}) 

Then I get the expected results. Of course, this is no good, because I want to match single-digit days as well as 2-digit days.

Apparently, my negative lookbehind is greedier than my date capture, so it steals a digit from the date and the exclusion fails. I have tried every way of correcting greediness that I can think of, but I just do not know how to fix it.

I would like my date capture to match with maximum greediness, and only then have my negative lookbehind applied. Is that possible? My problem seemed like a good use of negative lookbehind and not too complicated. I'm sure I can do it another way if necessary, but I would like to learn how to do it this way.

+8
python regex
3 answers

The cause is not that the lookbehind is greedy. It is that the regex engine tries to match the pattern at every position it can.

It advances through the string such and such as of 29-5-11. At each position (?<!as of ) succeeds, but \d{1,2} fails.

Then the engine reaches the position such and such as of !29-5-11 (marked with ! ). Here (?<!as of ) fails.

So it moves on to the next position: such and such as of 2!9-5-11 . Here (?<!as of ) succeeds, and so do \d{1,2} and the rest of the pattern.
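
The walk described above can be observed directly. This is only a sketch: it uses the pos argument of a compiled pattern's match method to try each start position in turn, which approximates what the engine does during a search. Unlike slicing the string, pos still lets the lookbehind see the text before the start position.

```python
import re

pattern = re.compile(r'(?<!as of )(\d{1,2}-\d{1,2}-\d{2})')
text = 'such and such as of 29-5-11'

# Try to match at every start position, one at a time.
matches_by_pos = {}
for pos in range(len(text)):
    m = pattern.match(text, pos)
    if m:
        matches_by_pos[pos] = m.group(1)

# Index 20 is the "2" (rejected by the lookbehind); index 21 is the "9".
print(matches_by_pos)  # {21: '9-5-11'}
```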

How to avoid it?

The general solution is to make the pattern as explicit as possible.

In this case, I would require the date to be preceded by whitespace or the start of the line.

 (?<!as of)(?:^|\s+)(\d{1,2}-\d{1,2}-\d{2}) 
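
As a quick check, here is a small sketch that runs this pattern over the three sample lines from the question. The variable names are illustrative.

```python
import re

# Date must be preceded by start-of-line or whitespace, and that
# whitespace must not itself be preceded by "as of".
pattern = re.compile(r'(?<!as of)(?:^|\s+)(\d{1,2}-\d{1,2}-\d{2})')

samples = [
    '19-2-11',
    'something 15-4-11',
    'such and such as of 29-5-11',
]

results = [pattern.findall(s) for s in samples]
print(results)  # [['19-2-11'], ['15-4-11'], []]
```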

Mark Byers' solution is also very good.

I think it’s very important to understand why the regex engine behaves this way and gives undesirable results.

By the way, the solution I gave above does not work if there are 2 or more spaces before the date. It fails because the pattern above can start matching at the position after the first space, marked ! here: such and such as of ! 29-5-11.

What can be done to avoid this?

Unfortunately, lookbehind in Python's regex engine does not support the + or * quantifiers.
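
A short sketch of what happens if you try anyway with the standard re module. The pattern here is a hypothetical attempt, not one from the question.

```python
import re

try:
    # A variable-width lookbehind: "as of" followed by one or more spaces.
    re.compile(r'(?<!as of\s+)\d{1,2}-\d{1,2}-\d{2}')
    error_message = None
except re.error as exc:
    # The re module rejects the pattern at compile time.
    error_message = str(exc)

print(error_message)
```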

I think the simplest solution is to also require that no whitespace precedes (?:^|\s+) . This forces (?:^|\s+) to consume all the whitespace right after the preceding non-whitespace text, so if that text is as of , the match fails there and the engine moves on to try the following positions in the searched text.

 re.search(r'(?<!as of)(?<!\s)(?:^|\s+)(\d{1,2}-\d{1,2}-\d{2})', 'such and such as of 29-5-11')  # returns None: the date after "as of" is not matched
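
To compare the two versions, here is a sketch that feeds both patterns an input with two spaces after "as of". The names first and second are just labels for this example.

```python
import re

first = re.compile(r'(?<!as of)(?:^|\s+)(\d{1,2}-\d{1,2}-\d{2})')
second = re.compile(r'(?<!as of)(?<!\s)(?:^|\s+)(\d{1,2}-\d{1,2}-\d{2})')

two_spaces = 'such and such as of  29-5-11'  # two spaces before the date

# The first pattern can start at the second space and still match.
first_result = first.findall(two_spaces)
# The extra (?<!\s) forbids starting in the middle of a whitespace run.
second_result = second.findall(two_spaces)
# Dates not preceded by "as of" are still found, even after two spaces.
plain_result = second.findall('something  15-4-11')

print(first_result, second_result, plain_result)
# ['29-5-11'] [] ['15-4-11']
```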
+1

This has nothing to do with greediness. Greediness does not change whether a regular expression matches or not; it only changes the order in which the candidate matches are tried. The problem here is that your regex needs to be more specific in order to avoid unwanted matches.
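
As an illustration, making the quantifiers lazy should give exactly the same unwanted match on the problem input. The lazy variant below is my own example, not something from the question.

```python
import re

text = 'such and such as of 29-5-11'

# Original (greedy) quantifiers.
greedy = re.findall(r'(?<!as of )(\d{1,2}-\d{1,2}-\d{2})', text)
# Same pattern with lazy quantifiers on the day and month.
lazy = re.findall(r'(?<!as of )(\d{1,2}?-\d{1,2}?-\d{2})', text)

print(greedy, lazy)  # ['9-5-11'] ['9-5-11']
```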

To fix this, you could require a word boundary just before your match:

 (?<!as of )\b(\d{1,2}-\d{1,2}-\d{2})
            ^^ add this
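
A quick sketch that checks this against the three sample lines from the question:

```python
import re

# \b stops the match from starting between the "2" and the "9" of "29".
pattern = re.compile(r'(?<!as of )\b(\d{1,2}-\d{1,2}-\d{2})')

samples = [
    '19-2-11',
    'something 15-4-11',
    'such and such as of 29-5-11',
]

results = [pattern.findall(s) for s in samples]
print(results)  # [['19-2-11'], ['15-4-11'], []]
```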
+7

A simple solution is to strip out every "as of" date first, and only then use a regular expression to isolate the remaining dates.
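
One possible reading of this suggestion, as a two-step sketch. The sample text and variable names are made up for illustration, and this only locates the remaining dates; it does not rewrite them in the original text.

```python
import re

date = r'\d{1,2}-\d{1,2}-\d{2}'
text = 'fix 19-2-11 but keep the value as of 29-5-11 and fix 15-4-11'

# Step 1: remove every date that directly follows "as of ".
stripped = re.sub(r'as of ' + date, '', text)

# Step 2: collect the dates that are left.
dates = re.findall(date, stripped)

print(dates)  # ['19-2-11', '15-4-11']
```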

-1