Negative lookahead expression not working in Python

Task:
- given: a list of image file names
- todo: create a new list with the file names that do not contain the word "thumb", i.e. target only the non-thumbnail images (with PIL, the Python Imaging Library).

I tried r".*(?!thumb).*", but it failed.

I found a solution (here on Stack Overflow): prepend ^ to the regex and put the .* inside the negative lookahead: r"^(?!.*thumb).*". Now it works.

The thing is, I would like to understand why my first solution did not work, but I can't. Since regular expressions are complicated enough, I would really like to understand them.

I understand that ^ tells the parser that the following condition must match at the beginning of the string. But doesn't the .* in the (non-working) first example also start at the beginning of the string? I thought it would start at the beginning and consume as many characters as possible until it reaches "thumb". If it did, it would report a non-match.

Can someone explain why r".*(?!thumb).*" does not work, but r"^(?!.*thumb).*" does?

Thanks!

3 answers

(Darn, John beat me to it. Oh well, you can look at the examples anyway.)

As the others have said, regex is not the best tool for this job. If you are working with file paths, take a look at os.path.

As for filtering out the files you don't want, you can do if 'thumb' not in filename: ... once you have split the path (where filename is a str).
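A minimal sketch of that suggestion, assuming the list holds full paths. The paths below are made up for illustration; os.path.basename is used so that only the file name is checked, not the directories above it.

```python
import os.path

# Hypothetical paths, invented for this example.
paths = [
    "/tmp/thumbs/photo1.jpg",
    "/tmp/thumbs/photo1-thumb.jpg",
    "/tmp/somewhere/photo2.jpg",
]

# Test only the last path component, so a directory named "thumbs"
# does not exclude every file stored inside it.
non_thumbs = [p for p in paths if "thumb" not in os.path.basename(p)]

print(non_thumbs)
```

Whether the directory names should count as well depends on how your files are laid out; drop the os.path.basename call if they should.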

And for posterity, here are my thoughts on the regex. r".*(?!thumb).*" does not work because .* is greedy: it consumes the whole string, so the lookahead is only checked at the very end, where nothing follows and it trivially succeeds. Take a look at this:

    >>> import re
    >>> re.search('(.*)((?!thumb))(.*)', '/tmp/somewhere/thumb').groups()
    ('/tmp/somewhere/thumb', '', '')
    >>> re.search('(.*?)((?!thumb))(.*)', '/tmp/somewhere/thumb').groups()
    ('', '', '/tmp/somewhere/thumb')
    >>> re.search('(.*?)((?!thumb))(.*?)', '/tmp/somewhere/thumb').groups()
    ('', '', '')

The last one looks rather strange at first, but both lazy groups simply match as little as possible, which is nothing.

The other regular expression ( r"^(?!.*thumb).*" ) works because the .* is inside the lookahead, so you have no problem with characters being stolen. In fact, you don't even need the ^, depending on whether you use re.match or re.search (re.match already anchors at the start of the string):

    >>> re.search('((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
    ('', 'humb')
    >>> re.search('^((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'NoneType' object has no attribute 'groups'
    >>> re.match('((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'NoneType' object has no attribute 'groups'
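To tie this back to the original task, here is a small sketch that runs both patterns from the question over a list of file names. The file names are hypothetical; the two patterns are the ones quoted in the question.

```python
import re

# Hypothetical file names, invented for this example.
filenames = ["file1.jpg", "file1-thumb.jpg", "thumb-file2.jpg"]

first = re.compile(r".*(?!thumb).*")    # the pattern that "did not work"
second = re.compile(r"^(?!.*thumb).*")  # the pattern that works

# The greedy .* swallows the whole name, the lookahead is checked at the
# end of the string, and so the first pattern accepts every name.
matched_by_first = [f for f in filenames if first.match(f)]

# The lookahead scans the whole name from position 0, so any name
# containing "thumb" is rejected.
matched_by_second = [f for f in filenames if second.match(f)]

print(matched_by_first)
print(matched_by_second)
```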

Can someone explain why r".*(?!thumb).*" does not work, but r"^(?!.*thumb).*" does?

The first will always match, because .* will consume the entire string (so there can be nothing after it that would make the negative lookahead fail). The second is a bit convoluted: anchored at the start of the line, the lookahead scans as many characters as it can until it finds "thumb", and if it finds it, the whole match fails, since the line does start with something followed by "thumb".

Number two is easier to write as:

  • 'thumb' not in string
  • not re.search('thumb', string) (using re.search instead of re.match)
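As a quick check, the sketch below applies both simpler forms and the lookahead pattern to the same hypothetical list of names and confirms they select the same items. It assumes the names contain no newline characters, since . does not match a newline by default.

```python
import re

# Hypothetical file names, invented for this example.
names = ["file1.jpg", "file1-thumb.jpg", "thumb.png", "holiday.png"]

lookahead = re.compile(r"^(?!.*thumb).*")

by_substring = [n for n in names if "thumb" not in n]
by_search = [n for n in names if not re.search("thumb", n)]
by_lookahead = [n for n in names if lookahead.match(n)]

print(by_substring)
print(by_substring == by_search == by_lookahead)
```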

Also, as I mentioned in the comments, your question says:

file names not containing the word thumb

So you may want to consider whether a name such as "thumbs up", which contains "thumb" but not as a separate word, should be excluded or not.
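If only the whole word should count, one possible approach (an assumption on my part, not something the question asks for) is a word-boundary pattern. The file names below are hypothetical.

```python
import re

# Hypothetical file names, invented for this example.
filenames = ["thumbs-up.jpg", "file1-thumb.jpg", "file2.jpg"]

# \b marks a boundary between a word character and a non-word character,
# so "thumbs" does not count as the word "thumb".
whole_word = re.compile(r"\bthumb\b")

no_substring = [f for f in filenames if "thumb" not in f]
no_whole_word = [f for f in filenames if not whole_word.search(f)]

print(no_substring)
print(no_whole_word)
```

Note that \b treats underscores and digits as word characters, so a name like file_thumb.jpg would not be seen as containing the word "thumb" by this pattern.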


Leaving regular expressions aside entirely, your task seems relatively simple:

  • given: a list of image file names
  • todo: create a new list with the file names that do not contain the word "thumb", i.e. target only the non-thumbnail images (with PIL, the Python Imaging Library).

Assuming you have a list of file names that look something like this:

    filenames = [
        'file1.jpg',
        'file1-thumb.jpg',
        'file2.jpg',
        'file2-thumb.jpg',
    ]

Then you can get a list of files not containing the word thumb, like this:

    not_thumb_filenames = [filename for filename in filenames if 'thumb' not in filename]

This is called a list comprehension, and it is essentially shorthand for:

    not_thumb_filenames = []
    for filename in filenames:
        if 'thumb' not in filename:
            not_thumb_filenames.append(filename)

Regular expressions are not needed for this simple task.
