Python regex - why don't the end-of-line anchors ($ and \Z) work with group expressions?

In Python 2.6, it seems that the end-of-line markers $ and \Z are not compatible with group expressions. For example:

 import re
 re.findall("\w+[\s$]", "green pears")

returns

 ['green '] 

(so $ apparently has no effect). And using

 re.findall("\w+[\s\Z]", "green pears") 

leads to an error:

 /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.pyc in findall(pattern, string, flags)
     175
     176     Empty matches are included in the result."""
 --> 177     return _compile(pattern, flags).findall(string)
     178
     179 if sys.hexversion >= 0x02020000:

 /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.pyc in _compile(*key)
     243         p = sre_compile.compile(pattern, flags)
     244     except error, v:
 --> 245         raise error, v # invalid expression
     246     if len(_cache) >= _MAXCACHE:
     247         _cache.clear()

 error: internal: unsupported set operator

Why does it behave this way, and how can I work around it?

+8
python regex
3 answers

A [...] expression is a character class: it matches any one character contained in it. You are therefore matching a literal $ character. A character class always applies to exactly one input character, so it can never contain an anchor.
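To see the literal-$ behaviour for yourself, here is a minimal sketch. It assumes Python 3 and a raw-string pattern; the input "green pears$" is made up for illustration and is not from the question.

```python
import re

# Inside [...], "$" is an ordinary character, not an end-of-string anchor.
pattern = re.compile(r"\w+[\s$]")

# "pears" is followed by neither whitespace nor a literal "$", so it is skipped.
no_dollar = pattern.findall("green pears")
print(no_dollar)    # ['green ']

# A literal dollar sign after the word satisfies the character class.
with_dollar = pattern.findall("green pears$")
print(with_dollar)  # ['green ', 'pears$']
```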

If you want to match either a whitespace character or the end of the string, use a non-capturing group instead, combined with the | alternation operator:

 r"\w+(?:\s|$)" 

Alternatively, take a look at the \b word boundary anchor. It matches anywhere a run of \w characters starts or ends (that is, at points in the text where a \w character is preceded or followed by a \W character, or sits at the start or end of the string).
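A rough comparison of the two approaches, assuming Python 3. The variable names are only illustrative. Note that \b consumes nothing, so the result differs slightly from the alternation form:

```python
import re

text = "green pears"

# \b is zero-width: it asserts a position and consumes no character.
boundary_words = re.findall(r"\w+\b", text)
print(boundary_words)     # ['green', 'pears']

# For comparison, the alternation form consumes the whitespace after "green".
alternation_words = re.findall(r"\w+(?:\s|$)", text)
print(alternation_words)  # ['green ', 'pears']
```

Which one you want depends on whether the trailing whitespace should be part of the match.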

+22

Square brackets do not indicate a group; they indicate a character set, which matches a single character (any one of those in the brackets). As documented, "special characters lose their special meaning within sets" (except for character classes such as \s).

If you want to match \s or the end of the string, use something like \s|$.
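The same idea should carry over to \Z from the question. The sketch below assumes Python 3; the try/except is there because I would not rely on the exact behaviour or error message of the set version across interpreter versions.

```python
import re

text = "green pears"

# Outside a set, \Z is a valid anchor and can be one branch of an alternation.
matches = re.findall(r"\w+(?:\s|\Z)", text)
print(matches)  # ['green ', 'pears']

# Inside a set, \Z is not an anchor; the pattern may be rejected at compile time.
try:
    re.compile(r"\w+[\s\Z]")
    set_rejected = False
except re.error as exc:
    set_rejected = True
    print("rejected:", exc)
```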

+3

Martijn Pieters' answer is correct. To elaborate a little, if you use a capturing group

 r"\w+(\s|$)" 

You get:

 >>> re.findall("\w+(\s|$)", "green pears")
 [' ', '']

This is because re.findall() returns the contents of the capturing group (\s|$) rather than the whole match.

Parentheses () serve two purposes: grouping and capturing. To group without capturing, use the (?:...) syntax:

 >>> re.findall("\w+(?:\s|$)", "green pears")
 ['green ', 'pears']
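If you would rather keep the capturing group, one possible variation (not from the answers above, and assuming Python 3) is to use re.finditer() and read the whole match from each match object:

```python
import re

# Illustrative variation: keep the capturing group, but read the whole match.
pattern = re.compile(r"\w+(\s|$)")

# group(0) is the entire match, regardless of any capturing groups.
whole_matches = [m.group(0) for m in pattern.finditer("green pears")]
print(whole_matches)  # ['green ', 'pears']

# group(1) is what re.findall() would have returned for this pattern.
captured = [m.group(1) for m in pattern.finditer("green pears")]
print(captured)       # [' ', '']
```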