How to match letters, including accented ones but not digits, with regex in Python?

I need some help with regular expressions in Python. I tried installing the regex library for Python, which apparently allows POSIX character classes in Python regular expressions, but it seems that the [:alpha:] class does not include Unicode characters. For example:

 >>> re.search(r'[[:alpha:] ]+','Please work blåbær and NOW stop 123').group(0)
 'Please work bl'

whereas I want it to match: Please work blåbær and NOW stop

EDIT: I am using Python 2.7

EDIT 2: I tried the following:

 >>> re.search(re.compile('[\w ]+', re.UNICODE),'Please work blåbær and NOW stop 123').group(0)
 'Please work bl\xc3'

Not quite what I wanted (I also want to match the part after the first non-ASCII character), but at least it matched one more byte than before. What should I do to make it match the rest of what I want?

EDIT 3: I don't want to match non-word characters. By "word character" I mean a-z, A-Z, space, and any accented variants of those letters. I hope I am making myself clear; in a phrase like

 lets match força, but stop before that comma 

I want to match only: lets match força

EDIT 4: So I tried using Python 3 just for this script:

 >>> re.search(re.compile('[\w ]+', re.UNICODE),'lets match força, but stop before that comma').group(0)
 'lets match força'

I think it works for the most part in Python 3, except that it also matches numbers (which I definitely don't want) and underscores. Any way to fix this, in Python 2 or 3?

1 answer

It is not clear which version of Python you are using. If you are using 2.x, then there may be a problem with Unicode. See this post for further pointers, and feel free to update your question with more details.
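To illustrate the Unicode problem hinted at above, here is a rough sketch written for Python 3 (an assumption; the question itself uses 2.7). It emulates what probably happens when a Python 2 byte string is matched: each UTF-8 byte is treated as a separate character, so the two bytes of å are seen as "Ã" (a letter) followed by "¥" (not a letter). The variable names are purely illustrative.

```python
import re

text = 'Please work blåbær and NOW stop 123'

# Python 3 str is Unicode text, so \w covers accented letters
# (and, unfortunately for the asker, digits and underscores too).
as_text = re.search(r'[\w ]+', text).group(0)

# Emulate a Python 2 byte string: one "character" per UTF-8 byte.
# 'å' is the bytes C3 A5, which show up here as 'Ã' and '¥'.
byte_view = text.encode('utf-8').decode('latin-1')
as_bytes = re.search(r'[\w ]+', byte_view).group(0)

print(repr(as_text))
print(repr(as_bytes))
```

The second match stops in the middle of å, which mirrors the 'Please work bl\xc3' result in the question. In Python 2, the usual remedy is to pass a unicode object (for example a u'...' literal) as the subject string, so that the engine sees whole characters instead of bytes.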

I'm rather surprised that I cannot convert the accented characters to the correct Unicode representation ...

but there is a workaround:

 re.search(re.compile('((\w+\s)|(\w+\W+\w+\s))+', re.UNICODE), ur'Please work blåbær and NOW stop 123').group(0) 

or

 re.search(re.compile('\D+', re.UNICODE), ur'Please work blåbær and NOW stop 123').group(0) 
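As a possible alternative to the two patterns above (not part of the original answer), the character class [^\W\d_] is a common way to say "a word character that is neither a digit nor an underscore", which leaves only letters, accented ones included. The sketch below assumes Python 3; the helper name first_run is hypothetical. In Python 2, both the pattern and the text would need to be unicode objects for this to work.

```python
import re

# [^\W\d_] : word character, minus digits, minus underscore = letters only.
# The alternation with a literal space lets the run span several words.
LETTERS_AND_SPACES = re.compile(r'(?:[^\W\d_]| )+', re.UNICODE)

def first_run(text):
    """Return the first run of letters and spaces in text, or '' if none."""
    match = LETTERS_AND_SPACES.search(text)
    return match.group(0) if match else ''

print(repr(first_run('lets match força, but stop before that comma')))
print(repr(first_run('Please work blåbær and NOW stop 123')))
```

Note that the second result keeps the space before "123", since spaces are part of the allowed set; call .rstrip() on the result if that trailing space is unwanted.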