How to match punctuation with regex in Python?

Question


I've seen solutions to this problem, but not for Python! I tried installing the regex library for Python, which apparently allows the use of POSIX expressions in Python regular expressions, but it seems that the [:alpha:] class does not include Unicode characters. For example:

 >>> re.search(r'[[:alpha:] ]+','Please work blåbær and NOW stop 123').group(0)
 'Please work bl'

When I want it to match Please work blåbær and NOW stop

EDIT: I am using Python 2.7

EDIT 2: I tried the following:

 >>> re.search(re.compile('[\w ]+', re.UNICODE),'Please work blåbær and NOW stop 123').group(0)
 'Please work bl\xc3'

Not quite what I wanted (I also want to match the part after the first non-ASCII character), but at least it matched one more character than before. What should I do to make it match the rest of what I want?

EDIT 3: I don't want to match non-word characters; by "word" characters I mean a-z, A-Z, space, and any accented variations of word characters. I hope I'm making myself clear; in a phrase like

 lets match força, but stop before that comma

I want to match only lets match força

EDIT 4: So I tried using Python 3 just for this script:

 >>> re.search(re.compile('[\w ]+', re.UNICODE),'lets match força, but stop before that comma').group(0)
 'lets match força'

I think it works for the most part in Python 3, except that it also matches numbers (which I definitely don't want) and underscores. Any way to fix this, in Python 2 or 3?

+7

python regex unicode non-ascii-characters

wrongusername Nov 07 '12 at 1:01


1 answer

Don question · Answer 1 · 2012-11-07T01:15:07+0000

It is not clear which version of Python you are using. If you are using 2.x, then there may be a problem with unicode. See this post for further pointers, and feel free to update your question with more details.
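To illustrate what I suspect is going on in 2.x, here is a minimal sketch. It is written in Python 3 syntax so it runs on current interpreters, and the variable names are my own. The assumption is that a plain Python 2 string literal holds UTF-8 encoded bytes, so å is two bytes and the regex sees bytes rather than characters:

```python
import re

text = 'Please work blåbær and NOW stop 123'

# Assumption: this is roughly what a Python 2 str literal would hold
# when the terminal or source file is UTF-8 encoded.
raw = text.encode('utf-8')

# On bytes, \w only covers ASCII letters, digits and underscore,
# so the match ends at the first byte of the encoded 'å'.
byte_match = re.search(rb'[\w ]+', raw).group(0)

# On decoded text, \w is Unicode-aware and 'å' / 'æ' count as word characters.
text_match = re.search(r'[\w ]+', raw.decode('utf-8')).group(0)

print(byte_match)
print(text_match)
```

The cut-off here is analogous to, though not byte-for-byte the same as, the 'Please work bl\xc3' result in the question. The point is only that matching on decoded text behaves differently from matching on encoded bytes.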

I'm quite surprised that I cannot convert the accented characters to the correct unicode representation ...

but there is a workaround:

 re.search(re.compile('((\w+\s)|(\w+\W+\w+\s))+', re.UNICODE), ur'Please work blåbær and NOW stop 123').group(0)

or

 re.search(re.compile('\D+', re.UNICODE), ur'Please work blåbær and NOW stop 123').group(0)
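Note that \D+ also matches commas and other punctuation, so it would not stop at the comma in the força example. For the goal stated in the question (letters, including accented ones, plus spaces, but no digits or underscores), one option is the character class [^\W\d_], which reads as "a word character that is neither a digit nor an underscore". The following is a sketch for Python 3, where str patterns are Unicode-aware by default; the helper name leading_words is my own. I would expect the same pattern to work in Python 2 with unicode strings and re.UNICODE, but I have not verified that.

```python
import re

# [^\W\d_] excludes non-word characters, digits and the underscore,
# which leaves only letters (including accented ones on str input).
LETTERS_AND_SPACES = re.compile(r'(?:[^\W\d_]| )+')

def leading_words(text):
    """Return the first run of letters and spaces, without surrounding spaces."""
    m = LETTERS_AND_SPACES.search(text)
    # The run can end with the space that precedes a digit or comma, so strip it.
    return m.group(0).strip() if m else ''

print(leading_words('lets match força, but stop before that comma'))
print(leading_words('Please work blåbær and NOW stop 123'))
```

The strip() call is a design choice: without it the second example would keep the space before 123.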
