Solutions to this problem seem to exist for every language except Python! I tried installing the regex library for Python, which apparently allows POSIX character classes in Python regular expressions, but I guess its [:alpha:] class does not include Unicode characters. For example:
>>> re.search(r'[[:alpha:] ]+','Please work blåbær and NOW stop 123').group(0)
'Please work bl'
What I want it to match is: Please work blåbær and NOW stop
EDIT: I am using Python 2.7
EDIT 2: I tried the following:
>>> re.search(re.compile('[\w ]+', re.UNICODE),'Please work blåbær and NOW stop 123').group(0)
'Please work bl\xc3'
Not quite what I wanted (I want it to match the part after the first non-ASCII character too), but at least it matched one more character than before. What should I do to make it match the rest of what I want?
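My guess at where the stray \xc3 comes from, as a sketch rather than a confirmed diagnosis: in Python 2 the subject is a byte string, so the regex sees the two UTF-8 bytes of å separately, and only the first one (\xc3) happens to count as a word character. The snippet below reproduces that effect in Python 3 by viewing the UTF-8 bytes one byte per character; the Latin-1 round trip is only my way of simulating a byte string and is an assumption, not something from the regex or re documentation.

```python
import re

text = 'Please work blåbær and NOW stop 123'

# Assumption: simulate a Python 2 byte string by treating every UTF-8 byte
# as its own character (Latin-1 maps each byte to the same code point).
as_bytes_view = text.encode('utf-8').decode('latin-1')
byte_match = re.search(r'[\w ]+', as_bytes_view).group(0)

# The same pattern applied to the real text (a Unicode string).
text_match = re.search(r'[\w ]+', text).group(0)

print(repr(byte_match.encode('latin-1')))
print(repr(text_match))
```

If this guess is right, matching against a Unicode string instead of a byte string should be what lets the match continue past the å.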
EDIT 3: I don't want to match non-word characters. By "word" characters I mean a-z, A-Z, space, and any accented variations of those letters. I hope I am getting my idea across; in a phrase like
lets match força, but stop before that comma
I want to match only lets match força
EDIT 4: So I tried using Python 3 just for this script:
>>> re.search(re.compile('[\w ]+', re.UNICODE),'lets match força, but stop before that comma').group(0)
'lets match força'
I think it works for the most part in Python 3, except that it also matches numbers (which I definitely don't want) and underscores. Any way to fix this, in Python 2 or 3?
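For anyone landing here, one pattern that seems to fit, offered as a minimal sketch rather than a definitive answer: the class [^\W\d_] reads as "a word character that is neither a digit nor an underscore", which in Python 3 leaves only letters, accented ones included. The helper name leading_words is hypothetical and only there for illustration.

```python
import re

# [^\W\d_] = word characters minus digits and underscores, i.e. letters.
# The space is added through alternation, because placing it inside the
# negated class would exclude spaces instead of allowing them.
LETTERS_AND_SPACES = re.compile(r'(?:[^\W\d_]| )+')


def leading_words(text):
    """Hypothetical helper: return the first run of letters and spaces."""
    match = LETTERS_AND_SPACES.search(text)
    return match.group(0) if match else ''


print(leading_words('lets match força, but stop before that comma'))
print(leading_words('Please work blåbær and NOW stop 123'))
```

The second result keeps the space that precedes 123, so a .strip() may be wanted. I would expect the same class to behave this way in Python 2 when both the pattern and the text are unicode strings and re.UNICODE is set, but I have not verified that.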