Python regex \w not matching combining diacritics?

I have a UTF-8 string that contains combining diacritics. I want to match it with the \w regex sequence. It matches precomposed accented characters, but not a Latin character followed by a combining diacritic.

    >>> re.match("a\w\w\wz", u"aoooz", re.UNICODE)
    <_sre.SRE_Match object at 0xb7788f38>
    >>> print u"ao\u00F3oz"
    aoóoz
    >>> re.match("a\w\w\wz", u"ao\u00F3oz", re.UNICODE)
    <_sre.SRE_Match object at 0xb7788f38>
    >>> re.match("a\w\w\wz", u"aoo\u0301oz", re.UNICODE)
    >>> print u"aoo\u0301oz"
    aóooz

(It looks like the SO markdown processor has trouble rendering the combining diacritic in the above, but there is one on the last line.)

In any case, is there a way to get combining diacritics matched by \w? I do not want to normalize the text, because it comes from a file name, and I do not want to do a full Unicode normalization of file names yet. This is Python 2.5.

+7
python regex unicode diacritics unicode-normalization
2 answers

I just noticed a new "regex" package on PyPI. (If I understand correctly, it is a test version of a new package that will someday replace the stdlib re package.)

It looks like it has (among other things) more features for Unicode. For example, it supports \X, which matches a single grapheme (whether or not it uses combining marks). It also supports matching on Unicode properties, blocks, and scripts, so you can use \p{M} to refer to combining marks. The \X mentioned above is equivalent to \P{M}\p{M}* (a character that is NOT a combining mark, followed by zero or more combining marks).

Note that this makes \X more or less the Unicode equivalent of . and not of \w, so in your case \w\p{M}* is what you need.
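
As a rough illustration of the \w\p{M}* idea, here is a minimal sketch. It assumes the third-party regex package is installed, and it uses Python 3 string literals rather than the Python 2.5 syntax from the question. The names LETTER, PATTERN, and decomposed are made up for the example, and the import is guarded only so the snippet still loads when the package is missing.

```python
# Assumption: the third-party "regex" package is available (pip install regex).
try:
    import regex
except ImportError:  # let the snippet load even without the package
    regex = None

# 'o' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
decomposed = "aoo\u0301oz"

# One "letter": a word character plus any combining marks that follow it.
LETTER = r"\w\p{M}*"
PATTERN = r"a(?:%s){3}z" % LETTER

if regex is not None:
    # True if the three letters between 'a' and 'z' are matched
    print(regex.match(PATTERN, decomposed) is not None)
    # \X splits the string into grapheme clusters
    print(regex.findall(r"\X", decomposed))
```

Grouping the letter as (?:\w\p{M}*) keeps the repeat count tied to visible letters rather than to code points.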

It is (for now) a non-stdlib package, and I don't know how ready it is (and it does not come as a binary distribution), but you might want to give it a try, as it seems to be the easiest / most "correct" answer to your question. (Otherwise, I think you are down to using character ranges explicitly, as described in my comment on the previous answer.)
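
For the explicit character range fallback, something along these lines might work with only the stdlib re module. This is a sketch under assumptions: it is written for Python 3, it only treats the Combining Diacritical Marks block (U+0300 to U+036F) as combining, and the names COMBINING, LETTER, and pattern are hypothetical.

```python
import re

# Assumption: only marks from the Combining Diacritical Marks block
# (U+0300-U+036F) need to be handled; other combining ranges are ignored.
COMBINING = "[\u0300-\u036f]"

# One "letter": a word character followed by zero or more combining marks.
LETTER = r"\w" + COMBINING + "*"
pattern = re.compile("a(?:" + LETTER + "){3}z")

for text in ("aoooz", "ao\u00f3oz", "aoo\u0301oz"):
    # Report whether each variant (plain, precomposed, decomposed) matches
    print(repr(text), pattern.match(text) is not None)
```

The original file name is never modified here; the combining marks are just allowed to trail each \w.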

See also this page on Unicode regular expressions; it may be useful to you, and it can serve as documentation for some of the things implemented in the regex package.

+5

You can use unicodedata.normalize to compose a base character and its combining diacritics into a single precomposed Unicode character.

    >>> import re
    >>> from unicodedata import normalize
    >>> re.match(u"a\w\w\wz", normalize("NFC", u"aoo\u0301oz"), re.UNICODE)
    <_sre.SRE_Match object at 0x00BDCC60>

I know you said you didn't want to normalize, but I don't think that is a problem with this solution: you only normalize the string you match against, and you do not have to change the file name itself or anything else.
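
To make the "normalize only for matching" point concrete, here is a small sketch, assuming Python 3 and a hypothetical helper called matching_names. It tests the NFC form of each name but returns the names exactly as they were given. Keep in mind that NFC can only help where a precomposed character exists for the base letter and mark.

```python
import re
from unicodedata import normalize

PATTERN = re.compile(r"a\w\w\wz")

def matching_names(names):
    """Return the original, unmodified names whose NFC form matches PATTERN."""
    return [name for name in names if PATTERN.match(normalize("NFC", name))]

# Hypothetical sample names: decomposed accent, plain ASCII, and a non-match
names = ["aoo\u0301oz", "aoooz", "abz"]
result = matching_names(names)
print([ascii(name) for name in result])
```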

+1