I just noticed a new regex package on PyPI. (If I understand correctly, it is a test version of a new package that will someday replace the stdlib re package.)
It looks like it has (among other things) more Unicode-related features. For example, it supports \X, which matches a single grapheme (whether or not it uses combining marks). It also supports matching on Unicode properties, blocks, and scripts, so you can use \p{M} to refer to combining marks. The \X mentioned above is equivalent to \P{M}\p{M}* (a character that is NOT a combining mark, followed by zero or more combining marks).
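To make the \P{M}\p{M}* definition concrete, here is a minimal stdlib-only sketch that groups a string into "base character plus following combining marks" units. The helper names (is_mark, clusters) are made up for illustration, and this only approximates what \X does as described above; with the regex package installed, regex.findall(r"\X", text) should give the same grouping for this input.

```python
import unicodedata

def is_mark(ch):
    # Unicode general categories Mn, Mc and Me are what \p{M} refers to.
    return unicodedata.category(ch).startswith("M")

def clusters(text):
    # Emulates \P{M}\p{M}*: a non-mark character followed by any marks.
    # Assumption: a mark with no preceding base starts its own cluster here,
    # whereas the strict pattern would not match it at all.
    out = []
    for ch in text:
        if is_mark(ch) and out:
            out[-1] += ch
        else:
            out.append(ch)
    return out

decomposed = "cafe\u0301"      # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(decomposed))         # 5 code points
print(len(clusters(decomposed)))  # 4 units
```

The last unit is the two-code-point sequence 'e' + U+0301, which is exactly the case a plain . would split in two.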
Note that this makes \X more or less the Unicode equivalent of . (any character), not of \w, so in your case \w\p{M}* is what you need.
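As a rough illustration of why \w alone is not enough, the sketch below compares the stdlib re module's \w+ with a hand-rolled emulation of (?:\w\p{M}*)+. It assumes Python 3 and uses only the stdlib; the function name words is hypothetical. With the regex package, the pattern itself could be used directly instead of this loop.

```python
import re
import unicodedata

def words(text):
    # Emulates (?:\w\p{M}*)+ : word characters, each optionally followed
    # by combining marks, collected into runs.
    result, current = [], ""
    for ch in text:
        if re.match(r"\w", ch):
            current += ch
        elif current and unicodedata.category(ch).startswith("M"):
            # A combining mark attaches to the word character before it.
            current += ch
        else:
            if current:
                result.append(current)
            current = ""
    if current:
        result.append(current)
    return result

text = "na\u0303o sei"           # 'a' + U+0303 COMBINING TILDE
print(re.findall(r"\w+", text))  # stdlib \w stops at the combining mark
print(words(text))               # the mark stays attached to its word
```

In stdlib re the combining tilde is not a word character, so the first word is split around it; the emulation keeps it as one token.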
This is (for now) a non-stdlib package, and I don't know how ready it is (it is also not available as a binary distribution), but you might want to give it a try, as it seems to be the easiest / most "correct" answer to your question. (Otherwise, I think you are down to using character ranges explicitly, as described in my comment on the previous answer.)
See also this page on Unicode regular expressions; it may be useful to you, and it can serve as documentation for some of the things implemented in the regex package.
Steven