I am matching identifiers, but now I have a problem: my identifiers are allowed to contain Unicode characters. Therefore, the old way to do something is not enough:
t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*"
In my parser markup language, I match Unicode characters, resolving all characters except the ones I explicitly use, because my markup language has only two or three of the characters that I need to avoid.
How can I match all Unicode characters to python and ply regular expressions? Also is this a good idea at all?
I want people to use identifiers like Ω "" ° foo² väli π as identifiers (variable names, etc.) in their programs. Heck! I want people to be able to write programs in their own language, if practical! In any case, unicode is currently supported in a variety of places, and it should be distributed.
Edit: POSIX character classes don't seem to be recognized by python regular expressions.
>>> import re >>> item = re.compile(r'[[:word:]]') >>> print item.match('e') None
Edit: It’s better to explain what I need. I will need a regular expression that matches all characters typing unicode, but not ASCII characters at all.
Edit: r "\ w" does a bit of the thing that I want, but it does not match "", and I also need a regular expression that does not match numbers.
source share