Unicode match in ply regular expressions

I am matching identifiers, but now I have a problem: my identifiers are allowed to contain Unicode characters. Therefore the old way of doing it is no longer enough:

t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*" 

In my markup-language parser I match Unicode characters by consuming everything except the characters I explicitly use, because my markup language has only two or three characters that need to be avoided that way.

How can I match all Unicode characters with Python and ply regular expressions? And is this a good idea at all?

I want people to be able to use identifiers like Ω "" ° foo² väli π as identifiers (variable names and such) in their programs. Heck! I want people to be able to write programs in their own language, if practical! In any case, Unicode is widely supported nowadays, and it should spread.

Edit: POSIX character classes don't seem to be recognized by python regular expressions.

    >>> import re
    >>> item = re.compile(r'[[:word:]]')
    >>> print item.match('e')
    None

Edit: It’s better that I explain what I need. I need a regular expression that matches all Unicode letter characters, but no ASCII characters at all.

Edit: r"\w" does part of what I want, but it does not match "", and I also need a regular expression that does not match numbers.
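For illustration (my sketch, not part of the original question): in Python 3, where patterns are Unicode-aware by default, "word characters minus digits, underscore, and ASCII letters" can be written as a double negation inside a character class. The name `non_ascii_letter` is mine.

```python
import re

# Word characters (\w) excluding digits (\d), underscore, and the
# ASCII letters a-z/A-Z: what remains is non-ASCII letters only.
non_ascii_letter = re.compile(r'[^\W\d_a-zA-Z]')

print(non_ascii_letter.match('Ω'))   # matches
print(non_ascii_letter.match('a'))   # None: ASCII letter excluded
print(non_ascii_letter.match('9'))   # None: digit excluded
```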

5 answers

The re module supports the \w syntax, which:

If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

so the following shows how to match Unicode identifiers:

    >>> import re
    >>> m = re.compile('(?u)[^\W0-9]\w*')
    >>> m.match('a')
    <_sre.SRE_Match object at 0xb7d75410>
    >>> m.match('9')
    >>> m.match('ab')
    <_sre.SRE_Match object at 0xb7c258e0>
    >>> m.match('a9')
    <_sre.SRE_Match object at 0xb7d75410>
    >>> m.match('unicöde')
    <_sre.SRE_Match object at 0xb7c258e0>
    >>> m.match('ödipus')
    <_sre.SRE_Match object at 0xb7d75410>

So the expression you are looking for is: (?u)[^\W0-9]\w*
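As a quick check of this expression (my sketch): in Python 3 the (?u) flag is implied for str patterns, and \d can replace 0-9 (in Unicode mode \d is slightly broader, covering non-ASCII digits too). fullmatch verifies that an entire string is a valid identifier.

```python
import re

# First character: a word character that is not a digit;
# remaining characters: any word characters.
ident = re.compile(r'[^\W\d]\w*')

for s in ['väli', 'foo2', '2foo', 'π', 'a b']:
    print(s, bool(ident.fullmatch(s)))
```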


You need to pass the reflags parameter to the lex.lex() call:

 lex.lex(reflags=re.UNICODE) 
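For context: ply joins all the token rules into one master regular expression with named groups, and reflags is handed straight to re.compile. A sketch of the effect, using plain re so it runs without ply (the rule and names are mine, and re.UNICODE is the default for str patterns in Python 3 anyway):

```python
import re

# A token rule as ply would see it...
t_IDENTIFIER = r'[^\W\d]\w*'

# ...compiled the way ply compiles it: wrapped in a named group,
# with the flags passed via reflags.
master = re.compile('(?P<IDENTIFIER>%s)' % t_IDENTIFIER, re.UNICODE)

tok = master.match('väli = 1')
print(tok.lastgroup, tok.group())   # IDENTIFIER väli
```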

Check the answers to this question.

Removing non-printable characters from a string in python

You just need to use other Unicode character categories instead.
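For example (my sketch): the standard unicodedata module reports each character's two-letter Unicode category, and all letter categories start with 'L' (Lu, Ll, Lt, Lm, Lo), so filtering for letters is a one-liner.

```python
import unicodedata

def is_letter(ch):
    # Letter categories are Lu, Ll, Lt, Lm, Lo -- all start with 'L'.
    return unicodedata.category(ch).startswith('L')

print([ch for ch in 'Ω9_ä°' if is_letter(ch)])   # ['Ω', 'ä']
```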


Solved it with the help of Vinko's answer.

I realized that enumerating the whole Unicode range would be just dumb, so I'll do this instead:

    symbols = re.escape(''.join([chr(i) for i in xrange(33, 127) if not chr(i).isalnum()]))
    symnums = re.escape(''.join([chr(i) for i in xrange(33, 127) if not chr(i).isalpha()]))
    t_IDENTIFIER = "[^%s](\\.|[^%s])*" % (symnums, symbols)

I don't know about Unicode character classes. If this Unicode stuff starts getting too complicated, I can just put the original rule back. Keeping UTF-8 support in STRING tokens is more important anyway.

Edit: On the other hand, I am beginning to understand why programming languages do not have much Unicode support. This is an ugly hack, not a satisfying solution.
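A Python 3 re-creation of the hack above, for illustration (my adaptation: xrange becomes range, and the second class excludes digits as well as symbols so identifiers can't start with a number). Note the pattern works by exclusion, so anything outside the ASCII 33-126 range, including whitespace and control characters, slips through as an identifier character, which is part of why it is an ugly hack.

```python
import re

# ASCII symbols (non-alphanumeric printable chars)...
symbols = re.escape(''.join(chr(i) for i in range(33, 127) if not chr(i).isalnum()))
# ...and symbols plus digits, to forbid a digit as the first character.
symnums = re.escape(''.join(chr(i) for i in range(33, 127) if not chr(i).isalpha()))

# First char: anything but a symbol or digit; rest: anything but a symbol.
IDENTIFIER = re.compile("[^%s](\\.|[^%s])*" % (symnums, symbols))

print(bool(IDENTIFIER.match('väli')))   # True
print(bool(IDENTIFIER.match('9abc')))   # False: starts with a digit
```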


Perhaps POSIX character classes are what you need?

