Unicode match in ply regular expressions

I am matching identifiers, but now I have a problem: my identifiers are allowed to contain Unicode characters. Therefore the old way of doing it is no longer enough:

t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*" 

In my markup-language parser I match Unicode characters by consuming everything except the characters I explicitly use, because my markup language has only two or three characters that need to be avoided that way.

How can I match all Unicode characters with Python and ply regular expressions? And is this a good idea at all?

I want people to be able to use identifiers like Ω "" ° foo² väli π as identifiers (variable names and such) in their programs. Heck! I want people to be able to write programs in their own language, if practical! In any case, Unicode is widely supported nowadays, and it should spread.

Edit: POSIX character classes don't seem to be recognized by python regular expressions.

    >>> import re
    >>> item = re.compile(r'[[:word:]]')
    >>> print item.match('e')
    None

Edit: It’s better that I explain what I need. I need a regular expression that matches all Unicode letter characters, but no ASCII characters at all.

Edit: r"\w" does part of what I want, but it does not match "", and I also need a regular expression that does not match numbers.
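For illustration (my sketch, not part of the original question): in Python 3, where patterns are Unicode-aware by default, "word characters minus digits, underscore, and ASCII letters" can be written as a double negation inside a character class. The name `non_ascii_letter` is mine.

```python
import re

# Word characters (\w) excluding digits (\d), underscore, and the
# ASCII letters a-z/A-Z: what remains is non-ASCII letters only.
non_ascii_letter = re.compile(r'[^\W\d_a-zA-Z]')

print(non_ascii_letter.match('Ω'))   # matches
print(non_ascii_letter.match('a'))   # None: ASCII letter excluded
print(non_ascii_letter.match('9'))   # None: digit excluded
```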

5 answers

The re module supports the \w syntax, which:

If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

so the following shows how to match Unicode identifiers:

    >>> import re
    >>> m = re.compile('(?u)[^\W0-9]\w*')
    >>> m.match('a')
    <_sre.SRE_Match object at 0xb7d75410>
    >>> m.match('9')
    >>> m.match('ab')
    <_sre.SRE_Match object at 0xb7c258e0>
    >>> m.match('a9')
    <_sre.SRE_Match object at 0xb7d75410>
    >>> m.match('unicöde')
    <_sre.SRE_Match object at 0xb7c258e0>
    >>> m.match('ödipus')
    <_sre.SRE_Match object at 0xb7d75410>

So the expression you are looking for is: (?u)[^\W0-9]\w*
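As a quick check of this expression (my sketch): in Python 3 the (?u) flag is implied for str patterns, and \d can replace 0-9 (in Unicode mode \d is slightly broader, covering non-ASCII digits too). fullmatch verifies that an entire string is a valid identifier.

```python
import re

# First character: a word character that is not a digit;
# remaining characters: any word characters.
ident = re.compile(r'[^\W\d]\w*')

for s in ['väli', 'foo2', '2foo', 'π', 'a b']:
    print(s, bool(ident.fullmatch(s)))
```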


You need to pass the reflags parameter to the lex.lex() call:

 lex.lex(reflags=re.UNICODE) 
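For context: ply joins all the token rules into one master regular expression with named groups, and reflags is handed straight to re.compile. A sketch of the effect, using plain re so it runs without ply (the rule and names are mine, and re.UNICODE is the default for str patterns in Python 3 anyway):

```python
import re

# A token rule as ply would see it...
t_IDENTIFIER = r'[^\W\d]\w*'

# ...compiled the way ply compiles it: wrapped in a named group,
# with the flags passed via reflags.
master = re.compile('(?P<IDENTIFIER>%s)' % t_IDENTIFIER, re.UNICODE)

tok = master.match('väli = 1')
print(tok.lastgroup, tok.group())   # IDENTIFIER väli
```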

Check the answers to this question.

Removing non-printable characters from a string in python

You just need to use other Unicode character categories instead.
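For example (my sketch): the standard unicodedata module reports each character's two-letter Unicode category, and all letter categories start with 'L' (Lu, Ll, Lt, Lm, Lo), so filtering for letters is a one-liner.

```python
import unicodedata

def is_letter(ch):
    # Letter categories are Lu, Ll, Lt, Lm, Lo -- all start with 'L'.
    return unicodedata.category(ch).startswith('L')

print([ch for ch in 'Ω9_ä°' if is_letter(ch)])   # ['Ω', 'ä']
```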


Solved it with the help of Vinko's answer.

I realized that enumerating the whole Unicode range would be just dumb, so I'll do this instead:

    symbols = re.escape(''.join([chr(i) for i in xrange(33, 127) if not chr(i).isalnum()]))
    symnums = re.escape(''.join([chr(i) for i in xrange(33, 127) if not chr(i).isalpha()]))
    t_IDENTIFIER = "[^%s](\\.|[^%s])*" % (symnums, symbols)

I don't know about Unicode character classes. If this Unicode stuff starts getting too complicated, I can just put the original rule back. Keeping UTF-8 support in STRING tokens is more important anyway.

Edit: On the other hand, I am beginning to understand why programming languages do not have much Unicode support. This is an ugly hack, not a satisfying solution.
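A Python 3 re-creation of the hack above, for illustration (my adaptation: xrange becomes range, and the second class excludes digits as well as symbols so identifiers can't start with a number). Note the pattern works by exclusion, so anything outside the ASCII 33-126 range, including whitespace and control characters, slips through as an identifier character, which is part of why it is an ugly hack.

```python
import re

# ASCII symbols (non-alphanumeric printable chars)...
symbols = re.escape(''.join(chr(i) for i in range(33, 127) if not chr(i).isalnum()))
# ...and symbols plus digits, to forbid a digit as the first character.
symnums = re.escape(''.join(chr(i) for i in range(33, 127) if not chr(i).isalpha()))

# First char: anything but a symbol or digit; rest: anything but a symbol.
IDENTIFIER = re.compile("[^%s](\\.|[^%s])*" % (symnums, symbols))

print(bool(IDENTIFIER.match('väli')))   # True
print(bool(IDENTIFIER.match('9abc')))   # False: starts with a digit
```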


Perhaps POSIX character classes are what you need?

