Problem with Ply Lex analysis

I am using PLY as my lexer. My token specifications are as follows:

    t_WHILE = r'while'
    t_THEN = r'then'
    t_ID = r'[a-zA-Z_][a-zA-Z0-9_]*'
    t_NUMBER = r'\d+'
    t_LESSEQUAL = r'<='
    t_ASSIGN = r'='
    t_ignore = r' \t'

When I try to parse the following line:

 "while n <= 0 then h = 1" 

It gives the following output:

    LexToken(ID,'while',1,0)
    LexToken(ID,'n',1,6)
    LexToken(LESSEQUAL,'<=',1,8)
    LexToken(NUMBER,'0',1,11)
    LexToken(ID,'hen',1,14)    ------> PROBLEM!
    LexToken(ID,'h',1,18)
    LexToken(ASSIGN,'=',1,20)
    LexToken(NUMBER,'1',1,22)

It does not recognize the THEN token; instead, it takes "hen" as an identifier.

Any ideas?


The reason this doesn't work is the way PLY orders token rules: for tokens defined as plain strings, the longest regular expression is tried first, so the ID pattern matches 'while' before the WHILE rule ever gets a chance.

The easiest way to avoid the problem is to match identifiers and reserved words with a single rule and pick the appropriate token type based on the matched value. The following code is similar to the example in the PLY documentation:

    import ply.lex

    tokens = ['ID', 'NUMBER', 'LESSEQUAL', 'ASSIGN']

    reserved = {
        'while': 'WHILE',
        'then': 'THEN',
    }

    tokens += reserved.values()

    t_ignore = ' \t'
    t_NUMBER = r'\d+'
    t_LESSEQUAL = r'<='
    t_ASSIGN = r'='

    def t_ID(t):
        r'[a-zA-Z_][a-zA-Z0-9_]*'
        if t.value in reserved:
            t.type = reserved[t.value]
        return t

    def t_error(t):
        print('Illegal character')
        t.lexer.skip(1)

    lexer = ply.lex.lex()
    lexer.input("while n <= 0 then h = 1")
    while True:
        tok = lexer.token()
        if not tok:
            break
        print(tok)

PLY sorts tokens declared as simple strings by decreasing regular-expression length, but tokens declared as functions are tried first, in the order they are defined.

From the docs:

When building the master regular expression, rules are added in the following order:

  • All tokens defined by functions are added in the same order as they appear in the lexer file.
  • Tokens defined by strings are added next, sorted in order of decreasing regular expression length (longer expressions are added first).
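The effect of this length-based sort can be sketched in plain Python. This is a simplified model of the ordering, not PLY's actual internals; the rule names mirror the question's specification:

```python
import re

# String-defined rules from the question: (name, pattern)
string_rules = [
    ('WHILE', r'while'),
    ('THEN', r'then'),
    ('ID', r'[a-zA-Z_][a-zA-Z0-9_]*'),
]

# PLY sorts string-defined rules by decreasing pattern length
# before building its master regular expression.
ordered = sorted(string_rules, key=lambda r: len(r[1]), reverse=True)
print([name for name, _ in ordered])  # ID comes first

# So when matching "while", the ID alternative is tried first and wins.
master = re.compile('|'.join('(?P<%s>%s)' % r for r in ordered))
m = master.match('while n <= 0')
print(m.lastgroup)  # 'ID', not 'WHILE'
```

Because the ID pattern (27 characters) is longer than 'while' (5 characters), it ends up first in the master expression and swallows the keyword.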

So an alternative solution is simply to define the tokens you want to prioritize as functions instead of strings, for example:

    def t_WHILE(t):
        r'while'
        return t

    def t_THEN(t):
        r'then'
        return t

    t_ID = r'[a-zA-Z_][a-zA-Z0-9_]*'
    t_NUMBER = r'\d+'
    t_LESSEQUAL = r'<='
    t_ASSIGN = r'='
    t_ignore = ' \t'

This way WHILE and THEN are the first rules to be added, and you get the expected behavior.

As a side note, you used r' \t' (a raw string) for t_ignore, so Python kept the backslash as a literal character. t_ignore is a plain set of characters to skip, not a regular expression, so with the raw string the lexer was ignoring space, backslash, and the literal character 't'; that is why the 't' of "then" was skipped and "hen" was tokenized as an identifier. It should be an ordinary string, as in the example above.
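The difference between the two strings is easy to check interactively; this snippet just compares them:

```python
# r' \t' is three characters: space, backslash, 't'
raw = r' \t'
# ' \t' is two characters: space, tab
plain = ' \t'

print(list(raw))    # [' ', '\\', 't']
print(list(plain))  # [' ', '\t']

# PLY treats t_ignore as a set of characters to skip, so with the
# raw string every literal 't' in the input is ignored.
print('t' in raw)    # True
print('t' in plain)  # False
```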

