Managing Python PLY lexer states from the parser

I am working on a simple SQL-like query parser, and I need to be able to capture subqueries that can occur literally in certain places. I found lexer states to be the best solution and was able to do a proof of concept using curly braces to mark the start and end. However, the subqueries will be delimited by parentheses, not curly braces, and parentheses can occur in other places as well, so I can't enter the state on every open-paren. This information is readily available to the parser, so I was hoping to call begin and end at the appropriate places in the parser rules. This, however, did not work, because the lexer seems to tokenize the whole stream at once, so the tokens get generated in the INITIAL state. Is there a workaround for this problem? Here is an outline of what I was trying to do:

    def p_value_subquery(p):
        """
        value : start_sub end_sub
        """
        p[0] = "( " + p[2] + " )"

    def p_start_sub(p):
        """
        start_sub : OPAR
        """
        start_subquery(p.lexer)
        p[0] = p[1]

    def p_end_sub(p):
        """
        end_sub : CPAR
        """
        subquery = end_subquery(p.lexer)
        p[0] = subquery

The start_subquery() and end_subquery() functions are defined as follows:

    def start_subquery(lexer):
        lexer.code_start = lexer.lexpos  # Record the starting position
        lexer.level = 1
        lexer.begin('subquery')

    def end_subquery(lexer):
        value = lexer.lexdata[lexer.code_start:lexer.lexpos-1]
        lexer.lineno += value.count('\n')
        lexer.begin('INITIAL')
        return value

The lexer rules for the subquery state are simply there to track nesting and detect the matching close-paren:

    @lex.TOKEN(r"\(")
    def t_subquery_SUBQST(t):
        t.lexer.level += 1

    @lex.TOKEN(r"\)")
    def t_subquery_SUBQEN(t):
        t.lexer.level -= 1

    @lex.TOKEN(r".")
    def t_subquery_anychar(t):
        pass
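
For context, these snippets assume the usual PLY scaffolding around them. Here is a minimal sketch of what that might look like - the 'subquery' state and the OPAR/CPAR names come from the question, everything else is an assumption:

    import sys
    import ply.lex as lex
    import ply.yacc as yacc

    # Declare the exclusive 'subquery' state the snippets switch into:
    # while it is active, only the t_subquery_* rules apply.
    states = (
        ('subquery', 'exclusive'),
    )

    tokens = ('OPAR', 'CPAR')  # plus SELECT, SUBQUERY, etc. for the real grammar

    t_OPAR = r'\('
    t_CPAR = r'\)'
    t_ignore = ' \t'

    def t_newline(t):
        r'\n+'
        t.lexer.lineno += len(t.value)

    def t_error(t):
        print >> sys.stderr, "Illegal character:", t.value[0]
        t.lexer.skip(1)

    # Exclusive states do not inherit t_ignore/t_error, so the subquery
    # state needs its own; newlines fall through to the error rule because
    # '.' in t_subquery_anychar does not match them.
    t_subquery_ignore = ''

    def t_subquery_error(t):
        t.lexer.skip(1)

    lexer = lex.lex()
    # parser = yacc.yacc()  # built after all the p_* rules are defined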

I would be grateful for any help.

3 answers

Based on the PLY author's answer below, I came up with this better solution. I have yet to figure out how to return the subquery as a token, but everything else looks much better, and it no longer needs to be considered a hack.

    def start_subquery(lexer):
        lexer.code_start = lexer.lexpos  # Record the starting position
        lexer.level = 1
        lexer.begin("subquery")

    def end_subquery(lexer):
        lexer.begin("INITIAL")

    def get_subquery(lexer):
        value = lexer.lexdata[lexer.code_start:lexer.code_end-1]
        lexer.lineno += value.count('\n')
        return value

    @lex.TOKEN(r"\(")
    def t_subquery_OPAR(t):
        t.lexer.level += 1

    @lex.TOKEN(r"\)")
    def t_subquery_CPAR(t):
        t.lexer.level -= 1
        if t.lexer.level == 0:
            t.lexer.code_end = t.lexer.lexpos  # Record the ending position
            return t

    @lex.TOKEN(r".")
    def t_subquery_anychar(t):
        pass

    def p_value_subquery(p):
        """
        value : check_subquery_start OPAR check_subquery_end CPAR
        """
        p[0] = "( " + get_subquery(p.lexer) + " )"

    def p_check_subquery_start(p):
        """
        check_subquery_start :
        """
        # Here last_token would be yacc's lookahead.
        if last_token.type == "OPAR":
            start_subquery(p.lexer)

    def p_check_subquery_end(p):
        """
        check_subquery_end :
        """
        # Here last_token would be yacc's lookahead.
        if last_token.type == "CPAR":
            end_subquery(p.lexer)

    last_token = None

    def p_error(p):
        if p is None:
            print >> sys.stderr, "ERROR: unexpected end of query"
        else:
            print >> sys.stderr, "ERROR: Skipping unrecognized token", p.type, "(" + \
                p.value + ") at line:", p.lineno, "and column:", \
                find_column(p.lexer.lexdata, p)  # find_column() as in the PLY docs
            # Just discard the token and tell the parser it's okay.
            yacc.errok()

    def get_token():
        global last_token
        last_token = lexer.token()
        return last_token

    def parse_query(input, debug=0):
        lexer.input(input)
        return parser.parse(input, tokenfunc=get_token, debug=0)
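
With all of that wired into one module, usage would look roughly like this (the query text and the surrounding grammar are assumptions for illustration; only the subquery rules are shown above):

    query = "SELECT a FROM (SELECT b FROM t)"
    print parse_query(query)  # the subquery's raw text comes back via get_subquery()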

This answer may only be partially helpful, but I would also suggest looking at section "6.11 Embedded Actions" of the PLY documentation (http://www.dabeaz.com/ply/ply.html). In a nutshell, it is possible to write grammar rules in which actions occur in the middle of a rule. It would look something like this:

    def p_somerule(p):
        '''somerule : A B possible_sub_query LBRACE sub_query RBRACE'''

    def p_possible_sub_query(p):
        '''possible_sub_query :'''
        ...
        # Check if the last token read was LBRACE. If so, flip lexer state.
        # Sadly, it doesn't seem that the token is easily accessible.
        # Would have to hack it.
        if last_token == 'LBRACE':
            p.lexer.begin('SUBQUERY')

As far as lexing behavior is concerned, only one token of lookahead is used. So, in any particular grammar rule, at most one extra token has already been read. If you are going to flip lexer states, the state change needs to happen after the token has been consumed by the parser, but before the parser asks to read the next incoming token.
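
A quick way to see this timing for yourself - purely a diagnostic sketch, not part of the answer - is to hand yacc a token function that logs every token as it is pulled, which shows exactly when the lookahead is read relative to an embedded action:

    def logging_token_func():
        # Log each token as yacc requests it, to observe when the
        # one-token lookahead is actually read.
        tok = lexer.token()
        print >> sys.stderr, "token read:", tok
        return tok

    parser.parse(data, tokenfunc=logging_token_func)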

Also, if at all possible, I would try to keep the solution out of the yacc() error handling machinery. There is too much black magic in error handling - the more you can avoid it, the better.

I'm a bit pressed for time at the moment, but this seems like something that could be investigated for the next version of PLY. I'll put it on my to-do list.


Since nobody had an answer, it bugged me enough to find a workaround myself, and here is an ugly hack using error recovery and restart().

    def start_subquery(lexer, pos):
        lexer.code_start = lexer.lexpos  # Record the starting position
        lexer.level = 1
        lexer.begin("subquery")
        lexer.lexpos = pos

    def end_subquery(lexer):
        value = lexer.lexdata[lexer.code_start:lexer.lexpos-1]
        lexer.lineno += value.count('\n')
        lexer.begin('INITIAL')
        return value

    @lex.TOKEN(r"\(")
    def t_subquery_SUBQST(t):
        t.lexer.level += 1

    @lex.TOKEN(r"\)")
    def t_subquery_SUBQEN(t):
        t.lexer.level -= 1
        if t.lexer.level == 0:
            t.type = "SUBQUERY"
            t.value = end_subquery(t.lexer)
            return t

    @lex.TOKEN(r".")
    def t_subquery_anychar(t):
        pass

    # NOTE: Due to the nature of the ugly workaround, the CPAR gets dropped,
    # which makes it look like there is an imbalance.
    def p_value_subquery(p):
        """
        value : OPAR SUBQUERY
        """
        p[0] = "( " + p[2] + " )"

    subquery_retry_pos = None

    def p_error(p):
        global subquery_retry_pos
        if p is None:
            print >> sys.stderr, "ERROR: unexpected end of query"
        elif p.type == 'SELECT' and parser.symstack[-1].type == 'OPAR':
            lexer.input(lexer.lexdata)
            subquery_retry_pos = parser.symstack[-1].lexpos
            yacc.restart()
        else:
            print >> sys.stderr, "ERROR: Skipping unrecognized token", p.type, "(" + \
                p.value + ") at line:", p.lineno, "and column:", \
                find_column(p.lexer.lexdata, p)
            # Just discard the token and tell the parser it's okay.
            yacc.errok()

    def get_token():
        global subquery_retry_pos
        token = lexer.token()
        if token and token.lexpos == subquery_retry_pos:
            start_subquery(lexer, lexer.lexpos)
            subquery_retry_pos = None
        return token

    def parse_query(input, debug=0):
        lexer.input(input)
        return parser.parse(input, tokenfunc=get_token, debug=0)
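
To spell out the flow of the hack: the first parse attempt hits the inner SELECT right after an OPAR and fails; p_error records the position of that OPAR from the symbol stack, re-feeds the input, and restarts the parser; on the second pass, get_token flips the lexer into the subquery state as soon as it reaches the recorded position, so the whole inner query is swallowed into a single SUBQUERY token. A sketch of the two passes on an assumed input:

    # Pass 1: ... OPAR SELECT ...   -> syntax error, position recorded, restart
    # Pass 2: ... OPAR SUBQUERY ... -> p_value_subquery matches
    print parse_query("SELECT a FROM (SELECT b FROM t)")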
