ANTLR rule to consume a fixed number of characters

I am trying to write an ANTLR grammar for the PHP serialize() format, and everything works fine except for strings. The problem is that the format of serialized strings is:

s:6:"length"; 

In terms of regular expressions, a rule such as s:(\d+):".{\1}"; would describe this format, if only backreferences were allowed inside the repetition count (but they are not).

But I can't find a way to express this in either a lexer or a parser grammar: the whole idea is to make the number of characters read depend on an earlier count that states how many characters follow, as in Fortran Hollerith constants (e.g. 6HLength), and not on a string delimiter.

This example from an ANTLR grammar for Fortran seems to point the way, but I don't see how to apply it. Note that my target language is Python, while most of the documentation and examples are for Java:

    // numeral literal
    ICON {int counter=0;} :
        /* other alternatives */

        // hollerith
        'h' ({counter>0}? NOTNL {counter--;})* {counter==0}?
          {
          $setType(HOLLERITH);
          String str = $getText;
          str = str.replaceFirst("([0-9])+h", "");
          $setText(str);
          }

        /* more alternatives */
        ;
1 answer

Since input such as s:3:"a"b"; is valid, you cannot define a String token in your lexer unless the first and last double quotes are always the start and end of the string. But I guess that is not the case.
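
To make the problem concrete, here is a small Python sketch (not part of the grammar) of a naive, quote-delimited pattern. The name NAIVE and the pattern itself are only illustrative assumptions; the point is that a rule relying on the quotes as delimiters rejects a payload that contains a double quote:

```python
import re

# Hypothetical naive rule: the payload is "everything up to the next double quote".
NAIVE = re.compile(r's:(\d+):"([^"]*)";')

# An ordinary payload is matched.
print(NAIVE.match('s:6:"length";') is not None)

# A payload that itself contains a double quote is valid input, but is not matched.
print(NAIVE.match('s:3:"a"b";') is not None)
```

The first check succeeds and the second fails, which is why the declared length, and not the quotes, has to drive the match.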

So you'll need a lexer rule like this:

 SString : 's:' Int ':"' ( . )* '";' ; 

In other words: match s:, then an integer value, followed by :", then zero or more characters of any kind, ending with ";. But you need to tell the lexer to stop consuming once the number of characters given by Int has been read. You can do this by embedding some plain code in your grammar, wrapped inside { and }. So first convert the text of the Int token into an integer variable called chars:

 SString : 's:' Int {chars = int($Int.text)} ':"' ( . )* '";' ; 

Now embed some code inside the ( . )* loop to stop it from consuming as soon as chars has counted down to zero:

 SString : 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";' ; 

and that's it.
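
For readers less familiar with embedded actions, the counting logic of the rule above can be sketched in plain Python. This is only an illustration of the idea, not what ANTLR generates: the function name read_sstring and its signature are hypothetical, and it counts Python characters, so it assumes the payload consists of single-byte characters (PHP itself states the length in bytes).

```python
def read_sstring(text, pos=0):
    """Read one s:N:"..."; item starting at pos; return (payload, next_pos)."""
    if not text.startswith('s:', pos):
        raise ValueError('expected s: at offset %d' % pos)
    colon = text.index(':"', pos + 2)     # end of the Int part
    chars = int(text[pos + 2:colon])      # same role as {chars = int($Int.text)}
    i = colon + 2                         # first character after :"
    payload = []
    while chars != 0:                     # mirrors {if chars == 0: break}
        if i >= len(text):
            raise ValueError('input ended before the declared length was read')
        payload.append(text[i])           # the '.' in the rule: any character
        i += 1
        chars -= 1                        # mirrors {chars = chars-1}
    if not text.startswith('";', i):
        raise ValueError('expected "; at offset %d' % i)
    return ''.join(payload), i + 2


data = 's:6:"length";s:1:""";s:0:"";s:3:"end";'
pos = 0
items = []
while pos < len(data):
    payload, pos = read_sstring(data, pos)
    items.append(payload)
print(items)  # ['length', '"', '', 'end']
```

Because the loop is driven by the declared count, a double quote inside the payload is consumed like any other character.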

A small demo grammar:

    grammar Test;

    options {
      language=Python;
    }

    parse
      :  (SString {print 'parsed: [\%s]' \% $SString.text})+ EOF
      ;

    SString
      :  's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
      ;

    Int
      :  '0'..'9'+
      ;

(note that you need to escape the % inside your grammar!)

And the test script:

    import antlr3
    from TestLexer import TestLexer
    from TestParser import TestParser

    input = 's:6:"length";s:1:""";s:0:"";s:3:"end";'
    char_stream = antlr3.ANTLRStringStream(input)
    lexer = TestLexer(char_stream)
    tokens = antlr3.CommonTokenStream(lexer)
    parser = TestParser(tokens)
    parser.parse()

which produces the following output:

    parsed: [s:6:"length";]
    parsed: [s:1:""";]
    parsed: [s:0:"";]
    parsed: [s:3:"end";]