ANTLR: How to parse a region in matching brackets using a lexer

Question

I want to parse something like this in my lexer:

( begin expression )

where expressions are also surrounded by parentheses. It doesn't matter what is in the expression; I just want everything between (begin and the matching ) as a single token. An example would be:

 (begin (define x (+ 1 2)))

therefore the token text should be

 (define x (+ 1 2)))

Something like

 PROGRAM : LPAREN BEGIN .* RPAREN;

(obviously) does not work, because as soon as the lexer sees a ")", it considers the rule finished, whereas I need it to stop only at the matching closing parenthesis.

How can I do this?

matching brackets antlr lexer

Sebastian Aug 2 '11 at 21:00

1 answer

Bart Kiers · Accepted Answer · 2011-08-03T05:01:57+0000

Inside lexer rules, you can invoke rules recursively, so that is one way to solve this. Another approach is to keep track of the number of open and close parentheses, and let a gated semantic predicate keep a loop going as long as your counter is greater than zero.

Demonstration:

T.g

    grammar T;

    parse
      :  BeginToken {System.out.println("parsed :: " + $BeginToken.text);} EOF
      ;

    BeginToken
    @init{int open = 1;}
      :  '(' 'begin'
         (
           {open > 0}?=>        // keep repeating `( ... )*` as long as open > 0
           (  ~('(' | ')')      // match anything other than a parenthesis
           |  '(' {open++;}     // match a '(' and increase the var `open`
           |  ')' {open--;}     // match a ')' and decrease the var `open`
           )
         )*
      ;

Main.java

    import org.antlr.runtime.*;

    public class Main {
      public static void main(String[] args) throws Exception {
        String input = "(begin (define x (+ (- 1 3) 2)))";
        TLexer lexer = new TLexer(new ANTLRStringStream(input));
        TParser parser = new TParser(new CommonTokenStream(lexer));
        parser.parse();
      }
    }

    java -cp antlr-3.3-complete.jar org.antlr.Tool T.g
    javac -cp antlr-3.3-complete.jar *.java
    java -cp .:antlr-3.3-complete.jar Main

    parsed :: (begin (define x (+ (- 1 3) 2)))
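To make the counter logic easier to follow outside of ANTLR, here is a minimal plain-Java sketch of the same idea. The class `BeginRegionScanner` and the method `matchBeginRegion` are hypothetical names chosen for illustration; they are not part of ANTLR or of the generated lexer.

```java
// Hypothetical helper (not part of ANTLR): mirrors the counter used in BeginToken.
final class BeginRegionScanner {

    /**
     * Returns the index just past the ')' that closes the "(begin" found at
     * position `start`, or -1 if the input does not start with "(begin" there
     * or the parentheses never balance.
     */
    static int matchBeginRegion(String input, int start) {
        String prefix = "(begin";
        if (!input.startsWith(prefix, start)) {
            return -1;
        }
        int open = 1;                        // the '(' of "(begin" is already open
        int i = start + prefix.length();
        while (open > 0 && i < input.length()) {
            char c = input.charAt(i++);
            if (c == '(') {
                open++;                      // one more level to close
            } else if (c == ')') {
                open--;                      // one level closed
            }                                // any other character is simply consumed
        }
        return open == 0 ? i : -1;
    }

    public static void main(String[] args) {
        String input = "(begin (define x (+ (- 1 3) 2))) trailing";
        int end = matchBeginRegion(input, 0);
        System.out.println(input.substring(0, end));
    }
}
```

As in the lexer rule, the loop stops as soon as the counter drops to zero, so anything after the matching `)` is left untouched.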

Note that you need to watch out for string literals inside your source, which might contain parentheses:

    BeginToken
    @init{int open = 1;}
      :  '(' 'begin'
         (
           {open > 0}?=>              // ...
           (  ~('(' | ')' | '"')      // ...
           |  '(' {open++;}           // ...
           |  ')' {open--;}           // ...
           |  '"' ...                 // TODO: define a string literal here
           )
         )*
      ;

or comments that may contain parentheses.
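The effect of skipping string literals can be sketched in plain Java. This is only an illustration under stated assumptions: strings are taken to be double-quoted with backslash escapes (the original does not define a string syntax), comments are not handled, and `StringAwareScanner` and `matchBeginRegion` are hypothetical names.

```java
// Hypothetical helper (not part of ANTLR): counts parentheses but ignores
// those that appear inside double-quoted strings.
final class StringAwareScanner {

    /**
     * Returns the index just past the ')' that closes the "(begin" found at
     * `start`, or -1 on a missing prefix, unbalanced parentheses, or an
     * unterminated string.
     */
    static int matchBeginRegion(String input, int start) {
        String prefix = "(begin";
        if (!input.startsWith(prefix, start)) {
            return -1;
        }
        int open = 1;
        int i = start + prefix.length();
        while (open > 0 && i < input.length()) {
            char c = input.charAt(i++);
            if (c == '(') {
                open++;
            } else if (c == ')') {
                open--;
            } else if (c == '"') {
                // Assumed string syntax: skip to the closing quote,
                // treating a backslash as escaping the next character.
                boolean closed = false;
                while (i < input.length()) {
                    char s = input.charAt(i++);
                    if (s == '\\') {
                        i++;                 // skip the escaped character
                    } else if (s == '"') {
                        closed = true;
                        break;
                    }
                }
                if (!closed) {
                    return -1;               // unterminated string literal
                }
            }
        }
        return open == 0 ? i : -1;
    }
}
```

With this variant, the `)` inside `(begin (display ")"))` no longer decrements the counter, so the region ends at the final parenthesis.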

The predicate approach uses some language-specific code (Java, in this case). An advantage of calling a lexer rule recursively is that you don't need any custom code in your lexer:

    BeginToken
      :  '(' Spaces? 'begin' Spaces? NestedParens Spaces? ')'
      ;

    fragment NestedParens
      :  '(' ( ~('(' | ')') | NestedParens )* ')'
      ;

    fragment Spaces
      :  (' ' | '\t')+
      ;
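For readers less familiar with recursive lexer rules, the `NestedParens` fragment behaves roughly like the following recursive Java method. This is a sketch for illustration only; `NestedParensMatcher` and `matchNestedParens` are hypothetical names, and the code ANTLR actually generates looks different.

```java
// Hypothetical helper (not generated by ANTLR): a hand-written equivalent of
//   fragment NestedParens : '(' ( ~('(' | ')') | NestedParens )* ')' ;
final class NestedParensMatcher {

    /**
     * Tries to match one balanced "( ... )" group starting at `pos`.
     * Returns the index just past its closing ')', or -1 if there is no match.
     */
    static int matchNestedParens(String s, int pos) {
        if (pos >= s.length() || s.charAt(pos) != '(') {
            return -1;                       // the rule must start with '('
        }
        int i = pos + 1;
        while (i < s.length() && s.charAt(i) != ')') {
            if (s.charAt(i) == '(') {
                i = matchNestedParens(s, i); // recurse for the inner group
                if (i < 0) {
                    return -1;               // inner group was never closed
                }
            } else {
                i++;                         // any character other than a parenthesis
            }
        }
        return i < s.length() ? i + 1 : -1;  // consume the closing ')'
    }

    public static void main(String[] args) {
        String input = "(define x (+ 1 2)) rest";
        int end = matchNestedParens(input, 0);
        System.out.println(input.substring(0, end));
    }
}
```

Nesting depth is tracked implicitly by the call stack here, which is why this variant needs no explicit counter.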
