ANTLR: How to parse a region in matching brackets using a lexer

I want to parse something like this in my lexer:

( begin expression ) 

where the expression is itself surrounded by parentheses. It doesn't matter what is in the expression; I just want everything between (begin and the matching ) as a single token. An example would be:

 (begin (define x (+ 1 2))) 

so the token text should be

 (define x (+ 1 2))

Something like

 PROGRAM : LPAREN BEGIN .* RPAREN;

(obviously) does not work, because as soon as the lexer sees a ")" it considers the rule finished, whereas I need it to stop at the matching bracket.

How can I do this?

1 answer

Inside lexer rules, you can invoke rules recursively, so that is one way to solve this. Another approach is to keep track of the number of opening and closing parentheses and let a gated semantic predicate keep the loop going as long as the counter is greater than zero.

Demonstration:

T.g

    grammar T;

    parse
      :  BeginToken {System.out.println("parsed :: " + $BeginToken.text);} EOF
      ;

    BeginToken
    @init{int open = 1;}
      :  '(' 'begin'
         (
           {open > 0}?=>      // keep repeating `( ... )*` as long as open > 0
           (  ~('(' | ')')    // match anything other than a parenthesis
           |  '(' {open++;}   // match a '(' and increase the var `open`
           |  ')' {open--;}   // match a ')' and decrease the var `open`
           )
         )*
      ;

Main.java

    import org.antlr.runtime.*;

    public class Main {
      public static void main(String[] args) throws Exception {
        String input = "(begin (define x (+ (- 1 3) 2)))";
        TLexer lexer = new TLexer(new ANTLRStringStream(input));
        TParser parser = new TParser(new CommonTokenStream(lexer));
        parser.parse();
      }
    }

Generate the lexer and parser, compile, and run:

    java -cp antlr-3.3-complete.jar org.antlr.Tool T.g
    javac -cp antlr-3.3-complete.jar *.java
    java -cp .:antlr-3.3-complete.jar Main

which prints:

    parsed :: (begin (define x (+ (- 1 3) 2)))
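
For readers who want to see the counting idea on its own, here is a minimal plain-Java sketch of what the gated predicate loop does. It is not the code ANTLR generates; the class name `BeginTokenScanner` and the method `scan` are hypothetical names chosen for illustration.

```java
// Plain-Java sketch of the counting approach; NOT ANTLR-generated code.
// BeginTokenScanner and scan are hypothetical names used for illustration.
final class BeginTokenScanner {

    /**
     * Returns the text from "(begin" at position start up to and including
     * its matching ')', or null if the input does not start with "(begin"
     * there or the parentheses never balance.
     */
    static String scan(String input, int start) {
        String prefix = "(begin";
        if (!input.startsWith(prefix, start)) {
            return null;
        }
        int open = 1;                          // the '(' of "(begin" is already open
        int i = start + prefix.length();
        // Same condition as the gated predicate {open > 0}?=> in the grammar.
        while (open > 0 && i < input.length()) {
            char c = input.charAt(i++);
            if (c == '(') {
                open++;                        // one more group to close
            } else if (c == ')') {
                open--;                        // one group closed
            }                                  // any other character is just consumed
        }
        return open == 0 ? input.substring(start, i) : null;
    }

    public static void main(String[] args) {
        // Prints: (begin (define x (+ (- 1 3) 2)))
        System.out.println(scan("(begin (define x (+ (- 1 3) 2))) (display x)", 0));
    }
}
```

The loop stops exactly when the counter returns to zero, so trailing input such as `(display x)` is left for the next token.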

Note that you need to watch out for string literals inside your source, which may contain parentheses:

    BeginToken
    @init{int open = 1;}
      :  '(' 'begin'
         (
           {open > 0}?=>            // ...
           (  ~('(' | ')' | '"')    // ...
           |  '(' {open++;}         // ...
           |  ')' {open--;}         // ...
           |  '"' ...               // TODO: define a string literal here
           )
         )*
      ;

or comments that may contain parentheses.
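
To make the pitfall concrete, here is a hedged plain-Java sketch of a counter that skips string literals and comments before counting parentheses. The literal and comment syntax is an assumption, since the question does not specify it: double-quoted strings with backslash escapes, and line comments starting with `;` (as in Scheme). `QuoteAwareScanner` and `scan` are hypothetical names.

```java
// Plain-Java sketch; NOT ANTLR code. Assumed syntax (not from the question):
// "..." strings with backslash escapes, and ';' comments running to end of line.
final class QuoteAwareScanner {

    /** Returns the "(begin ... )" text starting at start, or null if it never closes. */
    static String scan(String input, int start) {
        String prefix = "(begin";
        if (!input.startsWith(prefix, start)) {
            return null;
        }
        int open = 1;
        int i = start + prefix.length();
        while (open > 0 && i < input.length()) {
            char c = input.charAt(i++);
            if (c == '"') {
                // Skip to the closing quote; parentheses in here do not count.
                while (i < input.length() && input.charAt(i) != '"') {
                    if (input.charAt(i) == '\\') {
                        i++;                   // skip the backslash of an escape
                    }
                    i++;                       // skip the (possibly escaped) character
                }
                if (i >= input.length()) {
                    return null;               // unterminated string literal
                }
                i++;                           // consume the closing quote
            } else if (c == ';') {
                // Skip the rest of the line; parentheses in comments do not count.
                while (i < input.length() && input.charAt(i) != '\n') {
                    i++;
                }
            } else if (c == '(') {
                open++;
            } else if (c == ')') {
                open--;
            }
        }
        return open == 0 ? input.substring(start, i) : null;
    }

    public static void main(String[] args) {
        // Prints: (begin (display ")"))
        System.out.println(scan("(begin (display \")\")) tail", 0));
    }
}
```

Without the string branch, the `)` inside the literal would close `(display` early and the token would end one character too soon.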

The predicate approach relies on target-language code embedded in the grammar (Java, in this case). The advantage of invoking a lexer rule recursively is that the lexer contains no custom code:

    BeginToken
      :  '(' Spaces? 'begin' Spaces? NestedParens Spaces? ')'
      ;

    fragment NestedParens
      :  '(' ( ~('(' | ')') | NestedParens )* ')'
      ;

    fragment Spaces
      :  (' ' | '\t')+
      ;
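
As a rough illustration of how the recursive rules match, the sketch below mirrors them as plain Java methods, one per rule. It is only an analogy, not what ANTLR generates, and `RecursiveParens`, `nestedParens`, `spaces` and `beginToken` are hypothetical names. Like the grammar as written, it expects exactly one parenthesized expression after `begin`.

```java
// Plain-Java analogy of the recursive lexer rules; NOT ANTLR-generated code.
final class RecursiveParens {

    /** Like NestedParens: returns the index just past the ')' matching s[pos] == '(', or -1. */
    static int nestedParens(String s, int pos) {
        if (pos >= s.length() || s.charAt(pos) != '(') {
            return -1;
        }
        int i = pos + 1;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == ')') {
                return i + 1;                  // this group is closed
            }
            if (c == '(') {
                i = nestedParens(s, i);        // recurse into the nested group
                if (i < 0) {
                    return -1;                 // nested group never closed
                }
            } else {
                i++;                           // ~('(' | ')'): any other character
            }
        }
        return -1;                             // ran out of input before ')'
    }

    /** Like Spaces?: skips spaces and tabs, returns the new position. */
    static int spaces(String s, int pos) {
        while (pos < s.length() && (s.charAt(pos) == ' ' || s.charAt(pos) == '\t')) {
            pos++;
        }
        return pos;
    }

    /** Like BeginToken: returns the matched token text from the start of s, or null. */
    static String beginToken(String s) {
        if (s.isEmpty() || s.charAt(0) != '(') {
            return null;
        }
        int i = spaces(s, 1);
        if (!s.startsWith("begin", i)) {
            return null;
        }
        i = nestedParens(s, spaces(s, i + "begin".length()));
        if (i < 0) {
            return null;
        }
        i = spaces(s, i);
        if (i >= s.length() || s.charAt(i) != ')') {
            return null;
        }
        return s.substring(0, i + 1);
    }

    public static void main(String[] args) {
        // Prints: (begin (define x (+ (- 1 3) 2)))
        System.out.println(beginToken("(begin (define x (+ (- 1 3) 2)))"));
    }
}
```

The nesting depth is tracked implicitly by the call stack instead of an explicit counter, which is why the grammar version needs no embedded Java.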