ANTLR 4 lexer tokens inside other tokens

I have the following grammar for ANTLR 4:

 grammar Pattern;

 // parser rules
 parse  : string LBRACK CHAR DASH CHAR RBRACK ;
 string : (CHAR | DASH)+ ;

 // lexer rules
 DASH   : '-' ;
 LBRACK : '[' ;
 RBRACK : ']' ;
 CHAR   : [A-Za-z0-9] ;

I'm trying to parse the following line:

 ab-cd[0-9] 

The code parses ab-cd on the left, which my application treats as a literal string. It then parses [0-9] as a set of characters, which in this case means any digit. The grammar works for me, except that I do not like having (CHAR | DASH)+ as a parser rule when it is really just a token. I would prefer that the lexer create a STRING token and give me the following tokens:

 "ab-cd" "[" "0" "-" "9" "]" 

instead of these

 "ab" "-" "cd" "[" "0" "-" "9" "]" 

I looked at other examples but could not figure it out. Typically, they either put quotation marks around such string literals or rely on whitespace to separate the pieces of the input. I would like to avoid both. Can this be done with lexer rules, or do I need to keep handling it in the parser rules, as I do now?

1 answer

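In ANTLR 4, you can use lexer modes for this. Note that modes are only allowed in a lexer grammar, so the lexer rules have to move out of the combined grammar into a separate lexer grammar.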

 STRING : [a-z-]+ ;
 LBRACK : '[' -> pushMode(CharSet) ;

 mode CharSet;

 DASH   : '-' ;
 NUMBER : [0-9]+ ;
 RBRACK : ']' -> popMode ;

After matching the [ character, the lexer works in CharSet mode until it reaches the ] character, where the popMode command switches it back to the previous mode.
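
To make the mode switching concrete, here is a minimal hand-rolled sketch in Python of how a mode stack changes which rules are active. It is not ANTLR and not the code ANTLR generates. The RULES table, the "DEFAULT" mode name, and the tokenize function are hypothetical names made up for this illustration, and only the token names and patterns are taken from the grammar above.

```python
import re

# Rules per mode: (token name, pattern, action). The token names and patterns
# mirror the lexer rules in the answer; the table layout and the "DEFAULT"
# mode name are assumptions made for this illustration only.
RULES = {
    "DEFAULT": [
        ("STRING", re.compile(r"[a-z-]+"), None),
        ("LBRACK", re.compile(r"\["), "push:CharSet"),
    ],
    "CharSet": [
        ("DASH", re.compile(r"-"), None),
        ("NUMBER", re.compile(r"[0-9]+"), None),
        ("RBRACK", re.compile(r"\]"), "pop"),
    ],
}


def tokenize(text):
    """Return (token name, text) pairs, switching rule sets via a mode stack."""
    modes = ["DEFAULT"]  # the top of the stack is the active mode
    pos = 0
    tokens = []
    while pos < len(text):
        best = None
        # Only the rules of the active mode are tried. The longest match wins,
        # and on a tie the rule listed first is kept.
        for name, pattern, action in RULES[modes[-1]]:
            m = pattern.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:
            raise ValueError(f"no rule in mode {modes[-1]} matches at offset {pos}")
        name, m, action = best
        tokens.append((name, m.group()))
        pos = m.end()
        if action == "pop":
            modes.pop()  # plays the role of "-> popMode"
        elif action:
            modes.append(action.split(":", 1)[1])  # plays the role of "-> pushMode(...)"
    return tokens


if __name__ == "__main__":
    for token in tokenize("ab-cd[0-9]"):
        print(token)
```

Because the DASH rule exists only in CharSet mode, the dash in ab-cd can only be consumed by STRING, while the dash in [0-9] can only become a DASH token. That is why no quotation marks or spaces are needed to tell the two apart.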
