JavaMan wrote:
I am trying to parse C++/Java source files and would like to isolate comments, string literals and whitespace as tokens.
Shouldn't you also match char literals? Consider:
char c = '"';
The double quote should not be considered the beginning of a string literal!
JavaMan wrote:
In short, the 2 pairs /* */ and "" will interfere with each other.
Err, no. If a /* is "seen" first, it will consume everything up to the first */. For input where a double quote sits inside the comment, this means the double quote is consumed as well. The same goes for string literals: when a double quote is seen first, any /* and/or */ is consumed up to the next (unescaped) ".
Or did I misunderstand?
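To make the "whichever opener is seen first wins" behaviour concrete, here is a minimal hand-written sketch in plain Java. It is not ANTLR and not the grammar from this answer: the class name FirstOpenerWins, the method scan and the token labels are made up for illustration, and it deliberately ignores // comments and whitespace tokens.

```java
import java.util.ArrayList;
import java.util.List;

public class FirstOpenerWins {

    /** Splits src into block comments, string literals, char literals and everything else. */
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            int end;
            if (src.startsWith("/*", i)) {
                // Comment opener seen first: consume up to the first "*/", quotes included.
                int close = src.indexOf("*/", i + 2);
                end = (close < 0) ? src.length() : close + 2;
                tokens.add("ML_COMMENT " + src.substring(i, end));
            } else if (src.charAt(i) == '"' || src.charAt(i) == '\'') {
                // Quote seen first: consume up to the next unescaped matching quote.
                char quote = src.charAt(i);
                end = i + 1;
                while (end < src.length() && src.charAt(end) != quote) {
                    end += (src.charAt(end) == '\\') ? 2 : 1; // skip an escaped character
                }
                end = Math.min(end + 1, src.length());
                tokens.add((quote == '"' ? "STRING " : "CHAR ") + src.substring(i, end));
            } else {
                // Anything else: a run of characters up to the next opener.
                end = i + 1;
                while (end < src.length() && !src.startsWith("/*", end)
                        && src.charAt(end) != '"' && src.charAt(end) != '\'') {
                    end++;
                }
                tokens.add("OTHER " + src.substring(i, end));
            }
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A string containing a comment, a comment containing a string, and the char literal '"'.
        for (String token : scan("a = \"/* x */\"; /* \"y\" */ c = '\"';")) {
            System.out.println(token);
        }
    }
}
```

With that input, the string literal keeps its /* x */, the comment keeps its "y", and '"' comes out as a single CHAR token, so the pairs do not interfere.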
Note that you can drop the options {greedy=false;}: from your grammar before .* or .+, which are non-greedy by default.
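As an analogy only, the difference between greedy and non-greedy matching can be seen with java.util.regex. This is not ANTLR: in Java regexes .* is greedy unless you write .*?, which is the opposite default. The class name GreedyVsReluctant and the helper firstMatch are made up for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {

    /** Returns the first match of regex in input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        // DOTALL lets '.' match line breaks too, as a block comment may span lines.
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "/* one */ int x; /* two */";

        // Non-greedy (reluctant): stops at the first "*/".
        System.out.println(firstMatch("/\\*.*?\\*/", input)); // prints: /* one */

        // Greedy: runs on to the last "*/" and swallows the code in between.
        System.out.println(firstMatch("/\\*.*\\*/", input));  // prints: /* one */ int x; /* two */
    }
}
```

The non-greedy form is the behaviour you want for a block comment rule, since a comment ends at the first */.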
Here's a way:
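    grammar T;

    // Print every token except the fall-through OTHER tokens.
    parse
      :  (t=.
           {
             if($t.type != OTHER) {
               System.out.printf("\%-10s >\%s<\n", tokenNames[$t.type], $t.text);
             }
           }
         )+
         EOF
      ;

    ML_COMMENT
      :  '/*' .* '*/'
      ;

    SL_COMMENT
      :  '//' ~('\r' | '\n')*
      ;

    STRING
      :  '"' (STR_ESC | ~('\\' | '"' | '\r' | '\n'))* '"'
      ;

    CHAR
      :  '\'' (CH_ESC | ~('\\' | '\'' | '\r' | '\n')) '\''
      ;

    SPACE
      :  (' ' | '\t' | '\r' | '\n')+
      ;

    // Fall-through rule: matches any single char not matched above.
    OTHER
      :  .
      ;

    // STR_ESC and CH_ESC are referenced above but not shown in the post;
    // these are assumed minimal definitions, extend them as needed.
    fragment STR_ESC
      :  '\\' ('\\' | '"' | 't' | 'n' | 'r')
      ;

    fragment CH_ESC
      :  '\\' ('\\' | '\'' | 't' | 'n' | 'r')
      ;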
which can be tested with:
    import org.antlr.runtime.*;

    public class Main {
        public static void main(String[] args) throws Exception {
            String source =
                "String s = \" foo \\t /* bar */ baz\";\n" +
                "char c = '\"'; // comment /* here\n" +
                "/* multi \"no string\"\n" +
                " line */";
            System.out.println(source + "\n-------------------------");
            TLexer lexer = new TLexer(new ANTLRStringStream(source));
            TParser parser = new TParser(new CommonTokenStream(lexer));
            parser.parse();
        }
    }
If you run the class above, the following will be printed to the console:
    String s = " foo \t /* bar */ baz";
    char c = '"'; // comment /* here
    /* multi "no string"
     line */
    -------------------------
    SPACE      > <
    SPACE      > <
    SPACE      > <
    STRING     >" foo \t /* bar */ baz"<
    SPACE      >
    <
    SPACE      > <
    SPACE      > <
    SPACE      > <
    CHAR       >'"'<
    SPACE      > <
    SL_COMMENT >// comment /* here<
    SPACE      >
    <
    ML_COMMENT >/* multi "no string"
     line */<