Grammar for the simultaneous analysis of multiline comments and string literals

I am trying to parse C++/Java-style source files and would like to isolate comments, string literals and whitespace as tokens.

For whitespace and comments, the commonly suggested solution (as an ANTLR grammar) is:

    // WS comments*****************************
    WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
    ML_COMMENT: '/*' (options {greedy=false;}: .)* '*/' {$channel=HIDDEN;};
    SL_COMMENT: '//' (options {greedy=false;}: .)* '\r'? '\n' {$channel=HIDDEN;};

But the problem is that my source files also contain string literals, for example:

    printf(" /* something looks like comment and whitespace \n");
    printf(" something looks like comment and whitespace */ \n");

Everything inside "" should be considered one token, but my ANTLR lexer rules will obviously treat the following as an ML_COMMENT token:

    /* something looks like comment and whitespace \n");
    printf(" something looks like comment and whitespace */

But I can't create another lexer rule that defines a token as anything inside a pair of " (assuming escape sequences are handled properly), because the following would then be erroneously treated as a string literal:

 /* comment...."comment that looks */ /*like a string literal"...more comment */ 

In short, the two pairs /* */ and "" interfere with each other, because each of them may contain the start of the other as valid content. So how do we define a lexer grammar that handles both cases?

2 answers

JavaMan wrote:

I am trying to parse C++/Java-style source files and would like to isolate comments, string literals and whitespace as tokens.

Shouldn't you match char literals as well? Consider:

 char c = '"'; 

The double quote should not be considered the beginning of a string literal!

JavaMan wrote:

In short, the two pairs /* */ and "" interfere with each other.

Err, no. If a /* is "seen" first, it consumes everything up to the first */. For input like:

 /* comment...."comment that looks like a string literal"...more comment */ 

this means the double quotes are consumed as part of the comment too. The same goes for string literals: when a double quote is seen first, any /* and/or */ will be consumed up to the next (unescaped) ".

Or did I misunderstand?
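The "whichever opener is seen first wins" behaviour described above can be checked outside ANTLR with a small sketch. It uses a java.util.regex alternation as a stand-in for the lexer rules; the class name FirstOpenerWins, the method tokens and the group names are made up for this illustration, and the escape handling is simplified (any character may follow a backslash).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not ANTLR-generated code.
class FirstOpenerWins {
    // One alternative per token kind. find() scans left to right, so the
    // opener that appears first decides which alternative consumes the text.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<ML>/\\*[\\s\\S]*?\\*/)"                 // block comment, non-greedy
        + "|(?<SL>//[^\\r\\n]*)"                      // line comment
        + "|(?<STR>\"(?:\\\\.|[^\\\\\"\\r\\n])*\")"   // string literal with escapes
        + "|(?<CH>'(?:\\\\.|[^\\\\'\\r\\n])')");      // char literal with escapes

    /** Returns "KIND:text" for every comment, string and char literal found. */
    static List<String> tokens(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            String kind = m.group("ML") != null ? "ML_COMMENT"
                        : m.group("SL") != null ? "SL_COMMENT"
                        : m.group("STR") != null ? "STRING"
                        : "CHAR";
            out.add(kind + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // A quote inside a comment: the comment opener comes first.
        System.out.println(tokens(
            "/* comment....\"comment that looks like a string literal\"...more comment */"));
        // Comment markers inside strings: the quote comes first.
        System.out.println(tokens(
            "printf(\" /* x \\n\"); printf(\" y */ \\n\");"));
    }
}
```

With these inputs the first call yields a single ML_COMMENT entry covering the whole comment, and the second yields two STRING entries, one per printf argument.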

Note that you can drop the options {greedy=false;}: from your grammar in front of .* or .+, which are non-greedy by default.
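For readers unfamiliar with the term, the difference between greedy and non-greedy matching can be seen with plain java.util.regex. This is only an analogy: in Java regex the non-greedy form has to be written explicitly as .*?, and the class name GreedyDemo and the helper firstMatch are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of greedy vs. non-greedy matching.
class GreedyDemo {
    /** Returns the first match of regex in input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "/* one */ int x; /* two */";
        // Non-greedy: stops at the first closing */
        System.out.println(firstMatch("/\\*.*?\\*/", input));
        // Greedy: runs on to the last closing */ and swallows the code in between
        System.out.println(firstMatch("/\\*.*\\*/", input));
    }
}
```

The non-greedy pattern returns only the first comment, while the greedy one returns the entire input, which is why a comment rule needs the non-greedy behaviour.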

Here's a way:

    grammar T;

    parse
      :  (t=. {
            if($t.type != OTHER) {
              System.out.printf("\%-10s >\%s<\n", tokenNames[$t.type], $t.text);
            }
          }
         )+
         EOF
      ;

    ML_COMMENT
      :  '/*' .* '*/'
      ;

    SL_COMMENT
      :  '//' ~('\r' | '\n')*
      ;

    STRING
      :  '"' (STR_ESC | ~('\\' | '"' | '\r' | '\n'))* '"'
      ;

    CHAR
      :  '\'' (CH_ESC | ~('\\' | '\'' | '\r' | '\n')) '\''
      ;

    SPACE
      :  (' ' | '\t' | '\r' | '\n')+
      ;

    OTHER
      :  . // fall-through rule: matches any char if none of the above matched
      ;

    fragment STR_ESC
      :  '\\' ('\\' | '"' | 't' | 'n' | 'r') // add more: Unicode escapes, ...
      ;

    fragment CH_ESC
      :  '\\' ('\\' | '\'' | 't' | 'n' | 'r') // add more: Unicode escapes, Octal, ...
      ;

which can be tested with:

    import org.antlr.runtime.*;

    public class Main {
        public static void main(String[] args) throws Exception {
            String source =
                "String s = \" foo \\t /* bar */ baz\";\n" +
                "char c = '\"'; // comment /* here\n" +
                "/* multi \"no string\"\n" +
                " line */";
            System.out.println(source + "\n-------------------------");
            TLexer lexer = new TLexer(new ANTLRStringStream(source));
            TParser parser = new TParser(new CommonTokenStream(lexer));
            parser.parse();
        }
    }

If you run the class above, the following will be printed to the console:

    String s = " foo \t /* bar */ baz";
    char c = '"'; // comment /* here
    /* multi "no string"
     line */
    -------------------------
    SPACE      > <
    SPACE      > <
    SPACE      > <
    STRING     >" foo \t /* bar */ baz"<
    SPACE      >
    <
    SPACE      > <
    SPACE      > <
    SPACE      > <
    CHAR       >'"'<
    SPACE      > <
    SL_COMMENT >// comment /* here<
    SPACE      >
    <
    ML_COMMENT >/* multi "no string"
     line */<

Basically your problem is this: inside a string literal you need to ignore comment markers (/* and //), and vice versa. IMO this can only be solved by reading sequentially. When passing through the source file character by character, you can treat it as a state machine with the states Text, BlockComment, LineComment and StringLiteral.

This is a difficult problem to solve using regular expressions, or even a grammar.

Remember that any C/C++/C#/Java lexer has to handle this same problem. I am quite sure they use a state machine solution. So I suggest that, if possible, you set up your lexer this way.
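The state machine described above could be sketched roughly as follows. This is a minimal illustration under a few assumptions: only the four states named in this answer are modelled (char literals are left out), a backslash inside a string simply skips the next character, and all names (CommentStringScanner, scan, the "KIND:text" output format) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; not how ANTLR or any specific compiler does it.
class CommentStringScanner {
    enum State { TEXT, BLOCK_COMMENT, LINE_COMMENT, STRING_LITERAL }

    /** Returns "KIND:text" entries for comments and string literals only. */
    static List<String> scan(String src) {
        List<String> out = new ArrayList<>();
        State state = State.TEXT;
        int start = 0; // index where the current comment or string began
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            switch (state) {
                case TEXT:
                    if (c == '/' && next == '*') { state = State.BLOCK_COMMENT; start = i; i++; }
                    else if (c == '/' && next == '/') { state = State.LINE_COMMENT; start = i; i++; }
                    else if (c == '"') { state = State.STRING_LITERAL; start = i; }
                    break;
                case BLOCK_COMMENT: // quotes are ordinary characters here
                    if (c == '*' && next == '/') {
                        i++;
                        out.add("BLOCK_COMMENT:" + src.substring(start, i + 1));
                        state = State.TEXT;
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\r' || c == '\n') {
                        out.add("LINE_COMMENT:" + src.substring(start, i));
                        state = State.TEXT;
                    }
                    break;
                case STRING_LITERAL: // comment markers are ordinary characters here
                    if (c == '\\') { i++; } // skip the escaped character
                    else if (c == '"') {
                        out.add("STRING_LITERAL:" + src.substring(start, i + 1));
                        state = State.TEXT;
                    }
                    break;
            }
        }
        if (state == State.LINE_COMMENT) { // line comment running to end of input
            out.add("LINE_COMMENT:" + src.substring(start));
        }
        return out;
    }

    public static void main(String[] args) {
        String src = "printf(\" /* not a comment */ \\n\"); /* a \"comment\" */ // tail";
        for (String chunk : scan(src)) System.out.println(chunk);
    }
}
```

Because the current state decides how each character is interpreted, a quote inside a block comment never opens a string, and a /* inside a string never opens a comment.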
