C/C++ source parsing: how are token boundary/adjacency rules expressed in lex/yacc?

I want to analyze some C++ code, and as a guide I looked at the lex/yacc definitions of C: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html and http://www.lysator.liu.se/c/ANSI-C-grammar-y.html

I understand the specifications of the individual tokens, but not how they interact. For example, it is legal for an operator such as = to immediately follow an identifier without an intervening space (i.e. "foo="), but not for a numeric constant to be immediately followed by an identifier (i.e. "123foo"). However, I do not see any way to express such rules.

What am I missing? Or are lex/yacc simply too liberal about accepting such errors?

4 answers

The lexer converts the character stream into a token stream (that is what the token specifications you mention describe). The grammar then defines which sequences of tokens are acceptable. So you will not see anything stating what is *not* allowed; you only see what *is* permitted. Does that make sense?
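As a hedged illustration of that point (the rule and token names below are made up for this answer, not taken from the lysator grammar): a yacc grammar only ever enumerates the legal token sequences, so prohibitions are implicit.

```yacc
assignment_expression
    : IDENTIFIER '=' expression     /* "foo = bar" is derivable */
    ;

primary_expression
    : IDENTIFIER
    | CONSTANT
    ;
/* No production ever places CONSTANT immediately before IDENTIFIER,
 * so "123 foo" simply has no parse -- nothing needs to forbid it. */
```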

EDIT

If the point is to make the lexer distinguish the sequence "123foo" from the sequence "123 foo", one way is to add an explicit specification for "123foo". Another way is to treat whitespace as significant.
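A hedged sketch of the first option (this rule is illustrative and is not part of the lysator lexer): because lex prefers the longest match, a digits-then-identifier-characters rule beats the plain integer rule on "123foo", so the bad token can be reported right in the lexer.

```lex
[0-9]+[a-zA-Z_][a-zA-Z_0-9]*   {
        /* Longest match: this consumes all of "123foo", outcompeting
         * the bare integer rule, which would only match "123". */
        fprintf(stderr, "invalid token: %s\n", yytext);
        /* return an ERROR token here, or report via yyerror() */
    }
```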

EDIT2

A syntax error can be detected in the lexer, in a grammar production, or in later stages of the compiler (for example, type errors, which colloquially are still "syntax errors"). Which part of the overall compilation process detects which error is largely a design question, since it affects the quality of the error messages. In this example, it probably makes sense to outlaw "123foo" during tokenization as an invalid token, rather than relying on the absence of a production that puts a numeric literal directly before an identifier (at least, that is GCC's behaviour).


The lexer is fine with 123foo; it will split it into two tokens:

  • an integer constant
  • and an identifier.

But try to find a production in the grammar that allows these two tokens to sit side by side. There is none, so I am pretty sure it is the parser (not the lexer) that reports an error when it sees these two tokens adjacent.

Note that the lexer does not care about whitespace (unless you explicitly tell it to, so be careful). In this case it just throws whitespace away:

[ \t\v\n\f] { count(); } // Throw away white space without looking. 

Just to check, here is what I did:

 wget -O ll http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
 wget -O yy http://www.lysator.liu.se/c/ANSI-C-grammar-y.html

Then I edited the ll file to stop the compiler complaining about undeclared functions:

 #include "y.tab.h"

 /* Add the following declarations: */
 int  yywrap();
 void count();
 void comment();
 int  check_type();
 /* Done adding lines */
 %}

Create the following file, main.c:

 #include <stdio.h>

 extern int yylex();

 int main()
 {
     int x;
     while ((x = yylex()) != 0) {
         fprintf(stdout, "Token(%d)\n", x);
     }
 }

Build it:

 $ bison -d yy
 yy: conflicts: 1 shift/reduce
 $ flex ll
 $ gcc main.c lex.yy.c
 $ ./a.out
 123foo
 123Token(259)
 fooToken(258)

Yes, it split the input into two tokens.


What basically happens is that the lexical rule for each token type is greedy. For example, the character sequence foo= cannot be interpreted as a single identifier, because identifiers cannot contain the = character. 123abc, on the other hand, really is a single numeric constant, albeit a malformed one, because numeric constants may end in a sequence of alphabetic characters that expresses the constant's type (suffixes such as the L in 123L or the u in 45u).


You will not be able to parse C++ with plain lex and yacc, as C++ has an ambiguous grammar. You will need a more powerful approach, such as GLR parsing, or some hackish solution that modifies the lexer at runtime (which is what most production C++ parsers do).

Take a look at Elsa / Elkhound.

