Parsing C files without preprocessing

I want to run a simple analysis on C files (for example, checking that if you call the foo macro with INT_TYPE as an argument, you then cast the result to int*). I don't want to preprocess the file; I just want to parse it (so that, for example, I get the correct line numbers).

I.e., I want to get from

 #include <ah>
 #define FOO(f)
 int f() {FOO(1);}

a list of tokens like

 <include_directive value="ah"/>
 <macro name="FOO"><param name="f"/><result/></macro>
 <function name="f">
   <return>int</return>
   <body>
     <macro_call name="FOO"><param>1</param></macro_call>
   </body>
 </function>

without having to specify include paths, etc.

Is there an existing parser that does this? All the parsers I know of assume that the C code has already been preprocessed. I want access to the macros and the actual directives.

+4
3 answers

Our C Front End can parse code that still contains preprocessor directives, and does so well enough to build usable ASTs. (Yes, the parse tree has accurate file/line/column information.)

There are a number of restrictions, under which it can still process most code. In the few cases it cannot handle, a small, easy change to the source file that yields equivalent code often solves the problem.

Here is an approximate set of rules and restrictions:

  • #includes and #defines can occur wherever a declaration or statement can occur, but not in the middle of a statement. They rarely cause problems.
  • Macro calls can occur where function calls occur in expressions, or can appear without a semicolon in place of statements. Macro calls that span non-well-formed chunks are not handled well (is anyone surprised?). The latter do occur now and then and need manual revision. The OP example "j (v, oid) *" is problematic, but it is really rare in code.
  • #if ... #endif must be wrapped around major language constructs (nonterminals: constant, expression, statement, declaration, function) or sequences of such entities, or around certain non-well-formed but commonly encountered idioms, such as if (exp) { . Each arm of the conditional must contain the same kind of syntax as the other arms. #if wrapped around random text, used as a bad kind of comment, is problematic, but easily fixed in the source by turning it into a real comment. Where these conditions are not met, you need to change the source code, often by moving the #if / #elif / #else / #endif a few tokens.
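The last rule is perhaps easiest to see in code. Below is a minimal sketch, assuming a made-up configuration macro USE_WIDE and made-up function names (none of this comes from a real code base, and how the tool treats this exact input is not something the answer states). In the first function each arm holds "constant plus a dangling operator", which is not a complete construct. In the second, the #if has been moved by one token so that each arm is a plain constant, and the compiled result is the same.

```c
#define USE_WIDE 1   /* hypothetical configuration macro */

/* Arms are "4 +" and "2 +": fragments that do not form a complete
   constant or expression on their own. */
int scale_before(int x) {
    return x *
#if USE_WIDE
        4 +
#else
        2 +
#endif
        1;
}

/* Same computation after moving the '+' out of the conditional:
   each arm is now a single constant, a well-formed construct. */
int scale_after(int x) {
    return x *
#if USE_WIDE
        4
#else
        2
#endif
        + 1;
}
```

Both functions compute the same value for either setting of USE_WIDE; only the placement of the conditional changes.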

In our experience, you can revise a code base of 50,000 lines in a few hours to get around these problems. Although this seems annoying (and it is), the alternative is not being able to parse the source code at all, which is much worse than annoying.

You also want more than just a parser. See Life After Parsing for what happens after you manage to get the parse tree. We have done additional work to build symbol tables in which declarations are recorded together with the preprocessor context in which they are embedded, which makes it possible to type check code in a way that accounts for the preprocessor conditionals.

+1

Your specific example can be handled by writing your own parser and ignoring macro expansion. This works because FOO(1) can be parsed as if it were an ordinary function call.
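As a rough illustration of that idea, here is a minimal sketch of a scanner that reads the text exactly as written and reports the line on which a name is used call-style. The function names are invented for this example, and it is nowhere near a real parser: it ignores comments and string literals, and it only skips directive lines (including backslash-continued ones) rather than interpreting them.

```c
#include <ctype.h>
#include <string.h>

static int is_ident_char(int c) {
    return isalnum(c) || c == '_';
}

/* Return the 1-based line of the first use of `name` followed by '('
   outside a preprocessor directive, or 0 if there is none.
   Nothing is included or expanded, so a macro call such as FOO(1)
   is reported exactly like a function call would be. */
int first_call_line(const char *src, const char *name) {
    size_t n = strlen(name);
    int line = 1;
    int at_line_start = 1;  /* only whitespace seen so far on this line */
    int in_directive = 0;   /* current logical line started with '#' */

    for (const char *p = src; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (c == '\n') {
            int continued = (p != src && p[-1] == '\\');
            line++;
            if (!continued) {
                in_directive = 0;
                at_line_start = 1;
            }
            continue;
        }
        if (at_line_start && c == '#') {
            in_directive = 1;
        }
        if (!isspace(c)) {
            at_line_start = 0;
        }
        if (in_directive || !is_ident_char(c)) {
            continue;
        }
        /* Only test at the first character of an identifier. */
        if (p != src && is_ident_char((unsigned char)p[-1])) {
            continue;
        }
        if (strncmp(p, name, n) == 0 && !is_ident_char((unsigned char)p[n])) {
            const char *q = p + n;
            while (*q == ' ' || *q == '\t') {
                q++;
            }
            if (*q == '(') {
                return line;
            }
        }
    }
    return 0;
}
```

Because no preprocessing happens, the reported line is the line in the original file. The scanner cannot tell whether FOO is a macro or a function, which is exactly the simplification this answer relies on.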

Once you consider more cases, the parser becomes much more complicated. See the linked PDF for more information.

-1

Source: https://habr.com/ru/post/1413351/

