Using ANTLR to analyze and modify source code; Am I doing it wrong?

Question

I am writing a program that needs to parse a JavaScript source file, extract some facts, and insert or replace portions of the code. A simplified description of what I need to do is, given this code:

foo(['a', 'b', 'c']);

Extract 'a', 'b' and 'c' and rewrite the code as:

 foo('bar', [0, 1, 2]);

I am using ANTLR for my parsing needs, generating C# 3 code. Someone else had already contributed a JavaScript grammar. Parsing the source code works.

The problem I am facing is how to properly analyze and modify the source file. Every approach I try in order to actually solve the problem leads me to a dead end. I cannot help but think that I am not using the tool as intended, or that I am simply too much of a novice when it comes to working with ASTs.

My first approach was to parse using a TokenRewriteStream and implement the partial EnterRule_* methods for the rules I am interested in. Although this seems to make modifying the token stream quite easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens that does not tell me enough about the overall code structure. For example, to detect whether the function foo is called, simply looking at the first token would not work, because that would also falsely match:

 a.b.foo();
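
To make the false match concrete, here is a minimal sketch in JavaScript. It does not use ANTLR: `tokenize` is a hypothetical regex-based stand-in for the lexer, and both check functions are names invented for this illustration. It only shows that a check on a flat token stream has to look at neighbouring tokens, and that every such refinement is something a tree would give for free.

```javascript
// Hypothetical stand-in for a lexer: splits source into identifiers,
// single-quoted string literals, integers, and single punctuation characters.
function tokenize(source) {
  return source.match(/[A-Za-z_$][\w$]*|'[^']*'|\d+|[^\s\w]/g) || [];
}

// Naive check on the flat token stream: identifier `name` followed by "(".
function naiveCallsFunction(tokens, name) {
  return tokens.some((tok, i) => tok === name && tokens[i + 1] === "(");
}

// Refined check: additionally reject a preceding ".", which marks a
// member access such as a.b.foo().
function callsBareFunction(tokens, name) {
  return tokens.some(
    (tok, i) => tok === name && tokens[i + 1] === "(" && tokens[i - 1] !== "."
  );
}

const direct = tokenize("foo(['a', 'b', 'c']);");
const member = tokenize("a.b.foo();");

console.log(naiveCallsFunction(member, "foo")); // true (the false match)
console.log(callsBareFunction(member, "foo"));  // false
console.log(callsBareFunction(direct, "foo"));  // true
```

Even the refined check is still fragile (comments, nested expressions and so on), which is why the next approach moves to a tree.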

To allow more complicated code analysis, my second approach was to modify the grammar with rewrite rules so that it produces more of a tree. Now the first sample code block produces this:

  Program
     CallExpression
         Identifier ('foo')
         ArgumentList
             ArrayLiteral
                 StringLiteral ('a')
                 StringLiteral ('b')
                 StringLiteral ('c')

This works great for code analysis. However, now I cannot easily rewrite the code. Of course, I could modify the tree structure to represent the code I want, but I cannot use the tree to output source code. I was hoping that the token associated with each node would at least give me enough information to find out where in the source text I need to make changes, but all I get is a token index or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.

I suspect I am missing something in understanding how to use ANTLR correctly to do what I need. Is there a better way to solve this problem?

+7

parsing code-analysis antlr antlr3

Jacob Jul 23 '12 at 7:28


3 answers

What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming that a parser is all you need, then discovering that it is not and that you have to fill in the gap yourself.

Tools that do this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern-directed), and prettyprinters that convert the (modified) AST back into legal source code. You seem to be struggling with the fact that ANTLR does not come with prettyprinters; that is not part of its philosophy. ANTLR is a (fine) parser generator. Other answers have suggested using ANTLR's "string templates", which are not prettyprinters by themselves but can be used to implement one, at the price of implementing one. This is harder to do than it looks; see my SO answer on compiling an AST back to source code.

The real issue here is the widely made but false assumption that "if I have a parser, I am well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling than "just" a parser, unless you want to rebuild a significant fraction of the infrastructure yourself instead of getting on with your task. Other useful features of practical program transformation systems typically include source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.

For example, if you had source-to-source transformation capabilities (such as those of our tool, the DMS Software Reengineering Toolkit), you would be able to write parts of your example code changes using these DMS transforms:

    domain ECMAScript.

    tag replace;  -- says this is a special kind of temporary tree

    rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
        expression->expression
     =  " \function_name ( '[' \list ']' ) "
     -> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";

    rule replace_unit_list(s:character_literal):
        expression_list -> expression_list
        replace(s) -> compute_index_for(s);

    rule replace_long_list(s:character_list, list:expression_list):
        expression_list -> expression_list
        "\replace\(\s\,\list)-> "compute_index_for\(\s\),\list";

with external meta-procedures first_arg (which knows how to compute 'bar' given the identifier 'foo'; I am guessing this is what you want) and compute_index_for, which, given a string literal, knows which integer to replace it with.

Individual rewrite rules have parameter lists "(....)" that name the slots representing subtrees, a left-hand side acting as a pattern to match, and a right-hand side acting as the replacement. Both sides are usually quoted in metaquotes ", which separate rewrite-rule language text from target-language (for example, JavaScript) text. Inside the metaquotes there are many meta-escapes (\), each of which indicates a special rewrite-rule language item. Typically these are parameter names, which stand for whatever kind of tree the parameter represents; or they mark a call to an external meta-procedure (such as first_arg; you will notice that its argument list ( , ) is metaquoted!); or, finally, a "tag" such as "replace", which is a peculiar kind of tree that represents the future intent to do more transformations.

This particular set of rules works by replacing a candidate function call with the barized version, plus the additional intent "replace" to transform the list. The other two transformations realize that intent: they transform "replace" away by processing the list elements one at a time, pushing the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop.)

Your specific example may vary somewhat, since you really were not precise about the details.

Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST", because this is handy).
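
The overall "modify the tree, then prettyprint it" pipeline can be sketched in plain JavaScript. This is not DMS and not ANTLR: the node shapes are hypothetical and simply mirror the tree shown in the question, the tree is built by hand instead of by a parser, and the 'bar' argument is hard-coded as an assumption. The transform is procedural rather than pattern-directed.

```javascript
// Hypothetical node shapes mirroring the tree in the question.
const ast = {
  type: "CallExpression",
  callee: { type: "Identifier", name: "foo" },
  args: [
    {
      type: "ArrayLiteral",
      elements: ["a", "b", "c"].map((value) => ({ type: "StringLiteral", value })),
    },
  ],
};

// Transform: foo([...strings]) becomes foo('bar', [0, 1, ...]).
// Returns the extracted strings; 'bar' is hard-coded for illustration.
function barize(node) {
  if (node.type !== "CallExpression" || node.callee.name !== "foo") return [];
  const array = node.args[0];
  if (!array || array.type !== "ArrayLiteral") return [];
  const extracted = array.elements.map((element) => element.value);
  array.elements = array.elements.map((_, i) => ({ type: "NumberLiteral", value: i }));
  node.args.unshift({ type: "StringLiteral", value: "bar" });
  return extracted;
}

// Prettyprinter: regenerates source text from the (modified) tree.
function print(node) {
  switch (node.type) {
    case "CallExpression":
      return `${print(node.callee)}(${node.args.map(print).join(", ")})`;
    case "Identifier":
      return node.name;
    case "ArrayLiteral":
      return `[${node.elements.map(print).join(", ")}]`;
    case "StringLiteral":
      return `'${node.value}'`;
    case "NumberLiteral":
      return String(node.value);
    default:
      throw new Error(`Unknown node type: ${node.type}`);
  }
}

const extracted = barize(ast);
const output = print(ast) + ";";
console.log(extracted); // [ 'a', 'b', 'c' ]
console.log(output);    // foo('bar', [0, 1, 2]);
```

Note that the printer regenerates all text from the tree, so the original whitespace and comments are lost unless the tree records them. That is part of why a real prettyprinter is more work than it first appears.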

You can see the complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.

Other program transformation systems include TXL and Stratego. We see DMS as the industrial-strength version of these, in which we have built all this infrastructure, including many standard language parsers and prettyprinters.

+6

Ira Baxter Jul 23 '12 at 13:14


Have you looked at the string template library? It was written by the same person who wrote ANTLR, and the two are intended to work together. It sounds like it would do what you are looking for, i.e. output matched grammar rules as formatted text.

Here is an article on translating through ANTLR

+2

Dave turvey Jul 23 '12 at 8:45


Jacob · Accepted Answer · 2012-07-24T04:35:15+0000

So, it turns out that I can actually use a rewriting tree grammar and insert or replace tokens using a TokenRewriteStream. On top of that, it is really easy to do. My code resembles the following:

    // Lex and parse through a TokenRewriteStream so edits can be recorded later.
    var charStream = new ANTLRInputStream(stream);
    var lexer = new JavaScriptLexer(charStream);
    var tokenStream = new TokenRewriteStream(lexer);
    var parser = new JavaScriptParser(tokenStream);
    var program = parser.program().Tree as Program;

    var dependencies = new List<IModule>();

    // Find the single top-level call whose first child has the text "foo".
    var functionCall = (
        from callExpression in program.Children.OfType<CallExpression>()
        where callExpression.Children[0].Text == "foo"
        select callExpression
    ).Single();
    var argList = functionCall.Children[1] as ArgumentList;
    var array = argList.Children[0] as ArrayLiteral;

    // Insert the extra 'bar' argument after the ArgumentList node's token.
    tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");

    // Replace each string literal's token with its index in the array.
    for (var i = 0; i < array.Children.Count(); i++)
    {
        tokenStream.Replace(
            (array.Children[i] as StringLiteral).Token.TokenIndex,
            i.ToString());
    }

    // Render the original text with the recorded edits applied.
    var rewrittenCode = tokenStream.ToString();
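
The idea this relies on, recording edits against token indices and replaying them when the text is rendered, can be sketched independently of ANTLR. The JavaScript below is an illustration only: `RewriteBuffer` and `tokenizeKeepingWhitespace` are hypothetical names and do not reproduce the TokenRewriteStream API, and the token indices are found by scanning the tokens rather than by walking a tree.

```javascript
// Hypothetical tokenizer that keeps whitespace as tokens, so joining
// all tokens reproduces the source text exactly.
function tokenizeKeepingWhitespace(source) {
  return source.match(/\s+|[A-Za-z_$][\w$]*|'[^']*'|\d+|[^\s\w]/g) || [];
}

// Records edits by token index without mutating the token list.
class RewriteBuffer {
  constructor(tokens) {
    this.tokens = tokens;
    this.replacements = new Map(); // token index -> replacement text
    this.insertions = new Map();   // token index -> text inserted after it
  }

  replace(index, text) {
    this.replacements.set(index, text);
  }

  insertAfter(index, text) {
    this.insertions.set(index, (this.insertions.get(index) || "") + text);
  }

  // Replays the recorded edits over the untouched tokens.
  toString() {
    return this.tokens
      .map((tok, i) =>
        (this.replacements.has(i) ? this.replacements.get(i) : tok) +
        (this.insertions.get(i) || ""))
      .join("");
  }
}

const source = "foo(['a', 'b', 'c']);";
const tokens = tokenizeKeepingWhitespace(source);
const buffer = new RewriteBuffer(tokens);

// Insert the extra argument right after the opening parenthesis.
buffer.insertAfter(tokens.indexOf("("), "'bar', ");

// Replace each string literal token with its position among the literals.
let next = 0;
tokens.forEach((tok, i) => {
  if (tok.startsWith("'")) buffer.replace(i, String(next++));
});

const rewritten = buffer.toString();
console.log(rewritten); // foo('bar', [0, 1, 2]);
```

Because untouched tokens are emitted verbatim, everything outside the edited spots (spacing, punctuation) survives unchanged, which is the main advantage of this approach over regenerating text from a tree.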
