What is a tree parser in ANTLR and am I forced to write it?

Question

What is a tree parser in ANTLR and am I forced to write it?

I am writing lexer / parser for a small subset of C in ANTLR that will run in a Java environment. I am new to the world of language grammars and in many ANTLR textbooks, they create AST - an abstract syntax tree, I have to create it and why?

+6

programming-languages parsing antlr grammar

Hipsterzipster Mar 30 '09 at 15:41

source share

3 answers

Creating an AST with ANTLR is included in the grammar. You do not have to do this, but it is a really good tool for more complex requirements. This is a tree tutorial that you can use.

Basically, with ANTLR, when a source receives parsing, you have several options. You can generate code or AST using rewriting rules in your grammar. AST is basically a representation of the memory of your source. You can do a lot from there.

There are many for ANTLR. If you have not already done so, I would recommend getting a book .

+7

Russell Leggett Mar 30 '09 at 16:00

source share

I think the creation of an AST is optional. An abstract syntax tree is useful for subsequent processing, for example, for semantic analysis of the program being analyzed.

Only you can decide whether to create one. If your only goal is parsing, you don't need to generate it. In javacc (similar to ANTLR) there is a utility called JJTree , which allows you to generate AST. Therefore, I believe that this is optional in ANTLR as well.

+1

bruno conde Mar 30 '09 at 16:04

source share

Hipsterzipster · Accepted Answer · 2009-03-30T18:02:15+0000

I found this answer to the jGuru question written by Terence Parr who created ANTLR. I copied this explanation from the site linked here :

Only simple, so-called syntactic directional translations can be performed using actions in the parser. These types of translations can only spit out designs that are functions of the information already seen at this point in the parsing. Tree-parsers allow you to go through the intermediate form and manipulate this tree, gradually turning it into several phases of the translation into the final form, which can be easily printed as a new translation.

Imagine a simple translation task in which you want to print an html page called "There are n elements", where n is the number of identifiers found in the input stream. IDs should be printed after the title as follows:

<html> <head> <title>There are 3 items</title> </head> <body> <ol> <li>Dog</li> <li>Cat</li> <li>Velociraptor</li> </body> </html>

from input

 Dog Cat Velociraptor

So, with simple steps in your grammar, how can you figure out the name? You cannot not read the entire input. So now we know that we need an intermediate form. The best usually is the AST, which I found as it records the input structure. In this case, this is just a list, but it demonstrates my point.

Well, now you know that a tree is a good thing for anything but simple translations. Given AST, how do you get a way out of it? Imagine simple expression trees. One way is to make nodes in the tree of specific classes, such as PlusNode, IntegerNode, and so on. Then you just ask each node to print. To enter 3 + 4 you will have a tree:

+ | 3 to 4

and classes

 class PlusNode extends CommonAST { public String toString() { AST left = getFirstChild(); AST right = left.getNextSibling(); return left + " + " + right; } } class IntNode extends CommonAST { public String toString() { return getText(); } }

Given an expression tree, you can translate it back into text using t.toString (). SO what happened to this? It seems to work fine, right? This seems to work well in this case because it is simple, but I argue that even in this simple example, tree grammars are more readable and are formalized descriptions of what you encoded in PlusNode.toString ().

 expr returns [String r] { String left=null, right=null; } : #("+" left=expr right=expr) {r=left + " + " + right;} | i:INT {r=i.getText();} ;

Note that the concrete class approach ("heterogeneous AST") actually encodes the full recursive descent parser for # (+ INT INT) manually in toString (). As human parser generators, this should make you cringe.;)

The main disadvantage of the heterogeneous AST approach is that it cannot easily access contextual information. In the recursive descent parser, your context is easily accessible, as it can be passed as a parameter. You also know exactly which rule can trigger which other rule (for example, is this expression a WHILE clause or an IF clause?) By looking at the grammar. The PlusNode class above exists in a separate, isolated world, where it has no idea who will reference the toString () method. Worse, the programmer cannot tell in what context he will be called by reading it.

In conclusion, adding actions to your input parser works for very simple translations, where:

the order of the output constructions matches the order of input
all designs can be generated from information parsed to the point where you need to spit them out

In addition, you will need an intermediate form - AST - the best form. Using grammar to describe AST structure is similar to using grammar to analyze your input text. Formalized descriptions in a high-level language, such as ANTLR, are better than hand-coded parsers. Actions in the grammar of the tree have a very clear context and can conveniently access information passed from rlues that call. Tree-driven translations for multi-pass translations are also much simpler using tree grammar.

What is a tree parser in ANTLR and am I forced to write it?

More articles: