The main trick is to recognize that parsing, no matter how it is performed, happens in incremental steps, including reading the tokens one at a time.
At each incremental step, part of the AST can be built by combining the AST fragments produced by earlier incremental steps. This is a recursive idea, and it bottoms out at building the AST leaf nodes for tokens as they are scanned. This basic idea underlies virtually all AST-building parsers.
If you build a recursive descent parser, you are in effect building a cooperating system of recursive procedures, each of which recognizes one nonterminal in the grammar. For pure parsing, each procedure simply returns a boolean meaning "nonterminal (not) recognized".
To build an AST with a recursive descent parser, each of these procedures instead returns two values: a boolean "recognized", and, if recognized, an AST constructed (somehow) for the nonterminal. (A common hack is to return a single pointer that is null for "not recognized" and points to the constructed AST if "recognized".) The AST for one procedure is built by combining the ASTs returned by the subprocedures it calls. This is trivial for leaf procedures, which read an input token and can immediately build a tree.
The downside of all this is that you must hand-code the recursive descent, and augment it with the tree-building steps. In the grand scheme of things, this is actually quite easy to do for small grammars.
For the OP's example, assume we have this grammar:
    GOAL = ASSIGNMENT
    ASSIGNMENT = LHS '=' RHS ';'
    LHS = IDENTIFIER
    RHS = IDENTIFIER | NUMBER
OK, here is our recursive descent parser:
    boolean parse_Goal()
    {  if parse_Assignment()
          then return true
          else return false
    }

    boolean parse_Assignment()
    {  if not parse_LHS()
          then return false
       if not parse_equalsign()
          then throw SyntaxError // because there are no viable alternatives from here
       if not parse_RHS()
          then throw SyntaxError
       if not parse_semicolon()
          then throw SyntaxError
       return true
    }

    boolean parse_LHS()
    {  if parse_IDENTIFIER()
          then return true
          else return false
    }

    boolean parse_RHS()
    {  if parse_IDENTIFIER()
          then return true
       if parse_NUMBER()
          then return true
          else return false
    }

    boolean parse_equalsign()
    {  if TestInputAndAdvance("=")  // this can check for a token instead
          then return true
          else return false
    }

    boolean parse_semicolon()
    {  if TestInputAndAdvance(";")
          then return true
          else return false
    }

    boolean parse_IDENTIFIER()
    {  if TestInputForIdentifier()
          then return true
          else return false
    }

    boolean parse_NUMBER()
    {  if TestInputForNumber()
          then return true
          else return false
    }
Now let's modify it to build an abstract syntax tree:
    AST* parse_Goal()  // note: we choose to return a null pointer for "false"
    {  node = parse_Assignment()
       if node != NULL
          then return node
          else return NULL
    }

    AST* parse_Assignment()
    {  LHSnode = parse_LHS()
       if LHSnode == NULL
          then return NULL
       EqualNode = parse_equalsign()
       if EqualNode == NULL
          then throw SyntaxError // because there are no viable alternatives from here
       RHSnode = parse_RHS()
       if RHSnode == NULL
          then throw SyntaxError
       SemicolonNode = parse_semicolon()
       if SemicolonNode == NULL
          then throw SyntaxError
       return makeASTNode(ASSIGNMENT, LHSnode, RHSnode)
    }

    AST* parse_LHS()
    {  IdentifierNode = parse_IDENTIFIER()
       if IdentifierNode != NULL
          then return IdentifierNode
          else return NULL
    }

    AST* parse_RHS()
    {  RHSnode = parse_IDENTIFIER()
       if RHSnode != NULL
          then return RHSnode
       RHSnode = parse_NUMBER()
       if RHSnode != NULL
          then return RHSnode
          else return NULL
    }

    AST* parse_equalsign()
    {  if TestInputAndAdvance("=")  // this can check for a token instead
          then return makeASTNode("=")
          else return NULL
    }

    AST* parse_semicolon()
    {  if TestInputAndAdvance(";")
          then return makeASTNode(";")
          else return NULL
    }

    AST* parse_IDENTIFIER()
    {  text = TestInputForIdentifier()
       if text != NULL
          then return makeASTNode("IDENTIFIER", text)
          else return NULL
    }

    AST* parse_NUMBER()
    {  text = TestInputForNumber()
       if text != NULL
          then return makeASTNode("NUMBER", text)
          else return NULL
    }
I'm obviously glossing over some details here, but I trust the reader will have no trouble filling them in.
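For concreteness, here is one way the AST-building pseudocode above could look as runnable Python. The token-test helpers (counterparts of TestInputAndAdvance, TestInputForIdentifier, TestInputForNumber, which the pseudocode leaves unspecified) are my own minimal stand-ins over a crudely tokenized string; a real lexer would be more careful.

```python
import re

class Node:
    """A minimal AST node: a kind tag plus children (a leaf's child is its token text)."""
    def __init__(self, kind, *children):
        self.kind = kind
        self.children = children

class Parser:
    def __init__(self, text):
        # Crude tokenizer: identifiers, numbers, and the punctuation we need.
        self.tokens = re.findall(r"[A-Za-z_]\w*|\d+|[=;]", text)
        self.pos = 0

    # --- the glossed-over token helpers ---
    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def test_and_advance(self, tok):                  # stand-in for TestInputAndAdvance
        if self.peek() == tok:
            self.pos += 1
            return True
        return False

    def test_identifier(self):                        # stand-in for TestInputForIdentifier
        t = self.peek()
        if t is not None and re.fullmatch(r"[A-Za-z_]\w*", t):
            self.pos += 1
            return t
        return None

    def test_number(self):                            # stand-in for TestInputForNumber
        t = self.peek()
        if t is not None and t.isdigit():
            self.pos += 1
            return t
        return None

    # --- the recursive descent procedures; None plays the role of the null pointer ---
    def parse_goal(self):
        return self.parse_assignment()

    def parse_assignment(self):
        lhs = self.parse_lhs()
        if lhs is None:
            return None
        if not self.test_and_advance("="):
            raise SyntaxError("expected '='")         # no viable alternatives from here
        rhs = self.parse_rhs()
        if rhs is None:
            raise SyntaxError("expected identifier or number")
        if not self.test_and_advance(";"):
            raise SyntaxError("expected ';'")
        return Node("ASSIGNMENT", lhs, rhs)           # combine the sub-ASTs

    def parse_lhs(self):
        return self.parse_identifier()

    def parse_rhs(self):
        return self.parse_identifier() or self.parse_number()

    def parse_identifier(self):
        text = self.test_identifier()
        return Node("IDENTIFIER", text) if text is not None else None

    def parse_number(self):
        text = self.test_number()
        return Node("NUMBER", text) if text is not None else None
```

Parsing `x = 42 ;` with this sketch yields an ASSIGNMENT node whose two children are the IDENTIFIER and NUMBER leaves, mirroring the pseudocode's makeASTNode calls.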
Parser generator tools such as JavaCC and ANTLR basically generate recursive descent parsers, and have facilities for building trees that work very much in this style.
Parser generator tools that build bottom-up parsers (YACC, Bison, GLR, ...) also build AST nodes in the same style. However, there is no set of recursive functions; instead, these tools manage a stack of tokens seen so far and nonterminals already reduced. The AST nodes are built on a parallel stack; when a reduction occurs, the AST nodes on the portion of the stack covered by the reduction are combined to produce a nonterminal AST node that replaces them. This even happens with "zero-length" stack segments for grammar rules that are empty, causing AST nodes (typically for an "empty list" or "missing option") to appear seemingly from nowhere.
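To make the parallel-stack discipline concrete, here is a hand-rolled sketch (not generated-parser code, and the names are my own) of the toy assignment grammar parsed bottom-up: tokens are shifted onto a symbol stack, and each reduction pops the AST fragments covered by the rule and pushes one combined node in their place.

```python
def parse_assignment_bottom_up(tokens):
    symbols = []   # parse stack: grammar symbols seen / reduced so far
    asts = []      # parallel stack: one AST fragment per parse-stack entry

    def shift(kind, text):
        symbols.append(kind)
        asts.append((kind, text))

    def reduce(lhs, rule_len, builder):
        # Pop the AST fragments covered by the rule, push one combined node.
        children = asts[-rule_len:]
        del symbols[-rule_len:], asts[-rule_len:]
        symbols.append(lhs)
        asts.append(builder(children))

    for tok in tokens:
        if tok.isidentifier():
            shift("IDENTIFIER", tok)
            if symbols == ["IDENTIFIER"]:
                reduce("LHS", 1, lambda c: c[0])            # LHS = IDENTIFIER
            elif symbols[-2:] == ["=", "IDENTIFIER"]:
                reduce("RHS", 1, lambda c: c[0])            # RHS = IDENTIFIER
        elif tok.isdigit():
            shift("NUMBER", tok)
            if symbols[-2:] == ["=", "NUMBER"]:
                reduce("RHS", 1, lambda c: c[0])            # RHS = NUMBER
        elif tok in ("=", ";"):
            shift(tok, tok)
            if tok == ";":
                # ASSIGNMENT = LHS '=' RHS ';'  -- keep only the LHS and RHS children
                reduce("ASSIGNMENT", 4, lambda c: ("ASSIGNMENT", c[0], c[2]))
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

    if symbols != ["ASSIGNMENT"]:
        raise SyntaxError("input did not reduce to an ASSIGNMENT")
    return asts[0]
```

A real LR parser would drive the shifts and reductions from generated tables rather than the ad hoc checks above, but the stack bookkeeping for AST nodes is the same.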
For small languages, writing recursive descent parsers that build trees is quite practical.
The problem with real languages (whether old and gray like COBOL, or hot and shiny like Scala) is that the number of grammar rules is quite large, complicated by the intricacies of the language and the insistence of whatever language committee is behind it on continually adding the new goodies offered by other languages ("language envy"; see the evolutionary race among Java, C#, and C++). At that scale, writing a recursive descent parser by hand gets out of control, and one tends to use parser generators instead. But even with a parser generator, writing all the custom code to assemble the AST nodes is a real battle (and we haven't discussed what it takes to design a good "abstract" syntax, as opposed to the first thing that comes to mind). Maintaining the grammar rules and the AST-building code gets progressively harder with scale and ongoing evolution. (If your language is successful, within a year you will want to change it.) So even writing the AST-building rules becomes awkward.
Ideally, one would just like to write a grammar and get a parser and a tree. You can do this with some recent parser generators: our DMS Software Reengineering Toolkit accepts full context-free grammars and automatically builds an AST, with no extra work on the part of the grammar engineer; it has been doing this since 1995. The ANTLR folks finally caught on, and as of 2014, ANTLR4 offers this option too.
Last point: having a parser (even one with an AST) is hardly a solution to whatever actual problem you set out to solve. It's just a foundation, and, much to the shock of most parsing newbies, it is the smallest part of a tool that manipulates code. Google my essay on Life After Parsing (or check my bio) for more details.