The main trick is to recognize that parsing, no matter how it is performed, happens in incremental steps, including reading the tokens one at a time.
At each incremental step, part of the AST can be built by combining the AST fragments produced by earlier incremental steps. This is a recursive idea, and it bottoms out at building the AST leaf nodes for tokens as they are scanned. This basic idea underlies virtually all AST-building parsers.
If you build a recursive descent parser, you are in effect building a cooperating system of recursive procedures, each of which recognizes one nonterminal in the grammar. For pure parsing, each procedure simply returns a boolean meaning "nonterminal (not) recognized".
To build an AST with a recursive descent parser, each of these procedures instead returns two values: a boolean "recognized", and, if recognized, an AST constructed (somehow) for the nonterminal. (A common hack is to return a single pointer that is null for "not recognized" and points to the constructed AST if "recognized".) The AST for one procedure is built by combining the ASTs returned by the subprocedures it calls. This is trivial for leaf procedures, which read an input token and can immediately build a tree.
The downside of all this is that you must hand-code the recursive descent, and augment it with the tree-building steps. In the grand scheme of things, this is actually quite easy to do for small grammars.
For the OP's example, assume we have this grammar:
    GOAL = ASSIGNMENT
    ASSIGNMENT = LHS '=' RHS ';'
    LHS = IDENTIFIER
    RHS = IDENTIFIER | NUMBER
OK, here is our recursive descent parser:
    boolean parse_Goal()
    {  if parse_Assignment()
          then return true
          else return false
    }

    boolean parse_Assignment()
    {  if not parse_LHS()
          then return false
       if not parse_equalsign()
          then throw SyntaxError // because there are no viable alternatives from here
       if not parse_RHS()
          then throw SyntaxError
       if not parse_semicolon()
          then throw SyntaxError
       return true
    }

    boolean parse_LHS()
    {  if parse_IDENTIFIER()
          then return true
          else return false
    }

    boolean parse_RHS()
    {  if parse_IDENTIFIER()
          then return true
       if parse_NUMBER()
          then return true
          else return false
    }

    boolean parse_equalsign()
    {  if TestInputAndAdvance("=")  // this can check for a token instead
          then return true
          else return false
    }

    boolean parse_semicolon()
    {  if TestInputAndAdvance(";")
          then return true
          else return false
    }

    boolean parse_IDENTIFIER()
    {  if TestInputForIdentifier()
          then return true
          else return false
    }

    boolean parse_NUMBER()
    {  if TestInputForNumber()
          then return true
          else return false
    }
Now let's modify it to build an abstract syntax tree:
    AST* parse_Goal()  // note: we choose to return a null pointer for "false"
    {  node = parse_Assignment()
       if node != NULL
          then return node
          else return NULL
    }

    AST* parse_Assignment()
    {  LHSnode = parse_LHS()
       if LHSnode == NULL
          then return NULL
       EqualNode = parse_equalsign()
       if EqualNode == NULL
          then throw SyntaxError // because there are no viable alternatives from here
       RHSnode = parse_RHS()
       if RHSnode == NULL
          then throw SyntaxError
       SemicolonNode = parse_semicolon()
       if SemicolonNode == NULL
          then throw SyntaxError
       return makeASTNode(ASSIGNMENT, LHSnode, RHSnode)
    }

    AST* parse_LHS()
    {  IdentifierNode = parse_IDENTIFIER()
       if IdentifierNode != NULL
          then return IdentifierNode
          else return NULL
    }

    AST* parse_RHS()
    {  RHSnode = parse_IDENTIFIER()
       if RHSnode != NULL
          then return RHSnode
       RHSnode = parse_NUMBER()
       if RHSnode != NULL
          then return RHSnode
          else return NULL
    }

    AST* parse_equalsign()
    {  if TestInputAndAdvance("=")  // this can check for a token instead
          then return makeASTNode("=")
          else return NULL
    }

    AST* parse_semicolon()
    {  if TestInputAndAdvance(";")
          then return makeASTNode(";")
          else return NULL
    }

    AST* parse_IDENTIFIER()
    {  text = TestInputForIdentifier()
       if text != NULL
          then return makeASTNode("IDENTIFIER", text)
          else return NULL
    }

    AST* parse_NUMBER()
    {  text = TestInputForNumber()
       if text != NULL
          then return makeASTNode("NUMBER", text)
          else return NULL
    }
I'm obviously glossing over some details here, but I trust the reader will have no trouble filling them in.
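For concreteness, here is one way the AST-building pseudocode above could look as runnable Python. The token-test helpers (counterparts of TestInputAndAdvance, TestInputForIdentifier, TestInputForNumber, which the pseudocode leaves unspecified) are my own minimal stand-ins over a crudely tokenized string; a real lexer would be more careful.

```python
import re

class Node:
    """A minimal AST node: a kind tag plus children (a leaf's child is its token text)."""
    def __init__(self, kind, *children):
        self.kind = kind
        self.children = children

class Parser:
    def __init__(self, text):
        # Crude tokenizer: identifiers, numbers, and the punctuation we need.
        self.tokens = re.findall(r"[A-Za-z_]\w*|\d+|[=;]", text)
        self.pos = 0

    # --- the glossed-over token helpers ---
    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def test_and_advance(self, tok):                  # stand-in for TestInputAndAdvance
        if self.peek() == tok:
            self.pos += 1
            return True
        return False

    def test_identifier(self):                        # stand-in for TestInputForIdentifier
        t = self.peek()
        if t is not None and re.fullmatch(r"[A-Za-z_]\w*", t):
            self.pos += 1
            return t
        return None

    def test_number(self):                            # stand-in for TestInputForNumber
        t = self.peek()
        if t is not None and t.isdigit():
            self.pos += 1
            return t
        return None

    # --- the recursive descent procedures; None plays the role of the null pointer ---
    def parse_goal(self):
        return self.parse_assignment()

    def parse_assignment(self):
        lhs = self.parse_lhs()
        if lhs is None:
            return None
        if not self.test_and_advance("="):
            raise SyntaxError("expected '='")         # no viable alternatives from here
        rhs = self.parse_rhs()
        if rhs is None:
            raise SyntaxError("expected identifier or number")
        if not self.test_and_advance(";"):
            raise SyntaxError("expected ';'")
        return Node("ASSIGNMENT", lhs, rhs)           # combine the sub-ASTs

    def parse_lhs(self):
        return self.parse_identifier()

    def parse_rhs(self):
        return self.parse_identifier() or self.parse_number()

    def parse_identifier(self):
        text = self.test_identifier()
        return Node("IDENTIFIER", text) if text is not None else None

    def parse_number(self):
        text = self.test_number()
        return Node("NUMBER", text) if text is not None else None
```

Parsing `x = 42 ;` with this sketch yields an ASSIGNMENT node whose two children are the IDENTIFIER and NUMBER leaves, mirroring the pseudocode's makeASTNode calls.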
Parser generator tools such as JavaCC and ANTLR basically generate recursive descent parsers, and have facilities for building trees that work very much in this style.
Parser generator tools that build bottom-up parsers (YACC, Bison, GLR, ...) also build AST nodes in the same style. However, there is no set of recursive functions; instead, these tools manage a stack of tokens seen so far and nonterminals already reduced. The AST nodes are built on a parallel stack; when a reduction occurs, the AST nodes on the portion of the stack covered by the reduction are combined to produce a nonterminal AST node that replaces them. This even happens with "zero-length" stack segments for grammar rules that are empty, causing AST nodes (typically for an "empty list" or "missing option") to appear seemingly from nowhere.
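To make the parallel-stack discipline concrete, here is a hand-rolled sketch (not generated-parser code, and the names are my own) of the toy assignment grammar parsed bottom-up: tokens are shifted onto a symbol stack, and each reduction pops the AST fragments covered by the rule and pushes one combined node in their place.

```python
def parse_assignment_bottom_up(tokens):
    symbols = []   # parse stack: grammar symbols seen / reduced so far
    asts = []      # parallel stack: one AST fragment per parse-stack entry

    def shift(kind, text):
        symbols.append(kind)
        asts.append((kind, text))

    def reduce(lhs, rule_len, builder):
        # Pop the AST fragments covered by the rule, push one combined node.
        children = asts[-rule_len:]
        del symbols[-rule_len:], asts[-rule_len:]
        symbols.append(lhs)
        asts.append(builder(children))

    for tok in tokens:
        if tok.isidentifier():
            shift("IDENTIFIER", tok)
            if symbols == ["IDENTIFIER"]:
                reduce("LHS", 1, lambda c: c[0])            # LHS = IDENTIFIER
            elif symbols[-2:] == ["=", "IDENTIFIER"]:
                reduce("RHS", 1, lambda c: c[0])            # RHS = IDENTIFIER
        elif tok.isdigit():
            shift("NUMBER", tok)
            if symbols[-2:] == ["=", "NUMBER"]:
                reduce("RHS", 1, lambda c: c[0])            # RHS = NUMBER
        elif tok in ("=", ";"):
            shift(tok, tok)
            if tok == ";":
                # ASSIGNMENT = LHS '=' RHS ';'  -- keep only the LHS and RHS children
                reduce("ASSIGNMENT", 4, lambda c: ("ASSIGNMENT", c[0], c[2]))
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

    if symbols != ["ASSIGNMENT"]:
        raise SyntaxError("input did not reduce to an ASSIGNMENT")
    return asts[0]
```

A real LR parser would drive the shifts and reductions from generated tables rather than the ad hoc checks above, but the stack bookkeeping for AST nodes is the same.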
For small languages, writing recursive descent parsers that build trees is quite practical.
The problem with real languages (whether old and gray like COBOL, or hot and shiny like Scala) is that the number of grammar rules is quite large, complicated by the intricacies of the language and the insistence of whatever language committee is behind it on continually adding the new goodies offered by other languages ("language envy"; see the evolutionary race among Java, C#, and C++). At that scale, writing a recursive descent parser by hand gets out of control, and one tends to use parser generators instead. But even with a parser generator, writing all the custom code to assemble the AST nodes is a real battle (and we haven't discussed what it takes to design a good "abstract" syntax, as opposed to the first thing that comes to mind). Maintaining the grammar rules and the AST-building code gets progressively harder with scale and ongoing evolution. (If your language is successful, within a year you will want to change it.) So even writing the AST-building rules becomes awkward.
Ideally, one would just like to write a grammar and get a parser and a tree. You can do this with some recent parser generators: our DMS Software Reengineering Toolkit accepts full context-free grammars and automatically builds an AST, with no extra work on the part of the grammar engineer; it has been doing this since 1995. The ANTLR folks finally caught on, and as of 2014, ANTLR4 offers this option too.
Last point: having a parser (even one with an AST) is hardly a solution to whatever actual problem you set out to solve. It's just a foundation, and, much to the shock of most parsing newbies, it is the smallest part of a tool that manipulates code. Google my essay on Life After Parsing (or check my bio) for more details.