What is an abstract syntax tree / is it needed?

I was interested in the design / implementation of the compiler / interpreter as long as I programmed (only 5 years), and it always seemed โ€œmagicalโ€ behind the curtains that no one talks about (I know at least 2 forums for developing the operating system, but I I donโ€™t know any community for developing a compiler / interpreter / language). In any case, I recently decided to start working on my own, in the hope of expanding my programming knowledge in general (and, hey, it's pretty fun :). Therefore, based on the limited amount of reading material I have and Wikipedia, I developed this concept of components for the compiler / interpreter:

Source Code โ†’ Lexical Analysis โ†’ Abstract Syntax Tree โ†’ Syntax Analysis โ†’ Semantic Analysis โ†’ Code Generation โ†’ Executable Code.

(I know even more to generate code and executable code, but I haven't got it yet :)

And with this knowledge, I created a very simple lexer (in Java) to enter input from a source file and output tokens to another file. An example I / O will look like this:

Input:

int a := 2 if(a = 3) then print "Yay!" endif 

Exit (from lexer):

 INTEGER A ASSIGN 2 IF L_PAR A COMP 3 R_PAR THEN PRINT YAY! ENDIF 

Personally, I think that it would be very easy to go from there to parsing / semantic analysis and, possibly, even to code generation, which makes me wonder: why use AST when it seems that my Lex does the same job? However, 100% of my sources that I use to study this topic seem adamant that this is a necessary part of any compiler / interpreter. Did I miss the point where AST really is (a tree that shows the logical flow of the program)?

TL DR: Currently, on the way to developing a compiler that has finished the lexer, it seems to me that the result will be simplified for parsing / semantic analysis, and not for AST. So why use it? Did I miss one point?

Thanks!

+6
source share
2 answers

Firstly, one thing about your list of components does not make sense. The construction of AST is (to a large extent) of parsing, so it either should not be there, or at least should be before AST.

You have a lexer. All this gives you individual tokens. In any case, you will need the actual parser, because ordinary languages โ€‹โ€‹are not interesting for programming. You cannot even (correctly) express expressions. Hell, you can't handle operator priority. Token Token does not give you:

  • The idea in which statements and expressions begin and end.
  • The idea is how instructions are grouped in blocks.
  • Idea What part of the expression has priority, associativity, etc.
  • Clear, uncluttered representation in the real structure of the program.
  • A structure that can be passed through many transformations, without any passage, knowing and having code to place that condition in if enclosed in parentheses.
  • ... in a more general sense, any understanding is above the level of one token.

Suppose you have two passes in your compiler that optimize certain types of operators for certain arguments (say, constant folding and algebraic simplifications like x - x -> 0 ). If you pass these tokens for the expression x - x * 1 , these passages will be cluttered to find out that the first part is x * 1 . And they must know this so that the transformation is not wrong (consider 1 + 2 * 3 ).

These things are complicated enough to be eligible, as it is, so you donโ€™t want to be bothered with parsing issues. That is why you first solve the problem of parsing, at a separate parsing stage. Then you can, say, replace the function call with your definition, without worrying about adding parentheses so that the meaning remains the same. You save time, you share problems, avoid repetition, you include simpler code in many other places, etc.

The parser shows all this and builds the AST, which therefore contains all this information. Without any additional data about the nodes, the AST form itself does not. 1, 2, 3 and more, for free. None of the subsequent basillions to follow should worry about this anymore.

This does not mean that you should always have AST. For fairly simple languages, you can make a one-pass compiler. Instead of generating an AST or some other intermediate representation during parsing, you emit code as you go. However, it becomes more difficult for less simple languages, and you cannot intelligently do many things (for example, 70% of all optimizations and diagnostics), and yes, I just made this number up). In general, I would not advise you to do this. There are good reasons that single-pass compilers are mostly dead. Even languages โ€‹โ€‹that allow them (such as C) are currently implemented with multiple passes and AST. This is an easy way to get started, but it will severely limit you (and the language if you run it).

+13
source

You have an AST at the wrong point in your flowchart. As a rule, the lexer output is a series of tokens (as you have on your output), and they go to the syntax analyzer that the AST generates. Thus, the result of your lexer is different from AST because they are used at different points in the compilation process and serve different purposes.

The next logical question: what is AST? So, the goal of syntax analysis is to turn a series of tokens created by a lexer into an AST (or parsing tree). AST is an intermediate representation that captures the relationship between syntax elements in a way that is easier to work with software. One way to think about this is that a text program is a one-dimensional construct and can only represent ideas as a sequence of elements, while the AST is freed from this restriction and can represent the underlying relationships between these elements in two dimensions (as is usually drawn) , or any space with a higher space, if you so decide to think about it.

For example, a binary operator has two operands, let's call them A and B. In the code, it can be written "A * B" (provided that the infix operator is another advantage of AST - it hides differences that may be syntactically important, but not semantically), but for the compiler to "understand" this expression, it must read 5 characters in sequence, and this logic can quickly become cumbersome, given the many possibilities even in a small language. However, in the AST view, we have a โ€œbinary operatorโ€ node, whose value is โ€œ*โ€, and node has two children: the values โ€‹โ€‹โ€œAโ€ and โ€œBโ€.

As your compiler project progresses, I think you will begin to see the benefits of this view.

+8
source

Source: https://habr.com/ru/post/922531/


All Articles