Since you are going to use grammars and regular expressions that have already been written, the choice of tool is not critical.
You can go with Flex/Bison, for which you will find many grammars already written. Otherwise, you can go with ANTLR, which should work with C, C++, and Java without problems, and for which many grammars are also available.
You did not mention which language you intend to use for this work, so suggesting the best approach is not simple.
Keep in mind that each language has its own features; for example, the symbol table is built differently in Ruby than in C++, because declarations can be stricter or looser, and so on. So you should think carefully about what you need (and you can also explain this in your question so that I can help you better).
Regarding your two phases, I can say the following:
Tokenization is quite simple, does not require different structures for each language, and can easily be extended to support many programming languages.
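To make that concrete, here is a minimal sketch of a regex-driven tokenizer in Java. The token categories (NUMBER, IDENT, OP) are invented just for illustration; a real lexer produced by Flex or ANTLR would be generated from the language's actual lexical grammar:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal illustrative tokenizer: one regex with named groups, one pass over the input.
public class Tokenizer {
    record Token(String type, String text) {}

    // Order matters: the first alternative that matches at the current position wins.
    private static final Pattern TOKENS = Pattern.compile(
        "(?<NUMBER>\\d+)|(?<IDENT>[A-Za-z_]\\w*)|(?<OP>[+\\-*/=()])|(?<SKIP>\\s+)");

    public static List<Token> tokenize(String input) {
        List<Token> result = new ArrayList<>();
        Matcher m = TOKENS.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            if (!m.find(pos) || m.start() != pos) {
                throw new IllegalArgumentException("Unexpected character at position " + pos);
            }
            for (String group : new String[]{"NUMBER", "IDENT", "OP"}) {
                if (m.group(group) != null) {
                    result.add(new Token(group, m.group(group)));   // SKIP (whitespace) is dropped
                }
            }
            pos = m.end();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = 3 + 42"));
    }
}
```

Supporting another language then mostly means swapping in a different set of token patterns and keywords, not changing the structure of the loop.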
Parsing may be more complicated. You must build an abstract syntax tree for the program and then do whatever you want with it. If you like the OOP style, you will need a class for each node type, but node types can differ between languages because the languages are structurally different, so making something common and easily extensible to other languages can be difficult.
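Here is a minimal sketch of what "one class per node type" could look like, again in Java. All node and visitor names (NumberLit, BinaryOp, Evaluator) are invented for illustration; a real AST for C++ or Ruby would have many more node kinds:

```java
// Common base type for all AST nodes, with a visitor hook so that analyses
// live outside the node classes.
interface AstNode {
    <R> R accept(Visitor<R> v);
}

interface Visitor<R> {
    R visitNumber(NumberLit n);
    R visitBinary(BinaryOp b);
}

// One class (here: record) per node type.
record NumberLit(long value) implements AstNode {
    public <R> R accept(Visitor<R> v) { return v.visitNumber(this); }
}

record BinaryOp(String op, AstNode left, AstNode right) implements AstNode {
    public <R> R accept(Visitor<R> v) { return v.visitBinary(this); }
}

// Example analysis: a constant evaluator implemented as a visitor.
class Evaluator implements Visitor<Long> {
    public Long visitNumber(NumberLit n) { return n.value(); }
    public Long visitBinary(BinaryOp b) {
        long l = b.left().accept(this), r = b.right().accept(this);
        return switch (b.op()) {
            case "+" -> l + r;
            case "*" -> l * r;
            default -> throw new IllegalStateException("unknown operator " + b.op());
        };
    }
}
```

Keeping analyses (metrics, symbol-table construction, and so on) in visitors rather than in the node classes is one way to limit how much must change when you add node types for another language.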
For this, ANTLR beats Flex and Bison, because it offers automatic AST construction (if I remember correctly).
The main difference between the two compiler-compilers is that ANTLR uses an LL(k) parser (which is top-down), while Bison uses LALR(1) (which is bottom-up), but if you are using already-written grammars this should not matter much.
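If "top-down" versus "bottom-up" is unfamiliar, a hand-written recursive-descent fragment like the one below shows the top-down style, which is roughly how LL parsers work; the tiny expression grammar and the use of plain strings as tokens are assumptions made just to keep the example self-contained:

```java
import java.util.List;

// Top-down (recursive-descent) parser for the illustrative grammar:
//   expr -> term (('+' | '-') term)*
//   term -> NUMBER
public class DescentParser {
    private final List<String> tokens;
    private int pos = 0;

    DescentParser(List<String> tokens) { this.tokens = tokens; }

    // Each grammar rule becomes a method that starts from the top and consumes tokens left to right.
    long parseExpr() {
        long value = parseTerm();
        while (pos < tokens.size() && (tokens.get(pos).equals("+") || tokens.get(pos).equals("-"))) {
            String op = tokens.get(pos++);
            long rhs = parseTerm();
            value = op.equals("+") ? value + rhs : value - rhs;
        }
        return value;
    }

    long parseTerm() {
        return Long.parseLong(tokens.get(pos++));   // expects a NUMBER token here
    }

    public static void main(String[] args) {
        System.out.println(new DescentParser(List.of("1", "+", "2", "-", "3")).parseExpr()); // prints 0
    }
}
```

A bottom-up LALR(1) parser (what Bison generates) instead builds the tree from the leaves by shifting tokens and reducing them to grammar rules, but with ready-made grammars you rarely need to care about the difference.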
Personal advice: I have written many interpreters and compilers, but I never started from the fully featured language. The full syntax is really big, so maybe you should start with a subset, see what you can do with the tokens and the AST, and then extend it to support the full syntax.
Jack