The best way to tokenize and parse programming languages in my application

I am working on a tool that will perform some simple transformations on programs (e.g. extract method). To do this, I will have to perform the first few steps of compilation (tokenization, parsing and possibly building a symbol table). I am going to start with C and then hopefully extend this to support multiple languages.
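For concreteness, here is a small before/after sketch (in C, with made-up names) of the kind of "extract method" transformation I mean:

/* Before: the summing loop lives inline in average() */
double average(const double *xs, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += xs[i];
    return sum / n;
}

/* After "extract method": the loop is pulled out into its own function */
static double sum_array(const double *xs, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += xs[i];
    return sum;
}

double average(const double *xs, int n) {
    return sum_array(xs, n) / n;
}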

My question is: what is the best way to accomplish these steps in a way that:

1.) Does not reinvent the wheel. Obviously, I don't want to write flex/bison specifications by hand. Do I just grab pre-existing specifications and work from there? Is ANTLR the way to go here?

2.) Extends to multiple languages. Obviously the lexing/parsing will be different for each language, but I would like a solution that I could easily extend to other languages, or at least a set of technologies that would make that manageable.

By the way, I am using C to write my application.

If anyone has any ideas, that would be great! Thanks!

+6
programming-languages parsing lexer
5 answers

ANTLR is hands down the best way to do any parsing here. There are two excellent books on the subject by its author that you should have: The Definitive ANTLR Reference: Building Domain-Specific Languages and Language Implementation Patterns; both are invaluable resources. ANTLR can generate the processing code in several target languages.

+7

Since you are going to use already-written grammars and regular expressions, the choice of tool is of marginal importance.

You can go with flex/bison and you will find many grammars already written for them. Otherwise, you can go with ANTLR, which works in C, C++ and Java without problems, and do the same with it.
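As a rough illustration of the flex route, here is a minimal, hypothetical token specification (the token names are made up; a real C lexer needs the full keyword, operator and literal set, which is why grabbing an existing .l/.y pair is attractive):

/* tokens.l - build with: flex tokens.l && cc lex.yy.c -o tokens */
%{
#include <stdio.h>
%}
%option noyywrap

%%
"int"|"char"|"return"        { printf("KEYWORD %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*       { printf("IDENT   %s\n", yytext); }
[0-9]+                       { printf("NUMBER  %s\n", yytext); }
[ \t\n]+                     { /* skip whitespace */ }
.                            { printf("PUNCT   %s\n", yytext); }
%%

int main(void) {
    yylex();
    return 0;
}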

You did not say which language you intend to use for this work, so suggesting the best approach is not that simple.

Keep in mind that every language has its own features; for example, the symbol table is built differently in Ruby compared to C++. That is because you can have stricter or looser declarations and so on, so you should think carefully about what you will need (and you could also explain it in your question so that I can help you better).
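For example, here is a very rough C sketch of a scoped symbol table (names and layout are purely illustrative, not from any particular compiler):

/* symtab.c - one linked list of symbols per lexical scope; scopes are
 * chained so that lookup walks outward from the innermost scope.      */
#include <stdlib.h>
#include <string.h>

typedef struct Symbol {
    const char    *name;   /* not copied here; a real table should own its strings */
    const char    *type;   /* e.g. "int", "char *" */
    struct Symbol *next;
} Symbol;

typedef struct Scope {
    Symbol       *symbols;
    struct Scope *parent;  /* enclosing scope, NULL at file scope */
} Scope;

Scope *scope_push(Scope *parent) {
    Scope *s = calloc(1, sizeof *s);
    s->parent = parent;
    return s;
}

void scope_define(Scope *s, const char *name, const char *type) {
    Symbol *sym = malloc(sizeof *sym);
    sym->name = name;
    sym->type = type;
    sym->next = s->symbols;
    s->symbols = sym;
}

Symbol *scope_lookup(Scope *s, const char *name) {
    for (; s != NULL; s = s->parent)             /* walk outward */
        for (Symbol *sym = s->symbols; sym; sym = sym->next)
            if (strcmp(sym->name, name) == 0)
                return sym;
    return NULL;
}

How much of this you actually need (nesting, types, overloading, forward references) is exactly the part that differs between languages.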

Of your two phases, I can say that:

  • Tokenization is quite simple, does not require different structures for each language, and can easily be extended to support many programming languages.

  • Parsing may be more complicated. You have to build an abstract syntax tree for the program and then do whatever you want with it. If you like the OOP style, you would use a class for each node type, but node types vary between languages because they are structurally different, so doing something generic that is easily extensible to another language is difficult (a C-style alternative is sketched just after this list).
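For instance, in C the usual substitute for one class per node type is a tag enum plus a union; something like this (the node kinds are chosen only as an illustration):

/* ast.h - a deliberately tiny set of node kinds */
typedef enum { AST_NUMBER, AST_IDENT, AST_BINOP, AST_CALL } AstKind;

typedef struct Ast {
    AstKind kind;
    union {
        long        number;                                            /* AST_NUMBER */
        const char *ident;                                              /* AST_IDENT  */
        struct { int op; struct Ast *lhs, *rhs; } binop;                /* AST_BINOP  */
        struct { struct Ast *fn; struct Ast **args; int nargs; } call;  /* AST_CALL   */
    } u;
} Ast;

Every new source language tends to need new node kinds (or new fields on existing ones), which is where the "easily extensible" part gets hard.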

For this, ANTLR beats flex and bison because it offers automatic AST construction (if I remember correctly).

The main difference between the two compiler-compilers is that ANTLR generates an LL(k) parser (top-down), while bison uses LALR(1) (bottom-up), but if you are using already-written grammars this should not matter much.
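To make the contrast concrete, here is a hypothetical bison fragment: the tree is built by hand in each semantic action (it assumes an ast.h declaring struct Ast and the constructors mknum and mkbinop), whereas ANTLR can derive the tree for you.

/* expr.y - illustrative fragment, not a full grammar */
%{
#include "ast.h"    /* assumed header: struct Ast, mknum(), mkbinop() */
int  yylex(void);
void yyerror(const char *msg);
%}

%union { long num; struct Ast *node; }
%token <num>  NUMBER
%type  <node> expr
%left '+'
%left '*'

%%
expr : expr '+' expr   { $$ = mkbinop('+', $1, $3); }
     | expr '*' expr   { $$ = mkbinop('*', $1, $3); }
     | NUMBER          { $$ = mknum($1); }
     ;
%%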

Personal advice: I have written many interpreters and compilers, and I have never started with the fully featured language. C's syntax is really big, so maybe you should start with a subset, see what you can do with the tokens and the AST, and then extend it to support the full syntax.

+2

What language do you write your program in?

I would go with ANTLR (and I am using it to parse Java, in fact). It supports many target languages and also comes with many example grammars that you get for free: http://www.antlr.org/grammar/list . Unfortunately they are not necessarily perfect (the Java grammar has no AST rules), but they give you a good start, and I believe the community is quite large for a parser generator.

The nice thing, apart from the many target languages, is that LL(*) combined with the predicates supported by ANTLR is very powerful and understandable, and the generated parsers are too.

As for extending to several languages, I suppose you mean several source languages. That is not easy, but I believe you may have some success translating them into ASTs that have as much in common as possible, and writing a general tree walker that can handle the differences between those languages. But it can be quite difficult.

Be aware, however, that the online documentation is only good once you have read the official ANTLR book and understand LL(*) and semantic and syntactic predicates.

+2

You did not specify the language, so I will just recommend this little gem I found the other day:

http://irony.codeplex.com/

It is very easy to use and even has built-in grammars for several languages (even C#). There is also pyparsing ( http://pyparsing.wikispaces.com/ ) if you want to use Python as your implementation language.

+1

Another way to go is through Eclipse. It has parsing, including error-tolerant parsing, for various languages. Eclipse has internal modularity that allows you to use this functionality without touching the IDE.

-2
