Writing a code formatting tool for a programming language

I’m exploring the possibility of writing a code formatting tool for the Apex language, Salesforce.com's variation on Java, and perhaps for VisualForce, a tag-based markup language.

I have no idea where to start, other than feeling / knowing that writing a parser from scratch is probably not the best approach.

I have only a superficial understanding of what ANTLR is and what it does, but conceptually I imagine that you can “train” ANTLR to understand the syntax of Apex. I could then get a structured version of the code in a data structure (an AST?), which I could then walk to create correctly formatted code.

Is this the right concept? Is ANTLR the tool for this? Are there any links to a brief overview? I am looking to invest a few days in this task, not months, and I'm not sure that it is even remotely achievable.

+7
4 answers

Since Apex syntax is similar to Java, I would take a look at Eclipse JDT. Adapt its Java grammar to match Apex, and follow the same formatting rules and options. This is more than a few days of work.

+1

Stephen Herod wrote:

... I imagine that you can “train” ANTLR to understand the syntax of Apex ...

What do you mean by "train" ANTLR? "Train" as in artificial intelligence (training a neural network)? If so, then you are mistaken.

Stephen Herod wrote:

... get a structured version of the code in a data structure (an AST?), which I could then walk to create correctly formatted code.

Is this the right concept? Is Antlr a tool for this?

Yes, more or less. You write a grammar that describes the language you want to parse. ANTLR then generates a lexer (tokenizer) and a parser from the grammar file. You can let the parser create an AST from your input source, and then walk the AST and emit (custom) output / code.
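To make the tokenize, parse, walk-and-emit pipeline concrete, here is a minimal sketch in plain Python. It does not use ANTLR: the hand-written tokenizer and parser stand in for what ANTLR would generate from a grammar, and the toy language (statements ending in `;`, blocks in `{ }`) and all function names are assumptions made for illustration only.

```python
import re

# Words/numbers, the structural characters { } ;, or a run of other punctuation.
TOKEN = re.compile(r"\s*(\w+|[{};]|[^\s\w{};]+)")

def tokenize(src):
    """Split source text into a flat list of token strings."""
    src = src.strip()
    pos, tokens = 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_items(tokens, i=0):
    """Build a tiny AST: ('stmt', words) or ('block', header_words, children).

    Returns (nodes, next_index). There is no error reporting in this sketch.
    """
    items, words = [], []
    while i < len(tokens):
        t = tokens[i]
        if t == ";":
            items.append(("stmt", words))
            words = []
            i += 1
        elif t == "{":
            children, i = parse_items(tokens, i + 1)
            items.append(("block", words, children))
            words = []
        elif t == "}":
            if words:  # unterminated statement before the closing brace
                items.append(("stmt", words))
            return items, i + 1
        else:
            words.append(t)
            i += 1
    if words:  # unterminated statement at end of input
        items.append(("stmt", words))
    return items, i

def emit(items, depth=0, indent="    "):
    """Walk the AST and produce one formatted line per statement or brace."""
    pad = indent * depth
    lines = []
    for node in items:
        if node[0] == "stmt":
            lines.append(pad + " ".join(node[1]) + ";")
        else:
            _, header, children = node
            lines.append(pad + " ".join(header + ["{"]))
            lines.extend(emit(children, depth + 1, indent))
            lines.append(pad + "}")
    return lines

def format_source(src):
    ast, _ = parse_items(tokenize(src))
    return "\n".join(emit(ast))

print(format_source("class Foo{int x;  int y;}"))
```

The emitter simply joins tokens with single spaces, so real spacing rules (around parentheses, operators, and so on) would need per-construct logic; that is where most of the effort in a real formatter goes.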

Stephen Herod wrote:

... I am looking to invest a few days in this task, not months, and I'm not sure that it is even dimly achievable.

Well, I don’t know you, of course, but I would say that writing a grammar for a language like Java, and then emitting output while walking the AST, is not feasible in a few days, especially for someone new to ANTLR. I am quite familiar with ANTLR, and I could not do this in just a few days. Please note that I am only talking about the "parsing part"; after you have done this, you will need to integrate it into some kind of text editor. All this looks more like a project of several months, not weeks, let alone a few days.

So, in short, if all you want to do is write a custom code highlighter, ANTLR is not the best choice.

You could look at Xtext, which uses ANTLR under the hood. To quote their website:

With Xtext, you can easily create your own programming languages and domain-specific languages (DSLs). The framework supports the development of language infrastructures, including compilers and interpreters, as well as full Eclipse-based IDE integration ....

But I doubt that you will have an Eclipse plugin up and running in a few days.

Anyway, good luck!

+2

Our DMS Software Reengineering Toolkit is designed to do this as a kind of poker ante: the minimum needed to take on any software reengineering project.

DMS lets you define a grammar in a style similar to ANTLR's (and that of other parser generators). Unlike ANTLR (and other parser generators), DMS uses a GLR parser, which means you don’t have to bend the grammar rules of the language to fit the requirements of the parser generator. If you can write a context-free grammar, DMS will convert it into a parser for that language. This means you can get a working, correct grammar much faster than with typical LL or L(AL)R parser generators.

Unlike ANTLR (and other parser generators), there is no additional work to build an AST; it is built automatically. This means you spend zero time writing tree-construction rules and none debugging them.

In addition, DMS provides a prettyprinting specification language in which you lay out boxes of text stacked vertically, horizontally, or indented, and so define the “format” used to convert the AST back into completely legitimate, well-formatted source text. None of the well-known parser generators provide any help here; if you want to prettyprint the tree, you get to write a lot of custom code. See my SO answer on compiling an AST back to source for more on this. This means you can build a prettyprinter for your grammar in an (intense) day by simply annotating the grammar rules with box layout directives.
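The box idea can be illustrated with a few generic combinators. This is a sketch of the general technique only, not DMS's actual notation; the names `text`, `vertical`, `horizontal`, `indent`, `block`, and `render` are hypothetical. A box here is just a list of lines.

```python
def text(s):
    """A one-line box."""
    return [s]

def vertical(*boxes):
    """Stack boxes on top of each other."""
    return [line for box in boxes for line in box]

def indent(box, width=4):
    """Shift every line of a box to the right."""
    return [" " * width + line for line in box]

def horizontal(*boxes, sep=" "):
    """Place boxes side by side: each box starts where the previous one ended."""
    result = [""]
    for box in boxes:
        if not box:
            continue
        prefix = result[-1] + sep if result[-1] else ""
        offset = len(prefix)
        result[-1] = prefix + box[0]
        # Continuation lines of a multi-line box are aligned under its first line.
        result.extend(" " * offset + line for line in box[1:])
    return result

def block(header, body):
    """Layout rule for a braced construct: header and '{', indented body, '}'."""
    return vertical(
        horizontal(text(header), text("{")),
        indent(vertical(*body)),
        text("}"),
    )

def render(box):
    return "\n".join(box)

layout = block("class Foo", [
    text("int x;"),
    block("void bar()", [text("x = 1;")]),
])
print(render(layout))
```

In a grammar-driven tool, a rule like `block` would be attached to each grammar production, so the layout follows the tree shape rather than the original whitespace.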

The DMS lexer is very careful to record comments and "lexical formats" (was this number octal? what kind of quotes did this string have?) so that they can be regenerated correctly. Parse-to-AST followed by prettyprint-AST-to-text round-trips arbitrarily ugly code into formatted code that follows the prettyprinting rules. (This round trip is the poker ante: if you want to go further and actually manipulate the AST, you still want to regenerate valid source text.)
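One way to picture "lexical format" capture is a token that keeps both the decoded value and the original spelling, with comments kept as tokens rather than discarded. The sketch below is an assumption about the general technique, not a description of how DMS implements it; the `Token` fields and the token patterns are made up for illustration.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    kind: str      # 'comment', 'string', 'word', 'number' or 'other'
    text: str      # original spelling, kept verbatim for printing
    value: object  # decoded value, useful for analysis

PATTERN = re.compile(r"""
    (?P<comment>//[^\n]*)
  | (?P<string>"[^"\n]*"|'[^'\n]*')
  | (?P<word>[A-Za-z_]\w*)
  | (?P<number>0[xX][0-9a-fA-F]+|\d+)
  | (?P<other>\S)
""", re.VERBOSE)

def decode(kind, text):
    """Compute the token's value without losing how it was written."""
    if kind == "number":
        if text[:2].lower() == "0x":
            return int(text, 16)
        if len(text) > 1 and text[0] == "0" and set(text) <= set("01234567"):
            return int(text, 8)  # C/Java-style octal literal
        return int(text)
    if kind == "string":
        return text[1:-1]  # strip the quotes, whichever kind they were
    return text

def tokenize(src):
    return [Token(m.lastgroup, m.group(), decode(m.lastgroup, m.group()))
            for m in PATTERN.finditer(src)]

tokens = tokenize("x = 017; // mask\ny = 'a' + \"a\" + 0x1F;")
for t in tokens:
    if t.kind in ("number", "string", "comment"):
        print(t.kind, repr(t.text), repr(t.value))
```

Because `text` survives alongside `value`, a printer can emit `017` and `'a'` exactly as written instead of normalizing them to `15` and `"a"`.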

We recently built a parser/prettyprinter for EGL. It took about a week to finish. Of course, we are experts with our tools.

You can download any of several formatters built with DMS from our website to see what this kind of formatting can do.

EDIT July 2012: in the last week (5 days), using DMS, we (I personally) built a fully compliant IEC 61131-3 "Structured Text" parser (an industrial control language, Pascal-like) and a prettyprinter. (It handles all the examples from the standards documents.)

0

Reverse engineering a language to get a parser is difficult. Very difficult! Even if it is very close to Java.

But why reinvent the wheel?

There is a great implementation of Apex parsing as part of the Force.com IDE on GitHub. It's just a jar without source code, but you can use it for anything. And the developers behind it are really supportive and helpful.

We are currently building the Apex module of PMD, the well-known static code analyzer for Java, and we use Salesforce.com's internal parser. It works great.

And hey, this is an open source project, and we need some contributors ;-)

0
