What did you not understand about DMS?
It exists.
It has accurate parsers / compiler front ends for C, C++, Java, C#, COBOL (and many other languages).
It automatically builds complete abstract syntax trees for whatever it parses. Each AST node is stamped with the file / line / column of the token that starts the node, and the final column can be computed with a DMS API call.
It has a built-in option for generating XML from the ASTs, complete with node type, source position (as above) and any associated literal value. The command-line call is:
run DMSDomainParser ++XML <path_to_your_file>
You can see what this XML result looks like for Java.
You probably don't really want to do what you are asking for. A 1000-line C program can pull in 100K lines of #include files. Each line produces between 5 and 10 nodes. The DMS XML output is succinct, and each node takes only one line, so you are looking at roughly 1 million lines of XML at 60 characters each: 60 million characters. That is a big file, and you probably do not want to process it with an XML-based tool.
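To make that estimate concrete, here is a rough back-of-the-envelope sketch in Python. The function name and the way the inputs are combined are my own illustration, not part of DMS; the only figures taken from the text are 1000 lines of program, 100K lines of includes, 5 to 10 nodes per line, and about 60 characters per XML line.

```python
def estimate_xml_size(source_lines, nodes_per_line, chars_per_xml_line=60):
    """Return (xml_lines, xml_chars), assuming one XML line per AST node."""
    xml_lines = source_lines * nodes_per_line
    return xml_lines, xml_lines * chars_per_xml_line


# 1000 lines of program text plus the 100K lines it #includes.
total_lines = 1_000 + 100_000

low = estimate_xml_size(total_lines, nodes_per_line=5)
high = estimate_xml_size(total_lines, nodes_per_line=10)

print(f"low estimate:  {low[0]:,} XML lines, {low[1]:,} characters")
print(f"high estimate: {high[0]:,} XML lines, {high[1]:,} characters")
```

Under these assumptions the range is about half a million to one million XML lines, or roughly 30 to 60 million characters, for a single small program.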
DMS itself provides a vast amount of infrastructure for manipulating the ASTs it builds: traversal, pattern matching (against patterns written essentially in source form), source-to-source transformations, control flow, data flow, points-to analysis, global call graphs. You will have a surprisingly hard time replicating all this machinery, and you will likely need it to do anything interesting.
Moral: It is much better to use something like DMS to manipulate the AST directly than to fight with XML.
Full disclosure: I am the architect behind DMS.
Ira Baxter