Massive number of XML edits

I need to load a medium-sized XML file into memory, make many random-access changes to it (possibly hundreds of thousands), and then write the result to stdout. Most of these changes will be node insertions / deletions, as well as character insertions / deletions within text nodes. The XML files will be small enough to fit in memory, but large enough that I don't want to keep multiple copies.

I am trying to settle on an architecture / libraries and am looking for suggestions.

Here is what I have come up with so far -

I have been looking for the ideal XML library for this, and so far I have not found anything that fits the bill. Libraries typically store nodes in Haskell lists and text in Haskell Data.Text objects. This allows only linear-time node and text inserts, and I believe the text will have to be completely rewritten on every insert / delete.

I think that storing both nodes and text in a Sequence seems to be the way to go. It supports log(N) inserts and deletes, and only requires rewriting a small part of the tree on each change. However, none of the XML libraries is based on this, so I will either have to write my own library, or just use one of the other libraries to parse and then convert the result to my own form (given how simple it is to parse XML, I would almost as soon do the former, rather than keep a shadow parse of everything).
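To make the idea concrete, here is a minimal sketch of text editing on top of Data.Sequence from the containers package. The names SeqText, insertText, deleteText and render are made up for illustration, and holding text as a Seq Char is an assumption, not something any XML library does.

```haskell
import Data.Foldable (toList)
import Data.Sequence (Seq)
import qualified Data.Sequence as Seq

-- Text held as a sequence of characters (hypothetical representation).
type SeqText = Seq Char

-- Insert a string at character offset i by splitting and rejoining.
-- The split costs O(log(min(i, n - i))); only the spine around the
-- split point is rebuilt, the rest of the tree is shared.
insertText :: Int -> String -> SeqText -> SeqText
insertText i s t = before <> Seq.fromList s <> after
  where
    (before, after) = Seq.splitAt i t

-- Delete n characters starting at offset i.
deleteText :: Int -> Int -> SeqText -> SeqText
deleteText i n t = before <> Seq.drop n rest
  where
    (before, rest) = Seq.splitAt i t

-- Flatten back to an ordinary String for output.
render :: SeqText -> String
render = toList
```

For example, render (insertText 5 "," (Seq.fromList "hello world")) gives "hello, world". For a single element, Seq.insertAt and Seq.deleteAt do the same job without the explicit split.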

I briefly considered the possibility that this may be a rare case where Haskell is not the best tool. But then I realized that mutability does not give much of an advantage here, because my modifications are not character replacements, but rather additions / removals. If I wrote this in C, I would still need to store the strings / nodes in some kind of tree structure in order to avoid large byte moves on each insert / delete. (Actually, Haskell probably has some of the best tools for this problem, but I would be open to suggestions of a better language for this task if you feel there is one.)

Summarizing -

  • Is Haskell the right choice for this?

  • Does any Haskell library support fast (log(N)) node / text inserts and deletes?

  • Is Sequence the best data structure for storing a list of elements (in my case, nodes and characters) for fast insertion and deletion?

+2
haskell
1 answer

I will answer my own question -

I decided to wrap the Text.XML tree with a custom object that stores nodes and text in Data.Sequence objects. Since Haskell is lazy, I believe it only holds the Text.XML data in memory temporarily, node by node as the data streams in, and that it is garbage collected before I actually start any real work modifying the Sequence trees.
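Roughly, the wrapper looks like the sketch below. This is a simplified illustration, not my actual code: ListNode is a hypothetical stand-in for the list-based tree a parser hands back (the real Text.XML types also carry attributes, namespaces and so on), and SeqNode, fromListNode, insertChild, deleteChild and render are names invented for this example.

```haskell
import Data.Foldable (toList)
import Data.Sequence (Seq)
import qualified Data.Sequence as Seq

-- Hypothetical stand-in for a parser's list-based tree.
data ListNode
  = ListElement String [ListNode]
  | ListContent String

-- Edit-friendly mirror: children and characters both live in Seq.
data SeqNode
  = SeqElement String (Seq SeqNode)
  | SeqContent (Seq Char)
  deriving (Eq, Show)

-- One-way conversion from the parsed tree to the Seq-backed tree.
fromListNode :: ListNode -> SeqNode
fromListNode (ListElement name kids) =
  SeqElement name (Seq.fromList (map fromListNode kids))
fromListNode (ListContent s) = SeqContent (Seq.fromList s)

-- Insert a child at position i of an element; content nodes are returned unchanged.
insertChild :: Int -> SeqNode -> SeqNode -> SeqNode
insertChild i child (SeqElement name kids) =
  SeqElement name (Seq.insertAt i child kids)
insertChild _ _ node = node

-- Delete the child at position i of an element; content nodes are returned unchanged.
deleteChild :: Int -> SeqNode -> SeqNode
deleteChild i (SeqElement name kids) = SeqElement name (Seq.deleteAt i kids)
deleteChild _ node = node

-- Serialize back to text (no escaping, attributes omitted in this sketch).
render :: SeqNode -> String
render (SeqContent cs) = toList cs
render (SeqElement name kids) =
  "<" ++ name ++ ">" ++ concatMap render kids ++ "</" ++ name ++ ">"
```

Edits deeper in the tree need a path of child indices on top of this, which the sketch leaves out.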

(It would be nice if someone here could confirm that this is how Haskell works internally. In any case, I have implemented it, and the performance seems reasonable though not great: about 30 thousand inserts / deletes per second, which should do.)

+1