How to remove XML elements / nodes from an XML file larger than the available RAM?

I am trying to figure out how to remove an element (and its children) from a very large XML file in PHP (latest version).

I know that I can use DOM or SimpleXML, but that would require loading the whole document into memory.

I have looked at the XML writer / reader / parser functions and searched around, but nothing seems to address this issue (all the answers recommend DOM or SimpleXML). That can't be right; am I missing something?

The closest thing I found is this (C#):

You can use XmlReader to read your XML sequentially (ReadOuterXml might be useful in your case to read a whole node at a time). Then use XmlWriter to write out all the nodes you want to keep. (Removing nodes from large XML files)

Really? Is that the approach? Do I have to copy the entire huge file?

Is there no other way?

Update:

As suggested, I could read the data with PHP's XMLReader or a SAX parser, possibly buffering it, and write / dump + append everything I want to keep to a new file.

But is this approach practical?

I have experience splitting huge XML files into smaller parts, essentially with the proposed method, and it took a very long time to complete the process.

My dataset is currently not large enough to give me an idea of how this will perform. I can only assume the result would be the same (a very slow process).

Does anyone have experience putting this into practice?

1 answer

There are several ways to process large documents incrementally, so that you do not have to load the entire structure into memory at once. In any case, yes: you will have to write back the elements you want to keep and omit the ones you want to delete.

  • PHP has XMLReader, an implementation of pull parsing (there is a sketch of this approach right after this list). Explanation:

    A pull parser creates an iterator that sequentially visits the various elements, attributes, and data items in an XML document. Code that uses this iterator can test the current item (to tell, for example, whether it is a start element, an end element, or text) and inspect its attributes (local name, namespace, values of XML attributes, text value, etc.), and it can also move the iterator to the next item. The code can thus extract information from the document as it traverses it.

  • Or you can use a SAX XML parser (a second sketch, further below, shows this variant). Explanation:

    The Simple API for XML (SAX) is a lexical, event-driven interface in which a document is read serially and its contents are reported as callbacks to various methods on a handler object of the user's design. SAX is fast and efficient to implement, but hard to use for extracting information at random from the XML, since it tends to burden the application author with keeping track of which part of the document is being processed.
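
Here is a rough, untested sketch of the pull-parsing approach with XMLReader + XMLWriter. The element name <item>, the id attribute, the value "42" and the file names are just placeholders for whatever you actually need to remove, and DTDs / processing instructions are not copied:

    <?php
    // Stream-copy huge-input.xml to filtered-output.xml, dropping every
    // <item> element whose id attribute is "42" (placeholder condition).
    $reader = new XMLReader();
    $reader->open('huge-input.xml');

    $writer = new XMLWriter();
    $writer->openUri('filtered-output.xml');
    $writer->startDocument('1.0', 'UTF-8');

    $more = $reader->read();
    while ($more) {
        // Skip the unwanted element and its whole subtree in one step.
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->name === 'item'
            && $reader->getAttribute('id') === '42') {
            $more = $reader->next();   // jumps past all of its children
            continue;
        }

        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                $writer->startElement($reader->name);
                while ($reader->moveToNextAttribute()) {
                    $writer->writeAttribute($reader->name, $reader->value);
                }
                $reader->moveToElement();
                if ($reader->isEmptyElement) {
                    $writer->endElement();
                }
                break;
            case XMLReader::END_ELEMENT:
                $writer->endElement();
                break;
            case XMLReader::TEXT:
            case XMLReader::CDATA:
            case XMLReader::WHITESPACE:
            case XMLReader::SIGNIFICANT_WHITESPACE:
                $writer->text($reader->value);
                break;
            case XMLReader::COMMENT:
                $writer->writeComment($reader->value);
                break;
        }

        $more = $reader->read();
    }

    $writer->endDocument();
    $writer->flush();
    $reader->close();

Memory use stays roughly flat because only the current node is ever held in memory; the cost is that you rewrite the whole file once, which is exactly what the C# answer you quoted describes.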

Many people prefer the pull approach, but either one meets your requirements. Keep in mind that "large" is relative: if the document fits into memory, it will almost always be easier to use DOM. But for really, really big documents, that may simply not be an option.
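
For completeness, the same filtering with the expat-based SAX functions might look roughly like this (again a simplified, untested sketch with placeholder names; comments, CDATA sections and processing instructions are not copied):

    <?php
    // Remove <item id="42"> subtrees using PHP's SAX (expat) interface.
    $skipDepth = 0;                       // > 0 while inside a removed subtree
    $out = fopen('filtered-output.xml', 'w');
    fwrite($out, "<?xml version=\"1.0\"?>\n");

    $start = function ($parser, $name, array $attrs) use (&$skipDepth, $out) {
        if ($skipDepth > 0 || ($name === 'item' && ($attrs['id'] ?? null) === '42')) {
            $skipDepth++;                 // entering (or nested inside) a removed subtree
            return;
        }
        $attrText = '';
        foreach ($attrs as $k => $v) {
            $attrText .= ' ' . $k . '="' . htmlspecialchars($v, ENT_QUOTES | ENT_XML1) . '"';
        }
        fwrite($out, '<' . $name . $attrText . '>');
    };

    $end = function ($parser, $name) use (&$skipDepth, $out) {
        if ($skipDepth > 0) {
            $skipDepth--;
            return;
        }
        fwrite($out, '</' . $name . '>');
    };

    $text = function ($parser, $data) use (&$skipDepth, $out) {
        if ($skipDepth === 0) {
            fwrite($out, htmlspecialchars($data, ENT_XML1));
        }
    };

    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler($parser, $start, $end);
    xml_set_character_data_handler($parser, $text);

    // Feed the file to the parser in small chunks so memory stays flat.
    $in = fopen('huge-input.xml', 'r');
    while (!feof($in)) {
        xml_parse($parser, (string) fread($in, 8192), feof($in));
    }
    fclose($in);
    fclose($out);
    xml_parser_free($parser);

Case folding is switched off here so element and attribute names arrive exactly as they appear in the document; by default expat would upper-case them.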

