Loading local fragments into the DOM when parsing a large XML file in SAX (Java)

I have an xml file that I would avoid loading everything into memory. As everyone knows, for such a file, I’d better use a SAX analyzer (which will go through the file and trigger events if something relevant is found.)

My current problem is that I would like to process the file in a piece, which means:

  • Parse the file and find the corresponding tag (node)
  • Load this tag entirely into memory (for example, we will do it in the DOM)
  • Does the process execute this object (this local fragment)
  • When I finished with a piece, release it and continue to 1. (to the "end of file")

In an ideal world, I am looking for something like this:

// 1. Create a parser and set the file to load
      IdealParser p = new IdealParser("BigFile.xml");
// 2. Set an XPath to define the interesting nodes
      p.setRelevantNodesPath("/path/to/relevant/nodes");
// 3. Add a handler to callback the right method once a node is found
      p.setHandler(new Handler(){
// 4. The method callback by the parser when a relevant node is found
      void aNodeIsFound(saxNode aNode)
   {
   // 5. Inflate the current node i.e. load it (and all its content) in memory
         DomNode d = aNode.expand();
   // 6. Do something with the inflated node (method to be defined somewhere)
         doThingWithNode(d);
    }
   });
// 7. Start the parser
      p.start();

"sax node" ( ...) .

- Java, ?

+5
4

ok , , , :

:

try 
        {
            /* CREATE THE PARSER  */
            XMLParser parser      = new XMLParser();
            /* CREATE THE FILTER (THIS IS A REGEX (X)PATH FILTER) */
            XMLRegexFilter filter = new XMLRegexFilter("statements/statement");
            /* CREATE THE HANDLER WHICH WILL BE CALLED WHEN A NODE IS FOUND */
            XMLHandler handler    = new XMLHandler()
            {
                public void nodeFound(StringBuilder node, XMLStackFilter withFilter)
                {
                    // DO SOMETHING WITH THE FOUND XML NODE
                    System.out.println("Node found");
                    System.out.println(node.toString());
                }
            };
            /* ATTACH THE FILTER WITH THE HANDLER */
            parser.addFilterWithHandler(filter, handler);
            /* SET THE FILE TO PARSE */
            parser.setFilePath("/path/to/bigfile.xml");
            /* RUN THE PARSER */
            parser.parse();
        } 
        catch (Exception ex) 
        {
            ex.printStackTrace();
        }

:

  • XMLNodeFoundNotifier XMLStackFilter, /.
  • . .
  • , .
  • , ,

:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Stack;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.*;

/* IMPLEMENT THIS TO YOUR CLASS IN ORDER TO TO BE NOTIFIED WHEN A NODE IS FOUND*/
interface XMLNodeFoundNotifier {

    abstract void nodeFound(StringBuilder node, XMLStackFilter withFilter);
}

/* A SMALL HANDER USEFULL FOR EXPLICIT CLASS DECLARATION */
abstract class XMLHandler implements XMLNodeFoundNotifier {
}

/* INTERFACE TO WRITE YOUR OWN FILTER BASED ON THE CURRENT NODES STACK (PATH)*/
interface XMLStackFilter {

    abstract boolean isRelevant(Stack fullPath);
}

/* A VERY USEFULL FILTER USING REGEX AS THE PATH FILTER */
class XMLRegexFilter implements XMLStackFilter {

    Pattern relevantExpression;

    XMLRegexFilter(String filterRules) {
        relevantExpression = Pattern.compile(filterRules);
    }

    /* HERE WE ARE ARE ASK TO TELL IF THE CURRENT STACK (LIST OF NODES) IS RELEVANT
     * OR NOT ACCORDING TO WHAT WE WANT. RETURN TRUE IF THIS IS THE CASE */
    @Override
    public boolean isRelevant(Stack fullPath) {
        /* A POSSIBLE CLEVER WAY COULD BE TO SERIALIZE THE WHOLE PATH (INCLUDING
         * ATTRIBUTES) TO A STRING AND TO MATCH IT WITH A REGEX BEING THE FILTER
         * FOR NOW StackToString DOES NOT SERIALIZE ATTRIBUTES */
        String stackPath = XMLParser.StackToString(fullPath);
        Matcher m = relevantExpression.matcher(stackPath);
        return  m.matches();
    }
}

/* THE MAIN PARSER CLASS */
public class XMLParser {

    HashMap<XMLStackFilter, XMLNodeFoundNotifier> filterHandler;
    HashMap<Integer, Integer> feedingStreams;
    Stack<HashMap> currentStack;
    String filePath;

    XMLParser() {
        currentStack   = new <HashMap>Stack();
        filterHandler  = new <XMLStackFilter, XMLNodeFoundNotifier> HashMap();
        feedingStreams = new <Integer, Integer>HashMap();
    }

    public void addFilterWithHandler(XMLStackFilter f, XMLNodeFoundNotifier h) {
        filterHandler.put(f, h);
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

    /* CONVERT A STACK OF NODES TO A REGULAR PATH STRING. NOTE THAT PER DEFAULT 
     * I DID NOT ADDED THE ATTRIBUTES INTO THE PATH. UNCOMENT THE LINKS ABOVE TO
     * DO SO
     */
    public static String StackToString(Stack<HashMap> s) {
        int k = s.size();
        if (k == 0) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        out.append(s.get(0).get("tag"));
        for (int x = 1; x < k; ++x) {
            HashMap node = s.get(x);
            out.append('/').append(node.get("tag"));
            /* 
            // UNCOMMENT THIS TO ADD THE ATTRIBUTES SUPPORT TO THE PATH

            ArrayList <String[]>attributes = (ArrayList)node.get("attr");
            if (attributes.size()>0)
            {
            out.append("[");
            for (int i = 0 ; i<attributes.size(); i++)
            {
            String[]keyValuePair = attributes.get(i);
            if (i>0) out.append(",");
            out.append(keyValuePair[0]);
            out.append("=\"");
            out.append(keyValuePair[1]);
            out.append("\"");
            }
            out.append("]");
            }*/
        }
        return out.toString();
    }

    /*
     * ONCE A NODE HAS BEEN SUCCESSFULLY FOUND, WE GET THE DELIMITERS OF THE FILE
     * WE THEN RETRIEVE THE DATA FROM IT.
     */
    private StringBuilder getChunk(int from, int to) throws Exception {
        int length = to - from;
        FileReader f = new FileReader(filePath);
        BufferedReader br = new BufferedReader(f);
        br.skip(from);
        char[] readb = new char[length];
        br.read(readb, 0, length);
        StringBuilder b = new StringBuilder();
        b.append(readb);
        return b;
    }
    /* TRANSFORMS AN XSR NODE TO A HASHMAP NODE REPRESENTATION */
    public HashMap XSRNode2HashMap(XMLStreamReader xsr) {
        HashMap h = new HashMap();
        ArrayList attributes = new ArrayList();

        for (int i = 0; i < xsr.getAttributeCount(); i++) {
            String[] s = new String[2];
            s[0] = xsr.getAttributeName(i).toString();
            s[1] = xsr.getAttributeValue(i);
            attributes.add(s);
        }

        h.put("tag", xsr.getName());
        h.put("attr", attributes);

        return h;
    }

    public void parse() throws Exception {
        FileReader f         = new FileReader(filePath);
        XMLInputFactory xif  = XMLInputFactory.newInstance();
        XMLStreamReader xsr  = xif.createXMLStreamReader(f);
        Location previousLoc = xsr.getLocation();

        while (xsr.hasNext()) {
            switch (xsr.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    currentStack.add(XSRNode2HashMap(xsr));
                    for (XMLStackFilter filter : filterHandler.keySet()) {
                        if (filter.isRelevant(currentStack)) {
                            feedingStreams.put(currentStack.hashCode(), new Integer(previousLoc.getCharacterOffset()));
                        }
                    }
                    previousLoc = xsr.getLocation();
                    break;

                case XMLStreamConstants.END_ELEMENT:
                    Integer stream = null;
                    if ((stream = feedingStreams.get(currentStack.hashCode())) != null) {
                        // FIND ALL THE FILTERS RELATED TO THIS FeedingStreem AND CALL THEIR HANDLER.
                        for (XMLStackFilter filter : filterHandler.keySet()) {
                            if (filter.isRelevant(currentStack)) {
                                XMLNodeFoundNotifier h = filterHandler.get(filter);

                                StringBuilder aChunk = getChunk(stream.intValue(), xsr.getLocation().getCharacterOffset());
                                h.nodeFound(aChunk, filter);
                            }
                        }
                        feedingStreams.remove(currentStack.hashCode());
                    }
                    previousLoc = xsr.getLocation();
                    currentStack.pop();
                    break;
                default:
                    break;
            }
        }
    }
}
+1

UPDATE

API javax.xml.xpath:

package forum7998733;

import java.io.FileReader;
import javax.xml.xpath.*;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathDemo {

    public static void main(String[] args) throws Exception {
        XPathFactory xpf = XPathFactory.newInstance();
        XPath xpath = xpf.newXPath();
        InputSource xml = new InputSource(new FileReader("BigFile.xml"));
        Node result = (Node) xpath.evaluate("/path/to/relevant/nodes", xml, XPathConstants.NODE);
        System.out.println(result);
    }

}

, StAX.

Input.xml

XML:

<statements>
   <statement account="123">
      ...stuff...
   </statement>
   <statement account="456">
      ...stuff...
   </statement>
</statements>

Demo

StAX XMLStreamReader node, DOM. statement DOM, .

package forum7998733;

import java.io.FileReader;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.*;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("src/forum7998733/input.xml"));
        xsr.nextTag(); // Advance to statements element

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            DOMResult domResult = new DOMResult();
            t.transform(new StAXSource(xsr), domResult);

            DOMSource domSource = new DOMSource(domResult.getNode());
            StreamResult streamResult = new StreamResult(System.out);
            t.transform(domSource, streamResult);
        }
    }

}

<?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="123">
      ...stuff...
   </statement><?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="456">
      ...stuff...
   </statement>
+5

SAX... , StAX (Streaming API XML) . XMLEventReader , , . ( XPath, /) node, String . , " " .

XMLEvents , XMLEventWriter, - , , StringWriter ByteArrayOutputStream. , XML, "" , DOM, placeholder DocumentBuilder .

, XPath. , node, . , - XPath .

StAX , , , , SAX.

: XSLT. XSLT - . , . , . ( ) / .

, XSLT-. Java, , XML- , . , . DOM ( node), , . XSLT-.

, , . JAXP, Xerces + Xalan, .

XSLT XPath 1.0, , XSLT Java. , , - Java, . , . . .

, XML Java, , . , ... , - .

!

: , StAX, . , , , :

package staxdom;

import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.Stack;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DOMExtractor {

    private final Set<String> paths;
    private final XMLInputFactory inputFactory;
    private final XMLOutputFactory outputFactory;
    private final DocumentBuilderFactory docBuilderFactory;
    private final Stack<QName> activeStack = new Stack<QName>();

    private boolean active = false;
    private String currentPath = "";

    public DOMExtractor(final Set<String> paths) {

        this.paths = Collections.unmodifiableSet(new HashSet<String>(paths));
        inputFactory = XMLInputFactory.newFactory();
        outputFactory = XMLOutputFactory.newFactory();
        docBuilderFactory = DocumentBuilderFactory.newInstance();

    }

    public void parse(final InputStream input) throws XMLStreamException, ParserConfigurationException, SAXException, IOException {

        final XMLEventReader reader = inputFactory.createXMLEventReader(input);
        XMLEventWriter writer = null;
        StringWriter buffer = null;
        final DocumentBuilder builder = docBuilderFactory.newDocumentBuilder();

        XMLEvent currentEvent = reader.nextEvent();

        do {

            if(active)
                writer.add(currentEvent);

            if(currentEvent.isEndElement()) {

                if(active) {

                    activeStack.pop();

                    if(activeStack.isEmpty()) {
                        writer.flush();
                        writer.close();
                        final Document doc;
                        final StringReader docReader = new StringReader(buffer.toString());
                        try {
                            doc = builder.parse(new InputSource(docReader));
                        } finally {
                            docReader.close();
                        }
                        //TODO: use doc
                        //Next bit is only for demo...
                        outputDoc(doc);
                        active = false;
                        writer = null;
                        buffer = null;
                    }

                }

                int index;
                if((index = currentPath.lastIndexOf('/')) >= 0)
                    currentPath = currentPath.substring(0, index);

            } else if(currentEvent.isStartElement()) {

                final StartElement start = (StartElement)currentEvent;
                final QName qName = start.getName();
                final String local = qName.getLocalPart();

                currentPath += "/" + local;

                if(!active && paths.contains(currentPath)) {

                    active = true;

                    buffer = new StringWriter();
                    writer = outputFactory.createXMLEventWriter(buffer);

                    writer.add(currentEvent);

                }

                if(active)
                    activeStack.push(qName);

            }

            currentEvent = reader.nextEvent();

        } while(!currentEvent.isEndDocument());

    }

    private void outputDoc(final Document doc) {


        try {
            final Transformer t = TransformerFactory.newInstance().newTransformer();
            t.transform(new DOMSource(doc), new StreamResult(System.out));
            System.out.println("");
            System.out.println("");
        } catch(TransformerException ex) {
            ex.printStackTrace();
        }

    }

    public static void main(String[] args) {

        final Set<String> paths = new HashSet<String>();
        paths.add("/root/one");
        paths.add("/root/three/embedded");

        final DOMExtractor me = new DOMExtractor(paths);

        InputStream stream = null;
        try {
            stream = DOMExtractor.class.getResourceAsStream("sample.xml");
            me.parse(stream);
        } catch(final Exception e) {
            e.printStackTrace();
        } finally {
            if(stream != null)
                try {
                    stream.close();
                } catch(IOException ex) {
                    ex.printStackTrace();
                }
        }

    }

}

sample.xml( ):

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <one>
        <two>this is text</two>
        look, I can even handle mixed!
    </one>
    ... not sure what to do with this, though
    <two>
        <willbeignored/>
    </two>
    <three>
        <embedded>
            <and><here><we><go>
                Creative Commons Legal Code

                Attribution 3.0 Unported

                    CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
                    LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN
                    ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
                    INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
                    REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR
                    DAMAGES RESULTING FROM ITS USE.

                License

                THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE
                COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY
                COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS
                AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.

                BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE
                TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY
                BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS
                CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND
                CONDITIONS.
            </go></we></here></and>
        </embedded>
    </three>
</root>

EDIT 2: Blaise Doughan , StAXSource. . , StAX. . StAX "" , , , , .

+4

A bit since I did SAX, but what you want to do is process each of the tags until you find the end tag for the group you want to process, then start your process, clear it and find the next start tag.

0
source

All Articles