Parsing large pseudo-xml files in python

I am trying to parse a * large file (> 5 GB) of structured markup data. The data format is essentially XML, but there is no explicit root element. What is the most effective way to do this?

The problem with SAX parsers is that they require a root element, so either I have to add pseudo-potent to the data stream (is there a Java equivalent of SequenceInputStream in Python?), Or should I switch to a non-SAX-compatible event analyzer (is there sgmllib successor?)

The data structure is pretty simple. Basically a list of items:

<Document>
  <docid>1</docid>
  <text>foo</text>
</Document>
<Document>
  <docid>2</docid>
  <text>bar</text>
</Document>

* actually for iteration

+5
source share
4 answers

http://docs.python.org/library/xml.sax.html

, 'stream' xml.sax.parse. , , , , , (, read), parse... , , , . , read... .

, :

import xml.sax
import xml.sax.handler

class PseudoStream(object):
    def read_iterator(self):
        yield '<foo>'
        yield '<bar>'
        for line in open('test.xml'):
            yield line
        yield '</bar>'
        yield '</foo>'

    def __init__(self):
        self.ri = self.read_iterator()

    def read(self, *foo):
        try:
            return self.ri.next()
        except StopIteration:
            return ''

class SAXHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print name, attrs

d = xml.sax.parse(PseudoStream(), SAXHandler())
+11

( String), XML.

.

+1

SAX, STax VTD-XML.

+1

xml.parsers.expat - XML Expat xml.parsers.expat Python XML- Expat. xmlparser, XML. xmlparser . XML- , XML.

: http://www.python.org/doc/2.5/lib/module-xml.parsers.expat.html

0

All Articles