Parsing large pseudo-xml files in python

Question

Parsing large pseudo-xml files in python

I am trying to parse a * large file (> 5 GB) of structured markup data. The data format is essentially XML, but there is no explicit root element. What is the most effective way to do this?

The problem with SAX parsers is that they require a root element, so either I have to add pseudo-potent to the data stream (is there a Java equivalent of SequenceInputStream in Python?), Or should I switch to a non-SAX-compatible event analyzer (is there sgmllib successor?)

The data structure is pretty simple. Basically a list of items:

<Document>
  <docid>1</docid>
  <text>foo</text>
</Document>
<Document>
  <docid>2</docid>
  <text>bar</text>
</Document>

* actually for iteration

+5

python xml

Peter Prettenhofer Oct 2 '09 at 11:15

source share

4 answers

( String), XML.

.

+1

ATorras 02 . '09 11:39

SAX, STax VTD-XML.

+1

vtd-xml-author 02 . '09 17:49

xml.parsers.expat - XML Expat xml.parsers.expat Python XML- Expat. xmlparser, XML. xmlparser . XML- , XML.

: http://www.python.org/doc/2.5/lib/module-xml.parsers.expat.html

0

hemanth.hm 02 . '09 18:17

liori · Accepted Answer · 2009-10-02T11:26:55+0000

http://docs.python.org/library/xml.sax.html

, 'stream' xml.sax.parse. , , , , , (, read), parse... , , , . , read... .

, :

import xml.sax
import xml.sax.handler

class PseudoStream(object):
    def read_iterator(self):
        yield '<foo>'
        yield '<bar>'
        for line in open('test.xml'):
            yield line
        yield '</bar>'
        yield '</foo>'

    def __init__(self):
        self.ri = self.read_iterator()

    def read(self, *foo):
        try:
            return self.ri.next()
        except StopIteration:
            return ''

class SAXHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print name, attrs

d = xml.sax.parse(PseudoStream(), SAXHandler())

Parsing large pseudo-xml files in python

More articles: