How to return data from a Python SAX parser?

I am trying to parse huge XML files that LXML will not validate, so I have to parse them using xml.sax.

from xml import sax

class SpamExtractor(sax.ContentHandler):
    def startElement(self, name, attrs):
        if name == "spam":
            print("We found a spam!")
            # now what?

The problem is that I do not understand how to return, or better yet yield, what this handler finds to the caller without waiting for the entire file to be parsed. So far I have been using threading.Thread and Queue.Queue, but this leads to all kinds of threading problems that distract me from the actual problem I am trying to solve.

I know that I can run the SAX parser in a separate process, but I believe there should be an easier way to get the data out. Is there one?

+5

You may want to look at xml.etree.ElementTree.iterparse, which seems closer to what you are after. Its documentation says:

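"Parses an XML section into an element tree incrementally, and reports what's going on to the user. source is a filename or file object containing XML data. events is a list of events to report back. If omitted, only "end" events are reported. parser is an optional parser instance. If not given, the standard XMLParser parser is used. Returns an iterator providing (event, elem) pairs."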

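You can then write a generator that consumes that iterator, does whatever you want with each element, and yields the values you need.

For example: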

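import xml.etree.ElementTree as ET

def find_spam(source):
    # source is a filename or file object; naming it "xml" would shadow
    # the xml package and break the iterparse lookup
    for event, element in ET.iterparse(source):
        if element.tag == "spam":
            print("We found a spam!")
            # Potentially do something
            yield element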

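Keep in mind that ElementTree is a tree-building API, while SAX is an event-driven one, so the two are used somewhat differently.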
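
Since the question is about huge files, here is a minimal sketch of the same generator idea that also releases each element once it has been handled. The function name iter_spam, the sample document, and the "brand" attribute are made up for illustration; only ET.iterparse and Element.clear come from the standard library.

```python
import io
import xml.etree.ElementTree as ET

def iter_spam(source):
    """Yield the attributes of each <spam> element as soon as it is parsed."""
    # Only "end" events are requested, so each element is complete when seen.
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "spam":
            # Copy the attributes before clearing the element.
            yield dict(element.attrib)
            # Drop the element's content so memory use stays low.
            element.clear()

# Hypothetical sample input; a real call would pass a filename or open file.
sample = io.BytesIO(b"<menu><egg/><spam brand='a'/><spam brand='b'/></menu>")
brands = [attrs["brand"] for attrs in iter_spam(sample)]
print(brands)
```

The caller receives each result while parsing is still in progress, which is the behaviour the question asks for, without threads or queues.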

+6

You can "yield" results from a ContentHandler by using a coroutine:

cosax.py:

import xml.sax

class EventHandler(xml.sax.ContentHandler):
    def __init__(self,target):
        self.target = target
    def startElement(self,name,attrs):
        # dict(attrs) copies the attributes without touching the private _attrs field
        self.target.send(('start',(name,dict(attrs))))
    def characters(self,text):
        self.target.send(('text',text))
    def endElement(self,name):
        self.target.send(('end',name))

def coroutine(func):
    def start(*args,**kwargs):
        cr = func(*args,**kwargs)
        next(cr)  # advance to the first (yield) so the coroutine can accept send()
        return cr
    return start

# example use
if __name__ == '__main__':
    @coroutine
    def printer():
        while True:
            event = (yield)
            print(event)

    xml.sax.parse("allroutes.xml",
                  EventHandler(printer()))

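Here, every time self.target.send is called, the code inside printer resumes from event = (yield). event is bound to the argument of self.target.send, and the code in printer runs until the next (yield) is reached, much like a generator.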

Whereas a generator is usually driven by a for-loop, a coroutine (here, printer) is driven by calls to send.
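
As a rough sketch of how this could be applied to the question, the coroutine below keeps only the "spam" start events instead of printing everything. The name spam_collector, the found list, and the sample XML are assumptions made for the example; the handler is a trimmed-down version of the one above.

```python
import xml.sax

class EventHandler(xml.sax.ContentHandler):
    def __init__(self, target):
        super().__init__()
        self.target = target

    def startElement(self, name, attrs):
        # Forward each start tag to the coroutine as a plain tuple.
        self.target.send(("start", (name, dict(attrs))))

def coroutine(func):
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)  # run up to the first (yield)
        return cr
    return start

@coroutine
def spam_collector(found):
    # Hypothetical consumer: stores the attributes of every <spam> element.
    while True:
        event, payload = (yield)
        if event == "start" and payload[0] == "spam":
            found.append(payload[1])

found = []
xml.sax.parseString(
    b"<menu><egg/><spam brand='a'/><spam brand='b'/></menu>",
    EventHandler(spam_collector(found)),
)
print(found)
```

Each spam element reaches the consumer as soon as the parser sees it, so the work can happen there rather than after the whole file has been read.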

+5

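As I understand it, a SAX handler is meant to do the work itself rather than just pass data back up to the caller.

For example: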

from xml import sax

class SpamExtractor(sax.ContentHandler):
    def __init__(self, canning_machine):
        sax.ContentHandler.__init__(self)
        self.canning_machine = canning_machine

    def startElement(self, name, attrs):
        if name == "spam":
            print("We found a spam!")
            self.canning_machine.can(name, attrs)
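
A minimal, self-contained sketch of this idea might look as follows. The CanningMachine class and its can method are hypothetical stand-ins for whatever object does the real work; here it simply records what it is handed.

```python
from xml import sax

class CanningMachine:
    """Hypothetical worker object; it just collects what it is given."""
    def __init__(self):
        self.cans = []

    def can(self, name, attrs):
        # Copy the attributes, since the parser owns the attrs object.
        self.cans.append((name, dict(attrs)))

class SpamExtractor(sax.ContentHandler):
    def __init__(self, canning_machine):
        super().__init__()
        self.canning_machine = canning_machine

    def startElement(self, name, attrs):
        if name == "spam":
            self.canning_machine.can(name, attrs)

machine = CanningMachine()
sax.parseString(b"<menu><egg/><spam brand='a'/></menu>", SpamExtractor(machine))
print(machine.cans)
```

The caller never gets data "returned" from the parser; it hands over an object and lets the handler push results into it as they are found.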
0

Basically, there are three ways to parse XML:

  • SAX approach: an implementation of the visitor pattern, where the parser pushes events to your code.
  • StAX approach: you pull the next item whenever you are ready for it (useful for partial parsing, e.g. reading only the SOAP header).
  • DOM approach: you load everything into an in-memory tree.

It seems you need the second one, but I am not sure whether it is available in the standard library.
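
For what it is worth, the standard library does include a pull-style API in xml.dom.pulldom. The sketch below shows how it might be used for the spam case; the function name pull_spam and the sample document are invented for the example.

```python
from xml.dom import pulldom

def pull_spam(xml_text):
    """Yield the text of each <spam> element, pulling events one at a time."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "spam":
            # Ask the stream to read ahead and attach this element's children.
            events.expandNode(node)
            yield "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )

# Hypothetical sample input; pulldom.parse accepts a filename or file object.
texts = list(pull_spam("<menu><egg>x</egg><spam>ham</spam><spam>eggs</spam></menu>"))
print(texts)
```

Because the caller drives the loop, parsing can stop early as soon as the wanted item has been found.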

0