xml.sax parser and line numbers, etc.

The task is to parse a simple XML document and analyze its content by line number.

The right Python package appears to be xml.sax. But how do I use it?

After some digging in the documentation, I found:

  • The interface xmlreader.Locator has the information: getLineNumber().
  • The interface handler.ContentHandler has setDocumentLocator().

My first thought was to create a Locator, pass it to the ContentHandler, and read the information from the Locator during calls to the handler's characters() method, etc.

BUT, xmlreader.Locator is only a skeleton interface, and can only return -1 from its getLineNumber() and getColumnNumber() methods. So, as a poor user, WHAT am I to do, short of writing an entire Parser and Locator of my own?
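A quick way to see the problem is to instantiate the base class directly. This is only a minimal sketch; the variable names are illustrative, and print() is written in a form that also runs on Python 2.7.

```python
from xml.sax.xmlreader import Locator

# The base class is only an interface: every getter returns a placeholder.
loc = Locator()
line = loc.getLineNumber()      # placeholder, not a real position
column = loc.getColumnNumber()  # placeholder, not a real position
public_id = loc.getPublicId()
system_id = loc.getSystemId()

print(line, column, public_id, system_id)
```

So a Locator created by hand carries no position information; a useful one has to come from the parser.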

I'll answer my own question now.

(Well, I would have, were it not for the arbitrary, annoying rule that says I can't.)

I could not figure this out from the existing documentation (or from web searches), and was forced to read the source code for xml.sax (under /usr/lib/python2.7/xml/sax/ on my system).

The xml.sax function make_parser() creates a real Parser by default, but what kind of thing is it?
The source code shows that it is an ExpatParser, defined in expatreader.py. And... it has its own Locator, an ExpatLocator. But there is no obvious way to get at it. Much head-scratching came between this discovery and a solution:
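The claim about make_parser() can be checked without reading the source. The sketch below assumes a standard CPython build in which the expat driver is available and is the default; other builds or explicit driver lists may behave differently.

```python
import xml.sax
import xml.sax.expatreader

# Ask xml.sax for its default parser and inspect what came back.
parser = xml.sax.make_parser()
parser_class = type(parser).__name__
is_expat = isinstance(parser, xml.sax.expatreader.ExpatParser)

print(parser_class, is_expat)
```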

  • write your own ContentHandler that knows about a Locator and uses it to determine line numbers
  • create an ExpatParser with xml.sax.make_parser()
  • create an ExpatLocator, passing it the ExpatParser instance
  • create the ContentHandler, giving it this ExpatLocator
  • pass the ContentHandler to the parser's setContentHandler()
  • call parse() on the Parser.

For example:

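import sys
import xml.sax
import xml.sax.expatreader  # imported explicitly; ExpatLocator is defined here
import xml.sax.handler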

class EltHandler( xml.sax.handler.ContentHandler ):
    def __init__( self, locator ):
        xml.sax.handler.ContentHandler.__init__( self )
        self.loc = locator
        self.setDocumentLocator( self.loc )

    def startElement( self, name, attrs ): pass

    def endElement( self, name ): pass

    def characters( self, data ):
        lineNo = self.loc.getLineNumber()
        print >> sys.stdout, "LINE", lineNo, data

def spit_lines( filepath ):
    try:
        parser = xml.sax.make_parser()
        locator = xml.sax.expatreader.ExpatLocator( parser )
        handler = EltHandler( locator )
        parser.setContentHandler( handler )
        parser.parse( filepath )
    except IOError as e:
        print >> sys.stderr, e

if len( sys.argv ) > 1:
    filepath = sys.argv[1]
    spit_lines( filepath )
else:
    print >> sys.stderr, "Try providing a path to an XML file."

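Martijn Pieters points out another approach below, with some advantages. If the ContentHandler superclass initializer is called properly, the handler ends up with a private-looking member, ._locator, which should hold a proper Locator during parsing.

Advantage: you do not have to create your own Locator (or find out how to create one). Disadvantage: ._locator is an underscore-prefixed attribute, and relying on a private-looking member is sloppy.

Thanks, Martijn!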
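One possible way around that disadvantage, not taken from either post, is to override the public setDocumentLocator() hook and keep the locator in an attribute of your own. The sketch below assumes the expat-based default parser; the LineRecorder class, its attributes, and the sample document are hypothetical names chosen for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LineRecorder(ContentHandler):
    """Hypothetical handler that stores the locator the parser hands over."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.locator = None
        self.lines = []  # (line number, text) pairs

    def setDocumentLocator(self, locator):
        # Called by the parser, if it supplies a locator, before parsing starts.
        self.locator = locator

    def characters(self, data):
        text = data.strip()
        # Skip whitespace-only chunks and cope with a missing locator.
        if text and self.locator is not None:
            self.lines.append((self.locator.getLineNumber(), text))


# Illustrative in-memory document; a file path works with xml.sax.parse().
doc = b"<root>\n  <a>first</a>\n  <b>second</b>\n</root>\n"
recorder = LineRecorder()
xml.sax.parseString(doc, recorder)

for line_no, text in recorder.lines:
    print("LINE", line_no, text)
```

The locator is only meaningful while the parse is running, so the handler reads the line number inside the callback instead of keeping the locator for later.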


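The SAX parser itself is supposed to provide your content handler with a locator. The locator has to implement certain methods, but it can be any object, as long as it has the right methods. The xml.sax.xmlreader.Locator class is the interface a locator is expected to implement; if the parser provides a locator object to your handler, you can count on those 4 methods being present on it.

The parser is only encouraged to set a locator; it is not required to do so. The expat XML parser does provide one.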

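If you subclass xml.sax.handler.ContentHandler(), it provides a standard setDocumentLocator() method for you, and by the time .startDocument() on the handler is called, your content handler instance will have self._locator set: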

from xml.sax.handler import ContentHandler

class MyContentHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        # initialize your handler

    def startElement(self, name, attrs):
        loc = self._locator
        if loc is not None:
            line, col = loc.getLineNumber(), loc.getColumnNumber()
        else:
            line, col = 'unknown', 'unknown'
        print 'start of {} element at line {}, column {}'.format(name, line, col)
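To try this approach end to end, a handler of that shape can be driven with xml.sax.parseString() on a small in-memory document. This is a sketch under assumptions: PositionHandler, its positions list, and the sample document are hypothetical, and it collects results in a list so that it runs on both Python 2.7 and Python 3.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PositionHandler(ContentHandler):
    """Hypothetical variant of the handler above that collects positions."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.positions = []  # (element name, line, column) tuples

    def startElement(self, name, attrs):
        loc = self._locator  # set by the inherited setDocumentLocator()
        if loc is not None:
            self.positions.append(
                (name, loc.getLineNumber(), loc.getColumnNumber()))


# Illustrative document: one root element with two children on separate lines.
doc = b"<root>\n  <item/>\n  <item/>\n</root>\n"
handler = PositionHandler()
xml.sax.parseString(doc, handler)

for name, line, col in handler.positions:
    print("start of {} element at line {}, column {}".format(name, line, col))
```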