Is there a fast XML parser in Python that lets me get the start of a tag as a byte offset in the stream?

I work with potentially huge XML files containing complex trace information from my projects.

I would like to create indexes for these XML files so that I can quickly find subsections of an XML document without having to load the whole file into memory.

If I created a "shelf" index that records, for example, that "books by author Joe" are located at offsets [22322, 35446, 54545], then I could simply open the XML file as a regular text file, seek to those offsets, and hand the data to one of the DOM parsers that accepts a file or strings.
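A minimal sketch of that lookup step, assuming the indexed element never nests inside an element of the same name. The helper `load_fragment`, its `tag` parameter and the sample data are made up for illustration; only `minidom.parseString` is a real library call.

```python
import io
from xml.dom import minidom


def load_fragment(f, offset, tag):
    """Parse the element that starts at byte `offset` of the binary file `f`.

    Assumption: `tag` (bytes) does not nest inside itself, so the first
    closing tag found after `offset` ends the element.
    """
    f.seek(offset)
    closing = b"</" + tag + b">"
    buf = b""
    while closing not in buf:
        chunk = f.read(4096)
        if not chunk:
            raise ValueError("closing tag not found")
        buf += chunk
    end = buf.index(closing) + len(closing)
    return minidom.parseString(buf[:end])


# Hypothetical sample data standing in for a huge trace file.
data = (b'<lib>\n<book author="Joe"><t>A</t></book>\n'
        b'<book author="Ann"><t>B</t></book>\n</lib>')
second = data.index(b'<book author="Ann"')   # offset an index would supply
doc = load_fragment(io.BytesIO(data), second, b"book")
author = doc.documentElement.getAttribute("author")
```

Only the bytes between the offset and the closing tag are read and parsed, so memory use depends on the size of the fragment rather than the size of the file.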

The part I haven't figured out yet is how to parse the XML quickly and build such an index.

So what I need is a fast SAX parser that lets me find the offset of a start tag in the file along with the start events. That way I can parse a subsection of the XML together with its starting point in the document, extract the key information, and store the key and the offset in the shelf index.
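The storing half of that plan could look roughly like this with the standard `shelve` module. The helper name `add_offset`, the key format and the file name are assumptions for illustration, not something from the question.

```python
import os
import shelve
import tempfile


def add_offset(index, key, offset):
    """Append `offset` to the list stored under `key` (hypothetical helper)."""
    offsets = index.get(key, [])
    offsets.append(offset)
    # Reassign the list: without writeback=True a shelf does not
    # notice changes made in place to a stored object.
    index[key] = offsets


# Demo against a throwaway shelf file; the key format is made up.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trace_index")
    with shelve.open(path) as index:
        add_offset(index, "author:Joe", 22322)
        add_offset(index, "author:Joe", 35446)
    with shelve.open(path) as index:   # reopen to show the data persisted
        stored = index["author:Joe"]
```

In a real indexer, `add_offset` would be called from the SAX start-element callback once the key and the tag's offset are known.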

1 answer

Since locators return line and column numbers instead of offsets, you need a small wrapper that keeps track of where the lines end. A simplified example (it may have some off-by-one errors ;-):

    import io
    import re
    from xml import sax
    from xml.sax import handler

    relinend = re.compile(rb'\n')

    txt = b'''<foo>
      <tit>Bar</tit>
      <baz>whatever</baz>
    </foo>'''

    stm = io.BytesIO(txt)


    class LocatingWrapper(object):
        def __init__(self, f):
            self.f = f
            self.linelocs = [0]  # byte offset at which each line starts
            self.curoffs = 0     # number of bytes handed to the parser so far

        def read(self, *a):
            data = self.f.read(*a)
            # a new line starts right after every '\n'
            linends = (m.end() for m in relinend.finditer(data))
            self.linelocs.extend(x + self.curoffs for x in linends)
            self.curoffs += len(data)
            return data

        def close(self):
            # the SAX driver closes its input stream when parsing ends
            self.f.close()

        def where(self, loc):
            # line numbers are 1-based, column numbers 0-based
            return self.linelocs[loc.getLineNumber() - 1] + loc.getColumnNumber()


    locstm = LocatingWrapper(stm)


    class Handler(handler.ContentHandler):
        def setDocumentLocator(self, loc):
            self.loc = loc

        def startElement(self, name, attrs):
            print('%s@%s:%s (%s)' % (name, self.loc.getLineNumber(),
                                     self.loc.getColumnNumber(),
                                     locstm.where(self.loc)))


    sax.parse(locstm, Handler())

Of course, you don't need to keep all the line offsets around: to save memory, you can discard the "old" ones (those below the last line queried), but then you need to make linelocs a dict, etc.
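One way that pruning might be sketched, assuming lookups arrive in document order (as SAX events do) and that column numbers equal byte counts, which holds for ASCII input. The class `PrunedLineIndex` and its method names are hypothetical.

```python
import re

_NEWLINE = re.compile(rb"\n")


class PrunedLineIndex:
    """Maps (line, column) to a byte offset, forgetting lines already passed."""

    def __init__(self):
        self.starts = {1: 0}   # 1-based line number -> byte offset of line start
        self.first = 1         # lowest line number still kept
        self.next_line = 2     # number of the next line to be discovered
        self.consumed = 0      # bytes seen so far

    def feed(self, data):
        # Record where each new line starts inside this chunk.
        for m in _NEWLINE.finditer(data):
            self.starts[self.next_line] = self.consumed + m.end()
            self.next_line += 1
        self.consumed += len(data)

    def where(self, line, column):
        offset = self.starts[line] + column
        # Assumption: later queries never go back to an earlier line,
        # so everything below `line` can be dropped.
        for old in range(self.first, line):
            del self.starts[old]
        self.first = max(self.first, line)
        return offset


idx = PrunedLineIndex()
idx.feed(b"<a>\n  <b/>\n")
idx.feed(b"</a>")
offset_b = idx.where(2, 2)      # start of <b/> on line 2
kept = sorted(idx.starts)       # line 1 has been discarded
```

A wrapper's `read` method would call `feed` on every chunk it passes to the parser, and the handler would call `where` with the locator's line and column numbers.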


Source: https://habr.com/ru/post/1314854/

