Reading an XML file while it is being written (in Python)

I need to monitor an XML file written by a tool that runs all day. The XML file is only completed and properly closed at the end of the day.

Same constraints as XML stream processing:

  • Parse the incomplete XML file on the fly and trigger actions.
  • Keep track of the last position in the file to avoid reprocessing it from the very beginning.

In an answer to "Need to read XML files as a stream using BeautifulSoup in Python", slezica suggests xml.sax, xml.etree.ElementTree and cElementTree. But my attempts to use xml.etree.ElementTree and cElementTree were unsuccessful. There are also xml.dom, xml.parsers.expat and lxml, but I do not see any support for on-the-fly parsing in them.

I need clearer examples...

I am currently using Python 2.7 on Linux, but I will migrate to Python 3.x, so tips on new Python 3.x features are welcome. I also use watchdog to detect modifications to the XML file, so the solution may reuse the watchdog mechanism. Windows support is optional.

Please provide solutions that are easy to understand and maintain. If it gets too complicated, I can just use tell() / seek() to move around inside the file, do a naive text search in the raw XML, and finally extract the values with a basic regular expression.
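
For comparison, the tell() / seek() fallback could look roughly like the sketch below. It is written for Python 3 and assumes that every record is a <fileobject> element as in the XML example in this question; the function name read_new_fileobjects and the regular expressions are hypothetical, chosen only for illustration. It reads just the bytes after a remembered offset and advances that offset past the last complete </fileobject>, so a record that is cut off in the middle is picked up on the next call.

```python
import re

# Byte patterns, so that match positions are byte offsets usable with seek().
FILEOBJECT_RE = re.compile(rb"<fileobject>(.*?)</fileobject>", re.DOTALL)
FILENAME_RE = re.compile(rb"<filename>(.*?)</filename>")
FILESIZE_RE = re.compile(rb"<filesize>(\d+)</filesize>")


def read_new_fileobjects(path, offset=0):
    """Return (records, new_offset) for complete <fileobject> blocks after offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()

    records = []
    consumed = 0  # bytes of `data` covered by complete records
    for match in FILEOBJECT_RE.finditer(data):
        body = match.group(1)
        name = FILENAME_RE.search(body)
        size = FILESIZE_RE.search(body)
        records.append({
            "filename": name.group(1).decode("utf-8") if name else None,
            "filesize": int(size.group(1)) if size else None,
        })
        consumed = match.end()

    # Anything after the last complete record is re-read on the next call.
    return records, offset + consumed
```

Each call returns the new offset to pass to the next call, for example from a watchdog on_modified callback. This is text matching rather than XML parsing, so it would not cope with nested <fileobject> elements, comments or CDATA sections.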


XML example:

    <dfxml xmloutputversion='1.0'>
      <creator version='1.0'>
        <program>TCPFLOW</program>
        <version>1.4.6</version>
      </creator>
      <configuration>
        <fileobject>
          <filename>file1</filename>
          <filesize>288</filesize>
          <tcpflow packets='12' srcport='1111' dstport='2222' family='2' />
        </fileobject>
        <fileobject>
          <filename>file2</filename>
          <filesize>352</filesize>
          <tcpflow packets='12' srcport='3333' dstport='4444' family='2' />
        </fileobject>
        <fileobject>
          <filename>file3</filename>
          <filesize>456</filesize>
          ...
          ...

My first test, using SAX, failed:

    import xml.sax

    class StreamHandler(xml.sax.handler.ContentHandler):

        def startElement(self, name, attrs):
            print 'start: name=', name

        def endElement(self, name):
            print 'end: name=', name
            if name == 'root':
                raise StopIteration

    if __name__ == '__main__':
        parser = xml.sax.make_parser()
        parser.setContentHandler(StreamHandler())
        with open('f.xml') as f:
            parser.parse(f)

Shell:

    $ while read line; do echo $line; sleep 1; done <i.xml >f.xml &
    ...
    $ ./test-using-sax.py
    start: name= dfxml
    start: name= creator
    start: name= program
    end: name= program
    start: name= version
    end: name= version
    Traceback (most recent call last):
      File "./test-using-sax.py", line 17, in <module>
        parser.parse(f)
      File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 107, in parse
        xmlreader.IncrementalParser.parse(self, source)
      File "/usr/lib64/python2.7/xml/sax/xmlreader.py", line 125, in parse
        self.close()
      File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 220, in close
        self.feed("", isFinal = 1)
      File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 214, in feed
        self._err_handler.fatalError(exc)
      File "/usr/lib64/python2.7/xml/sax/handler.py", line 38, in fatalError
        raise exception
    xml.sax._exceptions.SAXParseException: report.xml:15:0: no element found
stream xml-parsing on-the-fly
2 answers

Since yesterday, I have found Peter Gibson's answer about the undocumented xml.etree.ElementTree.XMLTreeBuilder._parser.EndElementHandler.

This example is similar to the other one, but uses xml.etree.ElementTree (and watchdog).

It does not work when ElementTree is replaced with cElementTree :-/

    import time
    import watchdog.events
    import watchdog.observers
    import xml.etree.ElementTree

    class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):

        def __init__(self):
            watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
            self.xml_file = None
            self.parser = xml.etree.ElementTree.XMLTreeBuilder()

            def end_tag_event(tag):
                node = self.parser._end(tag)
                print 'tag=', tag, 'node=', node

            self.parser._parser.EndElementHandler = end_tag_event

        def on_modified(self, event):
            if not self.xml_file:
                self.xml_file = open(event.src_path)
            buffer = self.xml_file.read()
            if buffer:
                self.parser.feed(buffer)

    if __name__ == '__main__':
        observer = watchdog.observers.Observer()
        event_handler = XmlFileEventHandler()
        observer.schedule(event_handler, path='.')
        try:
            observer.start()
            while True:
                time.sleep(10)
        finally:
            observer.stop()
            observer.join()

While the script is running, do not forget to touch an XML file, or simulate on-the-fly writing with this one-liner:

 while read line; do echo $line; sleep 1; done <in.xml >out.xml & 

For information, xml.etree.ElementTree.iterparse does not seem to support a file that is still being written. My test code:

    from __future__ import print_function, division
    import xml.etree.ElementTree

    if __name__ == '__main__':
        context = xml.etree.ElementTree.iterparse('f.xml', events=('end',))
        for action, elem in context:
            print(action, elem.tag)

The output:

    end program
    end version
    end creator
    end filename
    end filesize
    end tcpflow
    end fileobject
    end filename
    end filesize
    end tcpflow
    end fileobject
    end filename
    end filesize
    Traceback (most recent call last):
      File "./iter.py", line 9, in <module>
        for action, elem in context:
      File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1281, in next
        self._root = self._parser.close()
      File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
        self._raiseerror(v)
      File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
        raise err
    xml.etree.ElementTree.ParseError: no element found: line 20, column 0
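
For the planned move to Python 3: since Python 3.4 the standard library has xml.etree.ElementTree.XMLPullParser, which takes data through feed() and hands back whatever events are complete through read_events(), without requiring the end of the document. A minimal sketch, with the chunk boundaries picked arbitrarily to mimic a partially written file (the helper feed_and_collect is a hypothetical name, not part of the library):

```python
import xml.etree.ElementTree as ET


def feed_and_collect(parser, chunk):
    """Feed one chunk and return the (event, tag) pairs that are now available."""
    parser.feed(chunk)
    return [(event, elem.tag) for event, elem in parser.read_events()]


# Only report closing tags, like events=('end',) with iterparse.
parser = ET.XMLPullParser(events=("end",))

# The first chunk stops in the middle of a tag, as a file being written would.
first = feed_and_collect(parser, "<dfxml><creator><program>TCPFLOW</program><vers")
second = feed_and_collect(parser, "ion>1.4.6</version></creator>")

print(first)
print(second)
```

In a watchdog handler, the chunk would be whatever self.xml_file.read() returns in on_modified, and close() would only be called once the writer has finished the file.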

Three hours after posting my question, I had received no answer. But I finally implemented the simple example I was looking for.

It is inspired by saaj's answer and is based on xml.sax and watchdog.

    from __future__ import print_function, division
    import time
    import watchdog.events
    import watchdog.observers
    import xml.sax

    class XmlStreamHandler(xml.sax.handler.ContentHandler):

        def startElement(self, tag, attributes):
            print(tag, 'attributes=', attributes.items())
            self.tag = tag

        def characters(self, content):
            print(self.tag, 'content=', content)

    class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):

        def __init__(self):
            watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
            self.file = None
            self.parser = xml.sax.make_parser()
            self.parser.setContentHandler(XmlStreamHandler())

        def on_modified(self, event):
            if not self.file:
                self.file = open(event.src_path)
            self.parser.feed(self.file.read())

    if __name__ == '__main__':
        observer = watchdog.observers.Observer()
        event_handler = XmlFileEventHandler()
        observer.schedule(event_handler, path='.')
        try:
            observer.start()
            while True:
                time.sleep(10)
        finally:
            observer.stop()
            observer.join()

While the script is running, do not forget to touch an XML file, or simulate on-the-fly writing with the following command:

 while read line; do echo $line; sleep 1; done <in.xml >out.xml & 
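
The same idea can go one step further and collect records instead of printing them. The sketch below assumes Python 3 and the <fileobject> layout from the question; the class name FileObjectHandler and its attributes are hypothetical. It relies on feed() never calling close(), so an unfinished document does not raise the "no element found" error seen with parse().

```python
import xml.sax


class FileObjectHandler(xml.sax.ContentHandler):
    """Collects one dict per completed <fileobject> element."""

    def __init__(self):
        super().__init__()
        self.records = []      # completed records, oldest first
        self._current = None   # record being built, or None outside <fileobject>
        self._text = []        # text pieces of the element being read

    def startElement(self, name, attrs):
        self._text = []
        if name == "fileobject":
            self._current = {}
        elif name == "tcpflow" and self._current is not None:
            # Copy the attributes (packets, srcport, ...) into the record.
            self._current.update(attrs.items())

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        self._text.append(content)

    def endElement(self, name):
        if self._current is None:
            return
        if name in ("filename", "filesize"):
            self._current[name] = "".join(self._text).strip()
        elif name == "fileobject":
            self.records.append(self._current)
            self._current = None


handler = FileObjectHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Chunks split at arbitrary places, as successive reads of a growing file would be.
parser.feed("<dfxml><configuration><fileobject><filename>fi")
parser.feed("le1</filename><filesize>288</filesize>")
parser.feed("<tcpflow packets='12' srcport='1111' dstport='2222' family='2' />"
            "</fileobject><fileobject>")

print(handler.records)
```

Only the first record is reported here, because the second <fileobject> has been opened but not yet closed.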