I need to follow an XML file that is written by a tool running all day. However, the XML file is only properly completed and closed at the end of the day.
The constraints are the same as for XML stream processing:
- Parse an incomplete XML file on the fly and trigger actions.
- Keep track of the last position in the file to avoid reprocessing it from the very beginning.
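To make the two constraints above concrete, here is a minimal sketch of how they might fit together on Python 3.4+, where `xml.etree.ElementTree.XMLPullParser` accepts data in pieces. The class name `XmlTail`, the method `poll` and the record layout are hypothetical; the element names come from the sample XML further down.

```python
import xml.etree.ElementTree as ET


class XmlTail:
    """Hypothetical helper: incrementally parse a growing XML file."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # byte position of the first unread byte
        # Only "end" events are requested: an element is complete at that point.
        self.parser = ET.XMLPullParser(events=("end",))

    def poll(self):
        """Feed newly appended bytes and return the completed <fileobject> records."""
        with open(self.path, "rb") as f:
            f.seek(self.offset)      # skip everything already parsed
            data = f.read()
            self.offset = f.tell()   # remember where to resume next time
        if data:
            self.parser.feed(data)
        records = []
        for _event, elem in self.parser.read_events():
            if elem.tag != "fileobject":
                continue
            flow = elem.find("tcpflow")
            records.append({
                "filename": elem.findtext("filename"),
                "filesize": int(elem.findtext("filesize")),
                "tcpflow": dict(flow.attrib) if flow is not None else {},
            })
            elem.clear()  # drop the children to keep memory use bounded
        return records
```

`poll()` could be called from whatever notices that the file changed (a timer, or a file-modification callback). Note that the saved offset is only meaningful together with the in-memory parser state: after a restart of the process, the file would have to be read again from the beginning.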
In the answer to "Need to read XML files as a stream using BeautifulSoup in Python", slezica suggests xml.sax, xml.etree.ElementTree and cElementTree. However, my attempts to use xml.etree.ElementTree and cElementTree were unsuccessful. There are also xml.dom, xml.parsers.expat and lxml, but I do not see any support for on-the-fly parsing in them.
I need more explicit examples.
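For reference, `xml.parsers.expat` does appear to accept a document in pieces: `Parse(data, isfinal)` can be called repeatedly with `isfinal` set to false. The following is only a small sketch of that call pattern; the helper name `make_incremental_parser` and the hard-coded chunks are illustrative assumptions, not part of any library.

```python
import xml.parsers.expat


def make_incremental_parser(on_start, on_end):
    """Hypothetical helper: build an expat parser wired to two callbacks."""
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    return parser


events = []
parser = make_incremental_parser(
    lambda name, attrs: events.append(("start", name)),
    lambda name: events.append(("end", name)),
)

# The chunks split the document at arbitrary places, even inside a tag.
chunks = (
    b"<dfxml xmloutputversion='1.0'><crea",
    b"tor version='1.0'><program>TCPFLOW</pro",
    b"gram>",
)
for chunk in chunks:
    parser.Parse(chunk, False)  # False: more data will follow later

print(events)
```

The document is never finalized here (no call with `isfinal` set to true), so the missing closing tags do not raise an error.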
I am currently using Python 2.7 on Linux, but I will migrate to Python 3.x, so pointers to relevant new features in Python 3.x are also welcome. I also use watchdog to detect modifications of the XML file, so the solution may optionally reuse the watchdog mechanism. Optionally, it should also support Windows.
Please provide a solution that is easy to understand and maintain. If it gets too complicated, I can just use tell() / seek() to move around inside the file, do a naive text search in the raw XML and finally extract the values using basic regular expressions.
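For comparison, the fallback described above might look roughly like the sketch below. It is deliberately naive: it assumes that `<fileobject>` blocks are never nested and carry no attributes, as in the sample further down. The function name `read_new_fileobjects` and the regular expressions are illustrative assumptions.

```python
import re

# Naive patterns based on the sample XML; this is not a general XML parser.
FILEOBJECT_RE = re.compile(br"<fileobject>(.*?)</fileobject>", re.DOTALL)
FILENAME_RE = re.compile(br"<filename>(.*?)</filename>")
FILESIZE_RE = re.compile(br"<filesize>(\d+)</filesize>")


def read_new_fileobjects(path, offset):
    """Return (records, new_offset) for the complete blocks found after `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)  # resume where the previous call stopped
        data = f.read()
    records = []
    consumed = 0
    for match in FILEOBJECT_RE.finditer(data):
        body = match.group(1)
        name = FILENAME_RE.search(body)
        size = FILESIZE_RE.search(body)
        records.append((
            name.group(1).decode("utf-8") if name else None,
            int(size.group(1)) if size else None,
        ))
        consumed = match.end()
    # Advance only past complete blocks, so that a trailing partial block
    # is read again on the next call.
    return records, offset + consumed
```

The caller keeps the returned offset and passes it back in on the next call; unlike a parser object, this state is a single integer that could be written to disk between runs.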
XML example:
    <dfxml xmloutputversion='1.0'>
      <creator version='1.0'>
        <program>TCPFLOW</program>
        <version>1.4.6</version>
      </creator>
      <configuration>
        <fileobject>
          <filename>file1</filename>
          <filesize>288</filesize>
          <tcpflow packets='12' srcport='1111' dstport='2222' family='2' />
        </fileobject>
        <fileobject>
          <filename>file2</filename>
          <filesize>352</filesize>
          <tcpflow packets='12' srcport='3333' dstport='4444' family='2' />
        </fileobject>
        <fileobject>
          <filename>file3</filename>
          <filesize>456</filesize>
          ...
    ...
My first test, using SAX, failed:
    import xml.sax

    class StreamHandler(xml.sax.handler.ContentHandler):

        def startElement(self, name, attrs):
            print 'start: name=', name

        def endElement(self, name):
            print 'end: name=', name
            if name == 'root':
                raise StopIteration

    if __name__ == '__main__':
        parser = xml.sax.make_parser()
        parser.setContentHandler(StreamHandler())
        with open('f.xml') as f:
            parser.parse(f)
Shell:
    $ while read line; do echo $line; sleep 1; done <i.xml >f.xml &
    ...
    $ ./test-using-sax.py
    start: name= dfxml
    start: name= creator
    start: name= program
    end: name= program
    start: name= version
    end: name= version
    Traceback (most recent call last):
      File "./test-using-sax.py", line 17, in <module>
        parser.parse(f)
      File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 107, in parse
        xmlreader.IncrementalParser.parse(self, source)
      File "/usr/lib64/python2.7/xml/sax/xmlreader.py", line 125, in parse
        self.close()
      File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 220, in close
        self.feed("", isFinal = 1)
      File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 214, in feed
        self._err_handler.fatalError(exc)
      File "/usr/lib64/python2.7/xml/sax/handler.py", line 38, in fatalError
        raise exception
    xml.sax._exceptions.SAXParseException: report.xml:15:0: no element found
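The traceback suggests that `parse()` treats the end of the file as the end of the document: it calls `close()`, which finalizes the parser and fails because the root element is still open. A possible variation, sketched below, is to call the parser's `feed()` method directly and never call `close()` while the file is still growing. The names `CollectingHandler` and `feed_new_data` are hypothetical, and the handler only records events in place of real actions.

```python
import xml.sax
import xml.sax.handler


class CollectingHandler(xml.sax.handler.ContentHandler):
    """Records element events; a stand-in for the real actions."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def feed_new_data(parser, f):
    """Feed whatever was appended to the open binary file `f` since the last call."""
    data = f.read()  # continues from the file object's current position
    if data:
        parser.feed(data)  # unlike parse(), feed() does not finalize the document
    return len(data)
```

A caller would create the parser once with `xml.sax.make_parser()`, attach the handler with `setContentHandler()`, keep the file open in binary mode, and call `feed_new_data(parser, f)` each time the file is reported as modified.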