Welcome to Python and Stack Overflow!
It sounds like you have been reading up on lxml, and especially etree.iterparse(..), but I think your implementation approaches the problem from the wrong angle. The idea of iterparse(..) is to get away from collecting and storing data, and instead to process tags as they are read. Your readAllChildren(..) function stores everything in rowList, which grows and grows until it covers the entire document tree. I made a few changes to show what is happening:
from lxml import etree

def parseXml(context, attribList):
    for event, element in context:
        print "%s element: %s" % (event, element)
        fieldMap = {}
        rowList = []
        readAttribs(element, fieldMap, attribList)
        readAllChildren(element, fieldMap, attribList, rowList)
        for row in rowList:
            yield row
        element.clear()

def readAttribs(element, fieldMap, attribList):
    for attrib in attribList:
        fieldMap[attrib] = element.get(attrib, '')
    print "fieldMap:", fieldMap

def readAllChildren(element, fieldMap, attribList, rowList):
    for childElem in element:
        print "Found child:", childElem
        readAttribs(childElem, fieldMap, attribList)
        if len(childElem) > 0:
            readAllChildren(childElem, fieldMap, attribList, rowList)
        rowList.append(fieldMap.copy())
        print "len(rowList) =", len(rowList)
        childElem.clear()

def process_xml_original(xml_file):
    attribList = ['name', 'age', 'id']
    context = etree.iterparse(xml_file, events=("start",))
    for row in parseXml(context, attribList):
        print "Row:", row
Running it with some dummy data:
>>> from cStringIO import StringIO
>>> test_xml = """\
... <family>
...     <person name="somebody" id="5" />
...     <person age="45" />
...     <person name="Grandma" age="62">
...         <child age="35" id="10" name="Mom">
...             <grandchild age="7 and 3/4" />
...             <grandchild id="12345" />
...         </child>
...     </person>
...     <something-completely-different />
... </family>
... """
>>> process_xml_original(StringIO(test_xml))
start element: <Element family at 0x105ca58>
fieldMap: {'age': '', 'name': '', 'id': ''}
Found child: <Element person at 0x105ca80>
fieldMap: {'age': '', 'name': 'somebody', 'id': '5'}
len(rowList) = 1
Found child: <Element person at 0x105c468>
fieldMap: {'age': '45', 'name': '', 'id': ''}
len(rowList) = 2
Found child: <Element person at 0x105c7b0>
fieldMap: {'age': '62', 'name': 'Grandma', 'id': ''}
Found child: <Element child at 0x106e468>
fieldMap: {'age': '35', 'name': 'Mom', 'id': '10'}
Found child: <Element grandchild at 0x106e148>
fieldMap: {'age': '7 and 3/4', 'name': '', 'id': ''}
len(rowList) = 3
Found child: <Element grandchild at 0x106e490>
fieldMap: {'age': '', 'name': '', 'id': '12345'}
len(rowList) = 4
len(rowList) = 5
len(rowList) = 6
Found child: <Element something-completely-different at 0x106e4b8>
fieldMap: {'age': '', 'name': '', 'id': ''}
len(rowList) = 7
Row: {'age': '', 'name': 'somebody', 'id': '5'}
Row: {'age': '45', 'name': '', 'id': ''}
Row: {'age': '7 and 3/4', 'name': '', 'id': ''}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105ca80>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105c468>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105c7b0>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element child at 0x106e468>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element grandchild at 0x106e148>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element grandchild at 0x106e490>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element something-completely-different at 0x106e4b8>
fieldMap: {'age': '', 'name': '', 'id': ''}
It's a little hard to read, but you can see how it walks down the whole tree from the root tag on the very first pass, building a row for every element in the entire document. You will also notice that it does not even stop there: since the element.clear() call comes after the yield statement in parseXml(..), it does not execute until the second iteration (i.e. when the next element of the tree is requested).
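To see why the cleanup is delayed, here is a minimal, stand-alone sketch (not part of your code) of how a generator behaves: nothing after a yield runs until the caller asks for the next value, so any cleanup placed after it always lags one iteration behind.

def demo():
    for i in range(3):
        yield i
        print "cleanup for", i   # only runs when the next value is requested

for value in demo():
    print "got", value

The "cleanup" lines trail one step behind the values, just as element.clear() trails one element behind in parseXml(..).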
Incremental processing FTW
A simple solution is to let iterparse(..) do its job: parse iteratively! The following captures the same information but processes it incrementally, one element at a time:
def do_something_with_data(data):
    """This just prints it out. Yours will probably be more interesting."""
    print "Got data: ", data

def process_xml_iterative(xml_file):
    attribList = ['name', 'age', 'id']
    # iterparse defaults to "end" events, so each element is complete when we see it
    for event, element in etree.iterparse(xml_file):
        print "%s element: %s" % (event, element)
        data = {}
        for attr in attribList:
            data[attr] = element.get(attr, u"")
        do_something_with_data(data)
        element.clear()
        del element
Working with the same dummy XML:
>>> print test_xml
<family>
    <person name="somebody" id="5" />
    <person age="45" />
    <person name="Grandma" age="62">
        <child age="35" id="10" name="Mom">
            <grandchild age="7 and 3/4" />
            <grandchild id="12345" />
        </child>
    </person>
    <something-completely-different />
</family>

>>> process_xml_iterative(StringIO(test_xml))
end element: <Element person at 0x105cc10>
Got data: {'age': u'', 'name': 'somebody', 'id': '5'}
end element: <Element person at 0x106e468>
Got data: {'age': '45', 'name': u'', 'id': u''}
end element: <Element grandchild at 0x106e148>
Got data: {'age': '7 and 3/4', 'name': u'', 'id': u''}
end element: <Element grandchild at 0x106e490>
Got data: {'age': u'', 'name': u'', 'id': '12345'}
end element: <Element child at 0x106e508>
Got data: {'age': '35', 'name': 'Mom', 'id': '10'}
end element: <Element person at 0x106e530>
Got data: {'age': '62', 'name': 'Grandma', 'id': u''}
end element: <Element something-completely-different at 0x106e558>
Got data: {'age': u'', 'name': u'', 'id': u''}
end element: <Element family at 0x105c6e8>
Got data: {'age': u'', 'name': u'', 'id': u''}
This should greatly improve both the speed and the memory footprint of your script. In addition, by hooking into the 'end' event, you can clear and delete elements as you go, rather than waiting until all of their children have been processed.
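If memory is still a concern, a common lxml idiom (this goes beyond your original code, so treat it as an optional sketch) is to also drop the references the root element keeps to siblings that have already been handled:

from lxml import etree

def process_xml_lean(xml_file, attrib_list=('name', 'age', 'id')):
    for event, element in etree.iterparse(xml_file):
        data = {}
        for attr in attrib_list:
            data[attr] = element.get(attr, u"")
        do_something_with_data(data)   # defined above
        element.clear()
        # The root keeps references to already-parsed siblings; drop them too.
        while element.getprevious() is not None:
            del element.getparent()[0]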
Depending on your dataset, it may also be a good idea to only process certain types of elements. The root element, for one, is probably not very meaningful, and other nested elements could fill your dataset with a lot of {'age': u'', 'id': u'', 'name': u''}.
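For example, here is a minimal sketch that filters on the tag name before building a row (the tag names are the ones from the dummy XML above; substitute your own):

from lxml import etree

WANTED_TAGS = ('person', 'child', 'grandchild')   # tags from the dummy XML

def process_xml_filtered(xml_file, attrib_list=('name', 'age', 'id')):
    for event, element in etree.iterparse(xml_file):
        if element.tag in WANTED_TAGS:   # skip <family>, <something-completely-different>, ...
            data = {}
            for attr in attrib_list:
                data[attr] = element.get(attr, u"")
            do_something_with_data(data)   # defined above
        element.clear()
        del element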
Or with SAX
As an aside, when I read "XML" and "low memory", my mind always jumps straight to SAX, which is another way you could attack this problem. Using the built-in xml.sax module:
import xml.sax

class AttributeGrabber(xml.sax.handler.ContentHandler):
    """SAX Handler which will store selected attribute values."""
    def __init__(self, target_attrs=()):
        self.target_attrs = target_attrs

    def startElement(self, name, attrs):
        print "Found element: ", name
        data = {}
        for target_attr in self.target_attrs:
            data[target_attr] = attrs.get(target_attr, u"")

        # Same as the iterparse version: hand each row off for processing
        do_something_with_data(data)

def process_xml_sax(xml_file):
    grabber = AttributeGrabber(target_attrs=('name', 'age', 'id'))
    xml.sax.parse(xml_file, grabber)
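To try it on the same dummy data as before (reusing test_xml, StringIO and do_something_with_data from above):

>>> process_xml_sax(StringIO(test_xml))

This prints a "Found element:" / "Got data:" pair for every element, this time as each start tag is encountered rather than when the end tag is reached.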
You will need to evaluate both options and pick whichever works best for your situation (and maybe run a couple of benchmarks, if this is something you will be doing often).
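A rough way to compare the two on your real data might look like this ("big_file.xml" is a hypothetical placeholder for your input; both driver functions are defined above, and in practice you would silence their print statements first):

import timeit

lxml_secs = timeit.timeit(lambda: process_xml_iterative(open("big_file.xml")), number=3)
sax_secs = timeit.timeit(lambda: process_xml_sax(open("big_file.xml")), number=3)
print "lxml/iterparse: %.2fs   xml.sax: %.2fs" % (lxml_secs, sax_secs)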
Be sure to follow up with how things turn out!
Edit based on subsequent comments
Implementing one of the above solutions may require some changes to the overall structure of your code, but anything you have in mind should still be doable. For example, to process "rows" in batches, you could have:
def process_xml_batch(xml_file, batch_size=10):
    ATTRS = ('name', 'age', 'id')

    batch = []

    for event, element in etree.iterparse(xml_file):
        data = {}
        for attr in ATTRS:
            data[attr] = element.get(attr, u"")

        batch.append(data)
        element.clear()
        del element

        if len(batch) == batch_size:
            do_something_with_batch(batch)
            batch = []

    # Don't forget whatever is left over in the final, partial batch
    if batch:
        do_something_with_batch(batch)
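do_something_with_batch(..) is whatever you want to do with each group of rows. A placeholder version, plus a quick run against the dummy data from earlier, might look like:

def do_something_with_batch(batch):
    """Placeholder: write the batch to a database, a file, etc."""
    print "Batch of %d rows:" % len(batch)
    for row in batch:
        print "   ", row

process_xml_batch(StringIO(test_xml), batch_size=3)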