I have a series of large XML files (~3 GB each) that I am trying to process. The rough XML format is:
<FILE>
  <DOC>
    <FIELD1>
      Some text.
    </FIELD1>
    <FIELD2>
      Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
      Some text.
    </FIELD3>
    <FIELD4>
      Some text. Etc.
    </FIELD4>
  </DOC>
  <DOC>
    <FIELD1>
      Some text.
    </FIELD1>
    <FIELD2>
      Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
      Some text.
    </FIELD3>
    <FIELD4>
      Some text. Etc.
    </FIELD4>
  </DOC>
</FILE>
My current approach (mimicking the code visible at http://effbot.org/zone/element-iterparse.htm#incremental-parsing ):
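import xml.etree.ElementTree as ET

# xml_file is the path to one of the ~3 GB files
tree = ET.iterparse(xml_file)
tree = iter(tree)
# take the element from the first event, intended to be the root
event, root = next(tree)
for event, elem in tree:
    if event == "end" and elem.tag == "DOC":
        # (process the DOC element here)
        root.clear()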
It still blows up, using all of my system memory (16 GB). At first I thought the problem was the position of root.clear(), so I tried moving it to after the if statement, but that had no effect. Given this, I am not quite sure how to proceed other than "get more memory."
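For comparison, here is a minimal sketch of the incremental pattern from the linked page, run on a small in-memory sample instead of the real files. One difference worth checking is that the linked example requests both "start" and "end" events, so the first event yields the enclosing element. With the default events, the first event is the "end" of whichever element closes first. The helper names (first_event, count_docs) and the SAMPLE data are hypothetical and only there for illustration. This is not a confirmed diagnosis of the memory use on the 3 GB files.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a real file: 1000 small <DOC> records.
SAMPLE = b"<FILE>" + b"<DOC><FIELD1>Some text.</FIELD1></DOC>" * 1000 + b"</FILE>"


def first_event(xml_bytes, events=None):
    """Return (event, tag) for the first item that iterparse yields."""
    context = iter(ET.iterparse(io.BytesIO(xml_bytes), events=events))
    event, elem = next(context)
    return event, elem.tag


def count_docs(source, tag="DOC"):
    """Stream `source`, count `tag` elements, and clear the root after each one."""
    # Requesting "start" as well makes the first event the start of the root.
    context = iter(ET.iterparse(source, events=("start", "end")))
    event, root = next(context)
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1  # (real per-record processing would go here)
            root.clear()  # drop the children accumulated under the root
    # Report the root tag, the record count, and how many children remain.
    return root.tag, count, len(root)


if __name__ == "__main__":
    print(first_event(SAMPLE))                            # default events
    print(first_event(SAMPLE, events=("start", "end")))   # as on the linked page
    print(count_docs(io.BytesIO(SAMPLE)))
```

Printing the tag of the element bound to root in the real script is a quick way to confirm which element root.clear() is actually clearing.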
EDIT
Removed previous edit because it was wrong.