Using Python's ElementTree iterparse and writing the modified tree to an output file

Question

I need to parse a very large (~40 GB) XML file, remove certain elements from it, and write the result to a new XML file. I am trying to use iterparse from Python's ElementTree, but I am confused about how to modify the tree and then write the resulting tree to a new XML file. I read the documentation on iterparse, but it did not clarify the situation. Is there a simple way to do this?

Thanks!

EDIT: Here is what I have so far.

    import xml.etree.ElementTree as ET
    import re

    date_pages = []
    f = open('dates_texts.xml', 'w+')
    tree = ET.iterparse("sample.xml")
    for i, element in tree:
        if element.tag == 'page':
            for page_element in element:
                if page_element.tag == 'revision':
                    for revision_element in page_element:
                        if revision_element.tag == '{text':
                            if len(re.findall('20\d\d', revision_element.text.encode('utf8'))) == 0:
                                element.clear()

python xml elementtree

Latatecoder Mar 14 '13 at 2:04

2 answers

jfs · Answer 1 · 2013-03-17T03:59:51+0000

If you have a large XML file that cannot fit in memory, you can serialize it one element at a time. For example, assuming the document has the structure <root><page/><page/><page/>...</root> and ignoring possible namespace issues:

    # cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
    import xml.etree.ElementTree as etree

    def getelements(filename_or_file, tag):
        context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
        _, root = next(context)  # get root element
        for event, elem in context:
            if event == 'end' and elem.tag == tag:
                yield elem
                root.clear()  # free memory

    with open('output.xml', 'wb') as file:
        # start root
        file.write(b'<root>')
        for page in getelements('sample.xml', 'page'):
            if keep(page):
                file.write(etree.tostring(page, encoding='utf-8'))
        # close root
        file.write(b'</root>')

where keep(page) returns True if page should be saved, for example:

    import re

    def keep(page):
        # all <revision> elements must have 20xx in them
        # (rev.text is None for an empty element, so fall back to '')
        return all(re.search(r'20\d\d', rev.text or '')
                   for rev in page.iterfind('revision'))
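To see the two pieces working together, here is a minimal end-to-end sketch that runs the same streaming filter against a small in-memory document instead of a 40 GB file. The sample data and the `filter_pages` wrapper are assumptions made for illustration; `getelements` and `keep` follow the definitions above.

```python
import io
import re
import xml.etree.ElementTree as etree

# Hypothetical sample standing in for the real (much larger) input file.
SAMPLE = (b"<root>"
          b"<page><title>a</title><revision>edited in 2012</revision></page>"
          b"<page><title>b</title><revision>no year here</revision></page>"
          b"<page><title>c</title><revision>2009 and 2010</revision></page>"
          b"</root>")

def getelements(source, tag):
    """Yield each completed <tag> element, then release it."""
    context = iter(etree.iterparse(source, events=("start", "end")))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            root.clear()  # drop already-processed children to bound memory

def keep(page):
    """Keep a page only if every <revision> mentions a year 20xx."""
    return all(re.search(r"20\d\d", rev.text or "")
               for rev in page.iterfind("revision"))

def filter_pages(source, sink):
    """Copy pages that pass keep() from source to sink, one at a time."""
    sink.write(b"<root>")
    for page in getelements(source, "page"):
        if keep(page):
            sink.write(etree.tostring(page, encoding="utf-8"))
    sink.write(b"</root>")

out = io.BytesIO()
filter_pages(io.BytesIO(SAMPLE), out)
result = out.getvalue()
print(result.decode("utf-8"))
```

Because each page is serialized and then cleared before the next one is parsed, memory use depends on the size of the largest single page rather than on the size of the whole file.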

For comparison, to modify a small XML file that fits in memory, you can do:

    # parse small xml
    tree = etree.parse('sample.xml')

    # remove some root/page elements from xml
    root = tree.getroot()
    for page in root.findall('page'):
        if not keep(page):
            root.remove(page)  # modify inplace

    # write to a file modified xml tree
    tree.write('output.xml', encoding='utf-8')
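The streaming example in this answer sets namespaces aside. ElementTree reports a namespaced tag as `{uri}local`, so a plain comparison such as `elem.tag == 'page'` will not match in a document that declares a default namespace. The following is only a sketch of one way to handle that: the namespace URI `http://example.com/ns`, the sample data, and the helpers `localname` and `iter_local` are all hypothetical names introduced for illustration.

```python
import io
import xml.etree.ElementTree as etree

NS = "http://example.com/ns"  # hypothetical namespace URI

NS_SAMPLE = (b'<root xmlns="http://example.com/ns">'
             b"<page><title>a</title></page>"
             b"<page><title>b</title></page>"
             b"</root>")

def localname(tag):
    """Return the tag without its '{uri}' prefix, if it has one."""
    return tag.rsplit("}", 1)[-1]

def iter_local(source, name):
    """Yield elements whose local name equals `name`, clearing as we go."""
    context = iter(etree.iterparse(source, events=("start", "end")))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and localname(elem.tag) == name:
            yield elem
            root.clear()  # free memory held by processed children

# Child lookups need the qualified name as well.
titles = [page.findtext("{%s}title" % NS)
          for page in iter_local(io.BytesIO(NS_SAMPLE), "page")]
print(titles)
```

When such elements are serialized with `etree.tostring`, ElementTree generates prefixes like `ns0` unless a prefix has been registered with `etree.register_namespace`, so the output may look different from the input even though it is equivalent XML.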

Qanthelas · Answer 2 · 2013-03-17T02:08:38+0000

Perhaps the answer to my similar question may help you.

As for how to write this back to an .xml file, I ended up with this at the bottom of my script:

    with open('File.xml', 'wb') as t:  # I'd suggest using a different file name here than your original
        # tostring() returns bytes, so write them in one call in binary mode;
        # the with block closes the file, so no explicit close() is needed
        t.write(ET.tostring(doc))
    print('File.xml Complete')  # Console message that file wrote successfully, can be omitted

The doc variable comes from earlier in my script. Where you have tree = ET.iterparse("sample.xml"), I have this:

    doc = ET.parse(filename)

I use lxml instead of ElementTree, but I think the writing part will work either way (as far as I know, it is mainly the XPath features that ElementTree cannot handle). I import lxml with this line:

    from lxml import etree as ET

Hope this (along with my related question for some additional code context if you need it) can help you!
