Using Python's ElementTree iterparse and writing the modified tree to an output file

I need to parse a very large (~40 GB) XML file, remove certain elements from it, and write the result to a new XML file. I am trying to use iterparse from Python's ElementTree, but I am confused about how to modify the tree and then write the resulting tree to a new XML file. I read the documentation on iterparse, but it did not clarify the situation. Is there a simple way to do this?

Thanks!

EDIT: Here is what I have so far.

    import xml.etree.ElementTree as ET
    import re

    date_pages = []
    f = open('dates_texts.xml', 'w+')
    tree = ET.iterparse("sample.xml")
    for i, element in tree:
        if element.tag == 'page':
            for page_element in element:
                if page_element.tag == 'revision':
                    for revision_element in page_element:
                        if revision_element.tag == '{text':
                            if len(re.findall('20\d\d', revision_element.text.encode('utf8'))) == 0:
                                element.clear()
2 answers

If you have a large XML file that cannot fit in memory, you can serialize it one element at a time. For example, assuming the document has the structure <root><page/><page/><page/>...</root> and ignoring possible namespace issues:

    # xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the
    # C accelerator automatically when it is available
    import xml.etree.ElementTree as etree

    def getelements(filename_or_file, tag):
        context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
        _, root = next(context)  # get root element
        for event, elem in context:
            if event == 'end' and elem.tag == tag:
                yield elem
                root.clear()  # free memory

    with open('output.xml', 'wb') as file:
        # start root
        file.write(b'<root>')
        for page in getelements('sample.xml', 'page'):
            if keep(page):
                file.write(etree.tostring(page, encoding='utf-8'))
        # close root
        file.write(b'</root>')

where keep(page) returns True if the page should be kept, for example:

    import re

    def keep(page):
        # all <revision> elements must have 20xx in them
        # (rev.text is None for an empty element, so fall back to '')
        return all(re.search(r'20\d\d', rev.text or '')
                   for rev in page.iterfind('revision'))
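In the question's code, the year is looked up inside a <text> child of each <revision>, not in the revision's own text. For that layout, a variant of the predicate might look like the sketch below. The name keep_by_text and the sample pages are made up for illustration, and the tags are assumed to carry no namespace.

```python
import re
import xml.etree.ElementTree as etree

YEAR = re.compile(r'20\d\d')

def keep_by_text(page):
    """Return True if every revision/text under `page` mentions a 20xx year."""
    # node.text is None for an empty <text/>, so fall back to ''
    return all(YEAR.search(node.text or '') is not None
               for node in page.iterfind('revision/text'))

# Made-up sample pages, for illustration only
with_year = etree.fromstring(
    '<page><revision><text>Edited in 2009</text></revision></page>')
without_year = etree.fromstring(
    '<page><revision><text>No date here</text></revision></page>')

print(keep_by_text(with_year))     # True
print(keep_by_text(without_year))  # False
```

Note that a page with no revision/text elements at all is kept, because all() over an empty sequence is True. Whether that is the desired behavior depends on your data.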

For comparison, to modify a small XML file that fits in memory, you could do:

    # parse small xml
    tree = etree.parse('sample.xml')

    # remove some root/page elements from xml
    root = tree.getroot()
    for page in root.findall('page'):
        if not keep(page):
            root.remove(page)  # modify inplace

    # write to a file modified xml tree
    tree.write('output.xml', encoding='utf-8')
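The streaming example in this answer deliberately ignores namespaces. In a namespaced document, elem.tag has the form '{uri}page', so comparing it with 'page' never matches. One possible workaround is to compare local names only, as in the sketch below. The helper names local_name and getelements_ns and the namespace URI in the sample are assumptions made for illustration; they are not part of the original answer.

```python
import io
import xml.etree.ElementTree as etree

def local_name(tag):
    """Return the tag without its '{namespace}' prefix, if it has one."""
    return tag.rsplit('}', 1)[-1]

def getelements_ns(source, tag):
    """Yield each completed element whose local name equals `tag`."""
    context = iter(etree.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and local_name(elem.tag) == tag:
            yield elem
            root.clear()  # drop already-processed children to free memory

# Made-up namespaced sample, for illustration only
sample = (b'<root xmlns="http://example.com/ns">'
          b'<page>a</page><other/><page>b</page></root>')
texts = [page.text for page in getelements_ns(io.BytesIO(sample), 'page')]
print(texts)  # ['a', 'b']
```

When such elements are serialized with etree.tostring, they carry a generated prefix such as ns0: unless the namespace has been registered beforehand with etree.register_namespace.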

Perhaps the answer to my similar question may help you.

As for writing the result back to an .xml file, I ended up with this at the bottom of my script:

    # I'd suggest using a different file name here than your original
    with open('File.xml', 'wb') as t:
        # tostring() returns bytes, so write them in one call; the with
        # block closes the file automatically
        t.write(ET.tostring(doc))
    print('File.xml Complete')  # Console message that the file was written, can be omitted

The doc variable from earlier in my script corresponds to your tree = ET.iterparse("sample.xml"). I have this instead:

    doc = ET.parse(filename)

I use lxml rather than ElementTree, but I think the writing part will work either way (as far as I know, the main difference is the XPath functionality that ElementTree cannot handle). I import lxml with this line:

    from lxml import etree as ET

Hope this (along with my related question for some additional code context if you need it) can help you!
