If you have a large XML file that cannot fit in memory, you can process and serialize it one element at a time. For example, assuming the document has the structure <root><page/><page/><page/>...</root> and ignoring possible namespace issues:
    # cElementTree was removed in Python 3.9; plain ElementTree
    # uses the C accelerator automatically when it is available
    import xml.etree.ElementTree as etree

    def getelements(filename_or_file, tag):
        context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
        _, root = next(context)  # get root element
        for event, elem in context:
            if event == 'end' and elem.tag == tag:
                yield elem
                root.clear()  # free memory

    with open('output.xml', 'wb') as file:
        # start root
        file.write(b'<root>')

        for page in getelements('sample.xml', 'page'):
            if keep(page):
                file.write(etree.tostring(page, encoding='utf-8'))

        # close root
        file.write(b'</root>')
where keep(page) returns True if the page should be saved, for example:
    import re

    def keep(page):
        # all <revision> elements must have 20xx in them;
        # an empty <revision/> has text None, so fall back to ''
        return all(re.search(r'20\d\d', rev.text or '')
                   for rev in page.iterfind('revision'))
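To check the streaming approach without touching real files, the two pieces above can be run end to end on an in-memory document. This is only a minimal sketch: the sample data, the `SAMPLE` constant, and the `filter_pages` helper are made up for illustration, and `io.BytesIO` stands in for `sample.xml` and `output.xml`.

```python
import io
import re
import xml.etree.ElementTree as etree

# Hypothetical sample document: the second page has a pre-2000 revision.
SAMPLE = (
    b"<root>"
    b"<page><revision>2009-01-01</revision></page>"
    b"<page><revision>1999-12-31</revision></page>"
    b"<page><revision>2010-05-05</revision><revision>2011-06-06</revision></page>"
    b"</root>"
)

def getelements(source, tag):
    """Yield each completed <tag> element, clearing the root after each one."""
    context = iter(etree.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            root.clear()  # drop already-processed children

def keep(page):
    """Keep a page only if every <revision> contains a 20xx year."""
    return all(re.search(r"20\d\d", rev.text or "")
               for rev in page.iterfind("revision"))

def filter_pages(source, sink):
    """Hypothetical helper: copy the kept pages from source to sink."""
    sink.write(b"<root>")
    for page in getelements(source, "page"):
        if keep(page):
            sink.write(etree.tostring(page, encoding="utf-8"))
    sink.write(b"</root>")

out = io.BytesIO()
filter_pages(io.BytesIO(SAMPLE), out)
result = out.getvalue()
print(result.decode("utf-8"))
```

The output should contain the first and third pages only. Because `filter_pages` takes file-like objects, the same function should work with files opened in binary mode.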
For comparison, to modify a small XML file that fits in memory, you can parse the whole tree and edit it in place:
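    import xml.etree.ElementTree as etree

    # parse small xml (keep() is the predicate defined above)
    tree = etree.parse('sample.xml')

    # remove some root/page elements from xml
    root = tree.getroot()
    for page in root.findall('page'):  # findall() returns a list, so removing while looping is safe
        if not keep(page):
            root.remove(page)  # modify in place

    # write the modified xml tree to a file
    tree.write('output.xml', encoding='utf-8')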
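Both variants above ignore namespaces. If the document declares one, ElementTree reports tags in the form `{uri}page`, so a plain comparison with `'page'` never matches. One possible workaround, sketched here under assumptions, is to compare only the local part of the tag. The namespace URI, `local_name`, and `getelements_ns` are hypothetical names introduced for this example.

```python
import io
import xml.etree.ElementTree as etree

NS = "http://example.com/ns"  # hypothetical namespace URI
SAMPLE = (
    f'<root xmlns="{NS}">'
    "<page><revision>2009</revision></page>"
    "<page/>"
    "</root>"
).encode("utf-8")

def local_name(tag):
    """Strip a leading '{uri}' namespace part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def getelements_ns(source, local_tag):
    """Like getelements(), but matches on the tag's local name only."""
    context = iter(etree.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == local_tag:
            yield elem
            root.clear()  # free memory

tags = [page.tag for page in getelements_ns(io.BytesIO(SAMPLE), "page")]
print(tags)
```

Note that child lookups such as `page.iterfind('revision')` would need the qualified name as well, and that `etree.tostring()` on a namespaced element writes generated prefixes such as `ns0:` unless a prefix has been registered with `etree.register_namespace()`.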
jfs