Reading huge XML files and running into a MemoryError

I have a very large XML file (20 GB, to be precise, and yes, I need all of it). When I try to load the file, I get this error:

    Python(23358) malloc: *** mmap(size=140736680968192) failed (error code=12)
    *** error: can't allocate region
    *** set a breakpoint in malloc_error_break to debug
    Traceback (most recent call last):
      File "file.py", line 5, in <module>
        code = xml.read()
    MemoryError

This is the code I currently have for reading the XML file:

    from bs4 import BeautifulSoup

    xml = open('pages_full.xml', 'r')
    code = xml.read()
    xml.close()
    soup = BeautifulSoup(code)

Now, how can I fix this error so that I can keep working on the script? I could try splitting the file into separate files, but since I do not know how that would affect BeautifulSoup and the XML data, I would rather not.

(The XML data is a database dump from a wiki that I run; I am using it to import data from different time periods, drawing directly on information from many pages.)

python xml mediawiki beautifulsoup
1 answer

Do not use BeautifulSoup to try to parse an XML file this large. Use the ElementTree API instead; specifically, its iterparse() function lets you parse the file as a stream, handle the information as you are notified of each element, and then delete the element again:

    from xml.etree import ElementTree as ET

    filename = 'pages_full.xml'  # the 20 GB dump from the question

    for event, element in ET.iterparse(filename):
        # at this point `element` is a complete, fully parsed element
        if element.tag == 'yourelement':
            pass  # do something with this element
        # then clean up to free the memory the element holds on to
        element.clear()

With this event-based approach you never need to hold the whole XML document in memory; you extract only what you need and discard the rest.
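For the wiki dump in the question, a page-by-page loop might look like the sketch below. It assumes the dump follows the MediaWiki export schema, where each article lives in a <page> element with <title> and per-revision <timestamp> children, and where those tags carry an XML namespace, so they are matched by local name here; the fields you actually need will differ.

    from xml.etree import ElementTree as ET

    def localname(tag):
        # strip the '{namespace}' prefix that the MediaWiki export schema adds
        return tag.rsplit('}', 1)[-1]

    for event, element in ET.iterparse('pages_full.xml'):
        if localname(element.tag) != 'page':
            continue
        title = None
        timestamp = None
        for child in element.iter():
            if localname(child.tag) == 'title':
                title = child.text
            elif localname(child.tag) == 'timestamp':
                # with several revisions per page this keeps the last one seen
                timestamp = child.text
        print(title, timestamp)
        element.clear()  # discard the finished <page> to keep memory flat

One caveat: element.clear() empties the page element, but the root element still keeps a (now empty) reference to it; for most dumps that overhead is small, and lxml (mentioned below) makes it easy to drop those references as well.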

See the iterparse() tutorial and documentation.

Alternatively, you can use the lxml library; it offers the same API in a faster, more feature-rich package.
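As a rough illustration of the same loop with lxml (assuming lxml is installed and recent enough to support the '{*}' namespace wildcard in tag selection), the usual pattern also deletes the already-processed siblings that the ancestors would otherwise keep referencing:

    from lxml import etree

    for event, element in etree.iterparse('pages_full.xml', tag='{*}page'):
        # ... process the finished <page> element here ...
        element.clear()
        # drop references the ancestors still hold to earlier, cleared pages
        while element.getprevious() is not None:
            del element.getparent()[0]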

