How to gzip a large text file without a MemoryError?

I use the following simple Python script to compress a large text file (e.g. 10 GB) on an EC2 m3.large instance. However, I always get a MemoryError:

    import gzip

    with open('test_large.csv', 'rb') as f_in:
        with gzip.open('test_out.csv.gz', 'wb') as f_out:
            f_out.writelines(f_in)
            # or the following:
            # for line in f_in:
            #     f_out.write(line)

This is the traceback I got:

    Traceback (most recent call last):
      File "test.py", line 8, in <module>
        f_out.writelines(f_in)
    MemoryError

I have read some discussions of this problem, but it is still not entirely clear to me how to handle it. Can someone give me a clearer answer on how to deal with this?

3 answers

This is strange. I would expect this error if you tried to compress a large binary file that did not contain many newlines, since such a file could contain a "line" that was too big for your RAM, but it should not happen on a line-structured CSV file.

But in any case, it is not very efficient to compress a file line by line. Even though the OS buffers disk I/O, it is generally much faster to read and write larger blocks of data, e.g. 64 kB.

I have 2 GB of RAM on this machine, and I just successfully used the program below to compress a 2.8 GB tar archive.

    #! /usr/bin/env python
    import gzip
    import sys

    blocksize = 1 << 16     # 64 kB

    def gzipfile(iname, oname, level):
        with open(iname, 'rb') as f_in:
            f_out = gzip.open(oname, 'wb', level)
            while True:
                block = f_in.read(blocksize)
                if block == '':
                    break
                f_out.write(block)
            f_out.close()
        return

    def main():
        if len(sys.argv) < 3:
            print "gzip compress in_file to out_file"
            print "Usage:\n%s in_file out_file [compression_level]" % sys.argv[0]
            exit(1)
        iname = sys.argv[1]
        oname = sys.argv[2]
        level = int(sys.argv[3]) if len(sys.argv) > 3 else 6
        gzipfile(iname, oname, level)

    if __name__ == '__main__':
        main()

I am running Python 2.6.6, where gzip.open() does not support with.


As Andrew Bay notes in the comments, if block == '': will not work correctly in Python 3, since block contains bytes, not a string, and an empty bytes object does not compare equal to an empty text string. We could check the length of the block, or compare against b'' (which also works in Python 2.6+), but the simplest way is if not block:
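For reference (this is not the answerer's original code, just a minimal sketch): the same block-copy loop can be written so it runs unchanged on Python 2.6+ and Python 3, using if not block: for the end-of-file test and contextlib.closing in place of the context-manager support that GzipFile gained in later versions.

    import contextlib
    import gzip

    blocksize = 1 << 16     # 64 kB

    def gzipfile(iname, oname, level=6):
        # contextlib.closing gives with-statement behaviour even on Pythons
        # where GzipFile is not a context manager (e.g. 2.6); "if not block:"
        # is true for an empty str (Python 2) and an empty bytes (Python 3).
        with open(iname, 'rb') as f_in:
            with contextlib.closing(gzip.open(oname, 'wb', level)) as f_out:
                while True:
                    block = f_in.read(blocksize)
                    if not block:
                        break
                    f_out.write(block)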


The problem here has nothing to do with gzip, and everything to do with reading line by line from a 10 GB file that has no newlines in it:

As a side note, the file I used to test the Python gzip functionality was generated with fallocate -l 10G bigfile_file.

This gives you a 10 GB sparse file made entirely of 0 bytes. That means there are no newline bytes, which means the first "line" is 10 GB long, which means it takes 10 GB of memory just to read the first line. (Or possibly even 20 or 40 GB if you are using pre-3.3 Python and trying to read it as Unicode.)
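As an aside (not part of the original answer): if you are unsure whether a file is safe to iterate line by line, a small diagnostic like the sketch below scans it in fixed-size chunks and reports the longest "line" without ever holding one in memory. On the sparse file described above it would report a single line of roughly 10 GB.

    import sys

    def longest_line_length(path, chunksize=1 << 20):
        # Scan the file in 1 MB chunks and track the longest run of bytes
        # between newlines; memory use stays bounded by chunksize.
        longest = 0
        current = 0
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunksize)
                if not chunk:
                    break
                pieces = chunk.split(b'\n')
                # the first piece continues the line carried over from the
                # previous chunk; every later piece starts a fresh line
                current += len(pieces[0])
                longest = max(longest, current)
                for piece in pieces[1:]:
                    current = len(piece)
                    longest = max(longest, current)
        return longest

    if __name__ == '__main__':
        print(longest_line_length(sys.argv[1]))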

If you want to copy binary data, do not copy it line by line. Whether it is a regular file, a GzipFile that decompresses for you on the fly, socket.makefile(), or anything else, you will have the same problem.

The solution is to copy chunk by chunk. Or just use shutil.copyfileobj, which does that for you automatically.

    import gzip
    import shutil

    with open('test_large.csv', 'rb') as f_in:
        with gzip.open('test_out.csv.gz', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

By default, copyfileobj uses a chunk size optimized to often be very good and never be very bad. In this case, you may actually want a smaller size, or a larger one; it is hard to predict which a priori.* So test it by using timeit with different bufsize arguments (say, powers of 4 from 1 kB to 8 MB) to copyfileobj; a rough benchmarking sketch follows the footnote below. But the default 16 kB will probably be good enough unless you are doing a lot of this.

* If the buffer size is too big, you can end up alternating long chunks of I/O with long chunks of processing. If it is too small, it can take multiple reads to fill a single gzip block.
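To make that benchmarking suggestion concrete, here is a rough sketch (the file names are the ones from the question and purely illustrative) that times shutil.copyfileobj with a range of buffer sizes via timeit:

    import gzip
    import shutil
    import timeit

    def compress(bufsize):
        # one full compression pass with the given copy buffer size
        with open('test_large.csv', 'rb') as f_in:
            with gzip.open('test_out.csv.gz', 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out, bufsize)

    # powers of 4 from 1 kB up to 4 MB, plus 8 MB
    for bufsize in (1 << 10, 4 << 10, 16 << 10, 64 << 10,
                    256 << 10, 1 << 20, 4 << 20, 8 << 20):
        seconds = timeit.timeit(lambda: compress(bufsize), number=1)
        print('%9d bytes: %.1f s' % (bufsize, seconds))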


It is strange to get a memory error even when reading the file line by line. I suppose it is because you have very little available memory and very large lines. You should then use binary reads:

    import gzip

    # adapt the size value: a small value will take more time,
    # a very large value could cause memory errors
    size = 8096

    with open('test_large.csv', 'rb') as f_in:
        with gzip.open('test_out.csv.gz', 'wb') as f_out:
            while True:
                data = f_in.read(size)
                if not data:    # true for empty str (Python 2) and empty bytes (Python 3)
                    break
                f_out.write(data)
