How to save a pandas data frame in gzipped format directly?

I have a pandas data frame called df.

I want to save this in gzipped format. One way to do this:

    import gzip
    import pandas
    df.save('filename.pickle')
    f_in = open('filename.pickle', 'rb')
    f_out = gzip.open('filename.pickle.gz', 'wb')
    f_out.writelines(f_in)
    f_in.close()
    f_out.close()

However, to do this, you must first create a file called filename.pickle. Is there a way to do this more directly, i.e. without creating filename.pickle?

When I want to load a data frame that was gzipped, I need to go through the same step of creating filename.pickle. For example, to read the file filename2.pickle.gz, which holds a gzipped pandas data frame, I know about the following method:

    f_in = gzip.open('filename2.pickle.gz', 'rb')
    # The decompressed copy must be written with plain open(), not gzip.open()
    f_out = open('filename2.pickle', 'wb')
    f_out.writelines(f_in)
    f_in.close()
    f_out.close()
    df2 = pandas.load('filename2.pickle')

Can this be done without first creating filename2.pickle?

+7
4 answers

In the future, we plan to add better serialization with compression. Stay tuned for pandas

+8

Improved serialization with compression has been added to pandas (starting with version 0.20.0). Here is an example of how it can be used:

 df.to_csv("my_file.gz", compression="gzip") 

For more information, for example about the available compression formats, see the documentation.
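
Since the question is about pickled data frames rather than CSV, here is a minimal sketch of the same idea applied to pickle files. It assumes pandas 0.20.0 or later, where `to_pickle` and `read_pickle` accept a `compression` argument; the sample data and the temporary path are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the asker's df
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write to a temporary directory so the sketch leaves no stray files behind
path = os.path.join(tempfile.mkdtemp(), "filename.pickle.gz")

# Pickle and gzip in one step; no uncompressed intermediate file is created
df.to_pickle(path, compression="gzip")

# Read the gzipped pickle back directly
df2 = pd.read_pickle(path, compression="gzip")
```

Both calls take the compression explicitly here so that the sketch does not depend on the file extension.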

+10

For some reason, the Python zlib module can decompress gzip data, but it cannot directly compress to that format, at least not as documented. This is despite the remarkably misleading title of the zlib documentation page, "Compression compatible with gzip".

Instead, you can compress to the zlib format with zlib.compress or zlib.compressobj, then strip the zlib header and trailer and add a gzip header and trailer, since the zlib and gzip formats use the same compressed data format. This gives you gzip data. The zlib header is a fixed two bytes and the trailer four bytes, so they are easy to strip. Then prepend a basic ten-byte gzip header, "\x1f\x8b\x08\0\0\0\0\0\0\xff" (C string format), and append the gzip trailer: a four-byte CRC followed by the four-byte uncompressed length (modulo 2^32), both in little-endian order. The CRC can be computed with zlib.crc32.
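
The header swap described above can be sketched as follows. The function name `zlib_to_gzip` is hypothetical, and the sketch assumes the input fits in memory. Note that the gzip trailer defined in RFC 1952 holds two little-endian fields, the CRC-32 and the uncompressed length modulo 2^32, so both are written here.

```python
import gzip
import struct
import zlib

# Basic ten-byte gzip header: magic bytes, deflate method, no flags,
# zero mtime, no extra flags, OS field set to "unknown" (0xff)
GZIP_HEADER = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"


def zlib_to_gzip(data: bytes, level: int = 6) -> bytes:
    """Compress data with zlib, then rewrap the deflate stream as gzip."""
    zlib_stream = zlib.compress(data, level)
    # Drop the 2-byte zlib header and the 4-byte Adler-32 trailer
    raw_deflate = zlib_stream[2:-4]
    # gzip trailer: CRC-32 of the input, then input length mod 2**32
    trailer = struct.pack(
        "<II", zlib.crc32(data) & 0xFFFFFFFF, len(data) & 0xFFFFFFFF
    )
    return GZIP_HEADER + raw_deflate + trailer


payload = b"pandas data frame bytes " * 50
gz_bytes = zlib_to_gzip(payload)

# The standard gzip module accepts the result
roundtrip = gzip.decompress(gz_bytes)
```

In current Python versions, `zlib.compressobj` is also documented to accept `wbits` values of 25 to 31, which produce a gzip header and trailer directly, so the manual rewrap may not be needed there.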

+2

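You can serialize the data to a byte string with pickle.dumps and then write it to disk with the gzip module:

    import gzip
    import pickle
    # The third argument is the compression level (1 = fastest, 9 = smallest)
    file = gzip.GzipFile('filename.pickle.gz', 'wb', 3)
    file.write(pickle.dumps(df))
    file.close()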
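
The same approach also covers the loading half of the question. Below is a minimal sketch using only the standard library; the helper names `save_gz` and `load_gz` are hypothetical, and a plain dict stands in for the data frame so that the example runs without pandas. Any picklable object should work the same way.

```python
import gzip
import os
import pickle
import tempfile


def save_gz(obj, path, level=3):
    """Pickle obj straight into a gzip file, with no intermediate file."""
    with gzip.open(path, "wb", compresslevel=level) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_gz(path):
    """Unpickle an object straight from a gzip file."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# A dict stands in for the data frame here (assumption for illustration)
data = {"a": [1, 2, 3], "b": ["x", "y", "z"]}
path = os.path.join(tempfile.mkdtemp(), "filename.pickle.gz")

save_gz(data, path)
restored = load_gz(path)
```

Using `with` blocks closes the files even if pickling raises an error, which the explicit `close()` calls would not guarantee.
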
+1