How to save pandas data file in gzipped format directly?

Question

I have a pandas data frame called df.

I want to save this in gzipped format. One way to do this:

    import gzip
    import pandas

    # df.save() is the 2012-era API; later pandas versions use df.to_pickle()
    df.save('filename.pickle')
    f_in = open('filename.pickle', 'rb')
    f_out = gzip.open('filename.pickle.gz', 'wb')
    f_out.writelines(f_in)
    f_in.close()
    f_out.close()

However, this requires first creating a file called filename.pickle. Is there a way to do this more directly, i.e., without creating filename.pickle?

When I want to load a data frame that was gzipped, I have to go through the same step of creating an uncompressed pickle file. For example, to read the file filename2.pickle.gz, which is a gzipped pandas pickle, I know of the following method:

    # Decompress to a plain file first, then load it.
    # pandas.load() is the 2012-era API; later versions use pandas.read_pickle()
    f_in = gzip.open('filename2.pickle.gz', 'rb')
    f_out = open('filename2.pickle', 'wb')
    f_out.writelines(f_in)
    f_in.close()
    f_out.close()
    df2 = pandas.load('filename2.pickle')

Can this be done without first creating filename2.pickle?

+7

python pandas gzip

Curious2learn Oct 23 '12 at 14:54

4 answers

Better serialization with compression has since been added to pandas (starting with pandas 0.20.0). Here is an example of how it can be used:

 df.to_csv("my_file.gz", compression="gzip")

For more information, such as the available compression formats, check the docs.
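Applied to the pickle workflow from the question, a round trip might look like the sketch below. It assumes pandas 0.20.0 or later, where `to_pickle` and `read_pickle` accept the same `compression` argument. The sample frame and the temporary path are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; any DataFrame works the same way.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Work in a temporary directory so the example cleans up after itself.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "filename.pickle.gz")

    # Pickle and gzip in one step; no uncompressed file is created.
    df.to_pickle(path, compression="gzip")

    # Read the frame back directly from the gzipped file.
    df2 = pd.read_pickle(path, compression="gzip")

    # Check that the file on disk starts with the gzip magic bytes.
    with open(path, "rb") as f:
        magic = f.read(2)

round_trip_ok = df.equals(df2)
```

With the default `compression="infer"`, pandas picks gzip from the `.gz` extension, so the argument can usually be left out.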

+10

Seanny123 May 19 '16 at 15:33

For some reason, the Python zlib module can decompress gzip data, but it cannot compress directly to that format, at least as documented. This is despite the remarkably misleading title of its documentation page, "Compression compatible with gzip".

You can instead compress to the zlib format using zlib.compress or zlib.compressobj, then strip the zlib header and trailer and add a gzip header and trailer, since the zlib and gzip formats use the same compressed data format. This will give you gzip data. The zlib header is fixed at two bytes and the trailer at four bytes, so they are easy to strip. Then you can prepend a basic ten-byte gzip header, "\x1f\x8b\x08\0\0\0\0\0\0\xff" (C string format), and append the gzip trailer: a four-byte CRC followed by the four-byte uncompressed length (modulo 2^32), both in little-endian order. The CRC can be calculated using zlib.crc32.
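The header and trailer swap described above can be sketched as follows. This is a minimal illustration using only the standard library, not a tuned implementation. The function name `gzip_via_zlib` and the sample payload are hypothetical. Note that the gzip format (RFC 1952) puts the uncompressed length after the CRC in the trailer, so the sketch writes both.

```python
import gzip
import struct
import zlib


def gzip_via_zlib(data: bytes, level: int = 6) -> bytes:
    """Build a gzip stream from zlib output by swapping header and trailer."""
    z = zlib.compress(data, level)
    # Drop the 2-byte zlib header and the 4-byte Adler-32 trailer,
    # leaving the raw deflate stream shared by both formats.
    raw_deflate = z[2:-4]
    # Minimal 10-byte gzip header: magic, deflate method, no flags,
    # zero mtime, no extra flags, OS = unknown (0xff).
    header = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"
    # gzip trailer: CRC-32 of the uncompressed data, then its length
    # modulo 2**32, both little-endian.
    trailer = struct.pack(
        "<II", zlib.crc32(data) & 0xFFFFFFFF, len(data) & 0xFFFFFFFF
    )
    return header + raw_deflate + trailer


# Hypothetical payload standing in for pickled data frame bytes.
payload = b"pandas data frame bytes would go here" * 100
gz = gzip_via_zlib(payload)

# gzip.decompress checks the CRC and length, so this validates the stream.
restored = gzip.decompress(gz)
```

In practice the `gzip` module does all of this already; the manual version is mainly useful for seeing how the two formats relate.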

+2

Mark Adler Oct 23 '12 at 15:18

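You can dump the data to a string using pickle.dumps and then write it to disk using the gzip module:

    import gzip
    import pickle

    # The third argument is the compression level (1 = fastest, 9 = smallest)
    file = gzip.GzipFile('filename.pickle.gz', 'wb', 3)
    file.write(pickle.dumps(df))
    file.close()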
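The same idea also covers the loading half of the question. The sketch below pickles straight into a gzip stream and unpickles straight out of it, so no uncompressed file is ever written. A plain dict stands in for the data frame here (an assumption made to keep the example standard-library only); a DataFrame pickles the same way.

```python
import gzip
import os
import pickle
import tempfile

# Stand-in for the data frame; any picklable object works the same way.
obj = {"a": [1, 2, 3], "b": ["x", "y", "z"]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "filename.pickle.gz")

    # Write: pickle directly into the gzip stream.
    with gzip.open(path, "wb", compresslevel=3) as f:
        pickle.dump(obj, f)

    # Read: unpickle directly from the gzip stream.
    with gzip.open(path, "rb") as f:
        loaded = pickle.load(f)
```

Using `with` blocks closes both files even if pickling raises, which the explicit `close()` calls do not guarantee.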

+1

Viacheslav Nefedov Jun 22 '13 at 10:03

Wes McKinney · Accepted Answer · 2012-10-27T18:19:16+0000

We plan to add better serialization with compression in the future. Stay tuned for pandas updates.
