Apply gzip compression to csv in python pandas

I am trying to write a dataframe to a gzipped csv in python pandas, using the following:

    import pandas as pd
    import datetime
    import csv
    import gzip

    # Get data (with previous connection and script variables)
    df = pd.read_sql_query(script, conn)

    # Create today's date string, to append to the file name
    todaysdatestring = str(datetime.datetime.today().strftime('%Y%m%d'))
    print(todaysdatestring)

    # Create csv with gzip compression
    df.to_csv('foo-%s.csv.gz' % todaysdatestring,
              sep='|',
              header=True,
              index=False,
              quoting=csv.QUOTE_ALL,
              compression='gzip',
              quotechar='"',
              doublequote=True,
              line_terminator='\n')

This just creates a plain csv file named "foo-YYYYMMDD.csv.gz", not an actual gzip archive.

I also tried adding this:

    # Turn to_csv statement into a variable
    d = df.to_csv('foo-%s.csv.gz' % todaysdatestring,
                  sep='|',
                  header=True,
                  index=False,
                  quoting=csv.QUOTE_ALL,
                  compression='gzip',
                  quotechar='"',
                  doublequote=True,
                  line_terminator='\n')

    # Write above variable to gzip
    with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wb') as output:
        output.write(d)

This fails as well. Any ideas?

+6
4 answers

Using df.to_csv() with the keyword argument compression='gzip' should lead to the creation of the gzip archive. I tested it using the same keywords as you, and it worked.

You may need to upgrade pandas: gzip compression in to_csv was not implemented before version 0.17.1, and passing compression='gzip' in earlier versions does not raise an error. It silently writes a regular, uncompressed csv. You can check which version of pandas you have by looking at the output of pd.__version__ .
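Since an older pandas can silently write plain text under a .gz name, it can help to check the file itself rather than trust the extension. The following is a minimal sketch using only the standard library; the helper name is_gzip_file and the two throwaway file names are made up for illustration. It relies on the fact that every gzip stream starts with the two bytes 0x1f 0x8b.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip_file(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Two throwaway files (hypothetical names) in a temporary directory.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "plain.csv.gz")  # .gz name, plain text inside
real_path = os.path.join(tmpdir, "real.csv.gz")    # genuinely compressed

with open(plain_path, "w") as f:
    f.write('"a"|"b"\n"1"|"2"\n')

with gzip.open(real_path, "wt") as f:
    f.write('"a"|"b"\n"1"|"2"\n')

plain_is_gzip = is_gzip_file(plain_path)
real_is_gzip = is_gzip_file(real_path)
print(plain_is_gzip, real_is_gzip)  # False True
```

Running is_gzip_file on the file that to_csv produced should tell you whether the compression argument was honoured by your pandas version.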

+10

From the documentation

    import gzip

    # In 'wb' mode gzip expects bytes, hence the b"..." literal on Python 3
    content = b"Lots of content here"
    with gzip.open('file.txt.gz', 'wb') as f:
        f.write(content)

With pandas:

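    import csv
    import gzip

    # With no file name, to_csv returns the csv text as a string
    content = df.to_csv(sep='|',
                        header=True,
                        index=False,
                        quoting=csv.QUOTE_ALL,
                        quotechar='"',
                        doublequote=True,
                        line_terminator='\n')  # newer pandas releases spell this lineterminator

    # 'wb' mode expects bytes, so encode the string on Python 3
    with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wb') as f:
        f.write(content.encode('utf-8'))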

The trick is that to_csv returns the csv text as a string if you don't pass a file name to it. You then simply write that string to the gzip file.
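As a variation on the same idea, here is a small sketch for Python 3 that opens the gzip file in text mode, so the string can be written without encoding it by hand. The sample DataFrame and the output file name are placeholders, not part of the original question, and the example then reads the archive back to confirm the contents survive.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the query result.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# With no path argument, to_csv returns the csv as a str.
content = df.to_csv(sep="|", index=False)

# Placeholder output location in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "foo.csv.gz")

# Text mode ("wt") lets gzip encode the str itself; newline="" keeps the
# line endings pandas produced instead of translating them a second time.
with gzip.open(out_path, "wt", encoding="utf-8", newline="") as f:
    f.write(content)

# Read the archive back to confirm the round trip.
with gzip.open(out_path, "rt", encoding="utf-8", newline="") as f:
    restored = f.read()

print(restored == content)
```

Note that this builds the whole csv in memory before compressing it, which is fine for small frames but may not suit very large ones.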

+5

This is done very easily with pandas.

 import pandas as pd 

Write a pandas DataFrame to disk as a gzip-compressed csv

 df.to_csv('dfsavename.csv.gz', compression='gzip') 

Read from disk

 df = pd.read_csv('dfsavename.csv.gz', compression='gzip') 
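To see the two calls working together, the sketch below writes a small DataFrame and reads it straight back. The sample data and the temporary directory are assumptions made for the example, and index=False is added so the row index is not written as an extra column. If the file were not really gzip-compressed, the read with compression='gzip' would be expected to fail rather than return the data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; any DataFrame would do.
df = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})

# Placeholder path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "dfsavename.csv.gz")

# Write the frame as a gzip-compressed csv, without the index column.
df.to_csv(path, index=False, compression="gzip")

# Read it back, telling pandas explicitly that the file is gzip-compressed.
df_back = pd.read_csv(path, compression="gzip")

print(df_back.equals(df))
```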

+3

    with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wb') as f:
        # to_csv returns a str; 'wb' mode needs bytes on Python 3
        f.write(df.to_csv(sep='|', index=False, quoting=csv.QUOTE_ALL).encode('utf-8'))

0