Python / numpy data compression

I am looking to use the Amazon cloud for all my simulation needs. The resulting sim files are quite large, and I would like to move them to my local drive for ease of analysis, etc. Since you pay for the data you transfer out, I want to compress my sim files as much as possible. They are just numpy arrays saved as .mat files using:

    import scipy.io as sio
    # savemat expects a dict mapping variable names to arrays
    sio.savemat(filepath, {'data': arr}, do_compression=True)

So my question is: what is the best way to compress numpy arrays (they are currently stored in .mat files, but I can store them using any Python method) using Python compression, Linux compression, or both?

I am in a Linux environment and I am open to any type of file compression.
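If the arrays don't have to stay in .mat files, numpy's own compressed container (`np.savez_compressed`, zlib/DEFLATE under the hood) is a useful baseline to compare against. A minimal sketch, with a made-up file name and array standing in for the real sim output:

```python
import numpy as np

# Placeholder for a simulation result; patterned/sparse data compresses well
arr = np.zeros((1000, 1000))
arr[::10, ::10] = 1.0

# Write a zlib-compressed .npz archive
np.savez_compressed('sim.npz', data=arr)

# Load it back by the keyword name used when saving
loaded = np.load('sim.npz')['data']
assert (loaded == arr).all()
```

The `.npz` format keeps everything in Python, so there is no .mat round-trip or external tool needed.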

2 answers

Unless you know something special about the arrays (such as sparsity or some kind of pattern), you won't do much better than the default compression, perhaps with gzip on top of that. In fact, you may not even need to gzip the files if you download over HTTP and your server is configured to compress responses. Good lossless compression algorithms rarely differ by more than 10%.

If savemat works as advertised, you can get gzip compression in Python with:

    import scipy.io as sio
    import gzip

    # savemat accepts an open file object, so it can write straight into gzip
    f_out = gzip.open(filepath_dot_gz, 'wb')
    sio.savemat(f_out, {'data': arr}, do_compression=True)
    f_out.close()
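Filling in the snippet above with a placeholder array (savemat takes a dict of variable names, and both savemat and loadmat accept open file objects), a runnable round-trip sketch might look like:

```python
import gzip

import numpy as np
import scipy.io as sio

arr = np.random.rand(100, 100)  # stand-in for a sim array

# Write the .mat stream through a gzip file object
with gzip.open('sim.mat.gz', 'wb') as f_out:
    sio.savemat(f_out, {'data': arr}, do_compression=True)

# Read it back the same way
with gzip.open('sim.mat.gz', 'rb') as f_in:
    loaded = sio.loadmat(f_in)['data']
```

Note that `do_compression=True` already zlib-compresses the variables inside the .mat file, so the outer gzip layer usually buys very little on top; it mainly helps when the inner compression is off.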

In addition, LZMA (a.k.a. xz) provides very good compression on fairly sparse numpy arrays, although it is quite slow when compressing (and may require more memory as well).

On Ubuntu (for Python 2), it is installed with sudo apt-get install python-lzma; on Python 3, the lzma module is part of the standard library.

It is used like any other wrapper around a file object, something like this (here, for loading pickled data):

    from lzma import LZMAFile
    import cPickle as pickle

    if fileName.endswith('.xz'):
        dataFile = LZMAFile(fileName, 'r')
    else:
        dataFile = file(fileName, 'rb')  # 'rb', not the invalid mode 'ro'
    data = pickle.load(dataFile)
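For reference, the same pattern on Python 3 uses the built-in lzma module directly; a minimal sketch (the file name and payload below are made up):

```python
import lzma
import pickle

data = {'params': [1, 2, 3], 'label': 'run-01'}  # hypothetical payload

# Write a pickled object through xz compression
with lzma.open('run.pkl.xz', 'wb') as f:
    pickle.dump(data, f)

def load_pickle(file_name):
    """Load a pickle, transparently handling .xz files."""
    opener = lzma.open if file_name.endswith('.xz') else open
    with opener(file_name, 'rb') as f:
        return pickle.load(f)

restored = load_pickle('run.pkl.xz')
```

For numpy arrays specifically, pickling the array and piping it through `lzma.open` works the same way, at the cost of the slow compression noted above.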
