Efficient array compression

I tried using various data compression methods while saving numpy arrays to disk.

These 1D arrays contain sampled data recorded at a specific sampling rate (for example with a microphone, or with any other sensor): the underlying signal is essentially continuous (in the mathematical sense, of course; after sampling the data is discrete).

I tried with HDF5 (h5py):

    f.create_dataset("myarray1", data=myarray, compression="gzip", compression_opts=9)

but it’s rather slow, and the compression ratio is not the best we can expect.

I also tried using

 numpy.savez_compressed() 

but again, it may not be the best compression algorithm for such data (described earlier).

What would you choose for the best compression ratio on a numpy array, given data like this?

(I was thinking of something like lossless FLAC, which was originally designed for audio, but is there an easy way to apply such an algorithm to numpy data?)

+7
python arrays numpy compression lossless-compression
6 answers
  • Noise is incompressible. Thus, any part of your data that is noise goes into the compressed output 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) of 16, the remaining 24 - 16 = 8 bits of noise limit your maximum lossless compression ratio to 3:1, even if the (noiseless) part of the data compresses perfectly. Non-uniform noise is compressible to the extent of its non-uniformity; you probably want to look at the effective entropy of the noise to determine how compressible it is.

  • Compressing data is based on modelling it (partly to remove redundancy, but also partly so that you can separate signal from noise and discard the noise). For example, if you know that your data is band-limited to 10 MHz and you sample at 200 MHz, you can do an FFT, zero out the high frequencies, and store only the coefficients for the low frequencies (in this example: 10:1 compression). There is a whole field called “compressed sensing” that is related to this.

  • A practical suggestion, suitable for many kinds of reasonably continuous data: denoise → bandwidth-limit → delta-compress → gzip (or xz, etc.). The denoising can be the same as the bandwidth limiting, or a nonlinear filter such as a running median. The bandwidth limiting can be implemented with an FIR/IIR filter. Delta compression is simply y[n] = x[n] - x[n-1] (see the sketch after this list).
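A minimal sketch of the delta-compress-then-gzip step, assuming a 1D integer array (the name `samples`, the placeholder signal, and the gzip level are illustrative, not from the original answer); the differencing is exactly invertible via a cumulative sum:

    import gzip
    import numpy as np

    # placeholder "slowly varying" signal; `samples` stands in for your own 1D integer array
    samples = np.cumsum(np.random.randint(-3, 4, size=100000)).astype(np.int32)

    # delta-compress: store the first sample followed by successive differences
    deltas = np.empty_like(samples)
    deltas[0] = samples[0]
    deltas[1:] = np.diff(samples)

    # a generic entropy coder does well on the small, repetitive deltas
    compressed = gzip.compress(deltas.tobytes(), compresslevel=9)
    print(samples.nbytes, len(compressed))

    # reconstruction is exact: a cumulative sum undoes the differencing
    restored = np.cumsum(deltas).astype(samples.dtype)
    assert np.array_equal(restored, samples)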

EDIT Illustration:

    from pylab import *
    import numpy
    import numpy.random
    import os.path
    import subprocess

    # create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB=16)
    N = 1000000
    data = (sin(2 * pi * linspace(0, N, N) / 100) * (1 << 23) +
            numpy.random.randn(N) * (1 << 7)).astype(int32)

    numpy.save('data.npy', data)
    print(os.path.getsize('data.npy'))              # 4000080 uncompressed size

    subprocess.call('xz -9 data.npy', shell=True)
    print(os.path.getsize('data.npy.xz'))           # 1484192 compressed size
    # 11.87 bits per sample, ~8 bits of that is noise

    data_quantized = data // (1 << 8)
    numpy.save('data_quantized.npy', data_quantized)
    subprocess.call('xz -9 data_quantized.npy', shell=True)
    print(os.path.getsize('data_quantized.npy.xz')) # 318380
    # still have 16 bits of signal, but only takes 2.55 bits per sample to store it
+8

What am I doing now:

    import gzip
    import numpy

    f = gzip.GzipFile("my_array.npy.gz", "w")
    numpy.save(file=f, arr=my_array)
    f.close()
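Reading it back works the same way (a minimal sketch under the same assumptions; `my_array` is whatever array was saved above):

    import gzip
    import numpy

    f = gzip.GzipFile("my_array.npy.gz", "r")
    my_array = numpy.load(f)
    f.close()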
+5

What constitutes the best compression (if any) depends heavily on the nature of the data. Many kinds of measurement data are virtually incompressible if truly lossless compression is required.

The pytables docs contain a lot of useful guidelines on data compression. They also detail the trade-offs with speed and so on; as it turns out, higher compression levels are usually a waste of time.

http://pytables.github.io/usersguide/optimization.html

Note that this is probably as good as it gets. For integer measurements, a combination of the shuffle filter with simple zlib compression usually works reasonably well. This filter very efficiently exploits the common situation where the high-order byte is usually 0 and is only included to guard against overflow.
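A rough sketch of that combination with pytables (the file and array names are invented for illustration; the Filters options shown are the standard pytables ones):

    import numpy as np
    import tables

    myarray = np.arange(1000000, dtype=np.int32)  # stand-in for the integer measurements

    # zlib plus the shuffle filter: the bytes of each sample are regrouped so the
    # mostly-zero high-order bytes end up together and compress very well
    filters = tables.Filters(complevel=5, complib='zlib', shuffle=True)

    with tables.open_file('measurements.h5', mode='w') as f:
        f.create_carray(f.root, 'myarray', obj=myarray, filters=filters)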

+1

First, for general datasets, the shuffle=True argument to create_dataset improves compression dramatically for roughly continuous datasets. It very cleverly rearranges the bits to be compressed so that (for continuous data) the bits change slowly, which means they can be compressed better. In my experience it slows down compression very slightly, but can substantially improve the compression ratio. It is not lossy, so you really get back exactly the data you put in.

Second, if you don't care about the precision so much, you can also use the scaleoffset argument to limit the number of bits stored. Be careful, though, because this is not what it might sound like. In particular, it is an absolute precision rather than a relative precision. For example, if you pass scaleoffset=8 but your data points are smaller than 1e-8, you'll just get zeros. Of course, if you've scaled the data to max out around 1 and don't think you can hear differences smaller than one part in a million, you can pass scaleoffset=6 and get great compression without much work.
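A small sketch of both options with h5py (the file and dataset names and the sample signal are invented; for floating-point data, scaleoffset is the number of decimal digits of absolute precision to keep):

    import numpy as np
    import h5py

    data = np.sin(np.linspace(0, 1000, 1000000))  # roughly continuous data, scaled to ~1

    with h5py.File('example.h5', 'w') as f:
        # lossless: gzip plus the shuffle filter
        f.create_dataset('lossless', data=data, compression='gzip',
                         compression_opts=9, shuffle=True)

        # lossy alternative: keep only ~6 decimal digits of absolute precision
        f.create_dataset('truncated', data=data, compression='gzip',
                         compression_opts=9, scaleoffset=6)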

But for audio specifically, I expect you are right to want to use FLAC, because its developers have put a huge amount of thought into balancing compression with the preservation of distinguishable details. You can convert to WAV with scipy and from there to FLAC.
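One possible route, sketched under the assumption that the samples fit into 16-bit integers and that the external flac command-line encoder is installed (the file name, rate, and scaling are illustrative):

    import subprocess
    import numpy as np
    from scipy.io import wavfile

    rate = 44100                                                    # assumed sampling rate
    signal = np.sin(2 * np.pi * 440 * np.arange(rate * 5) / rate)   # stand-in data in [-1, 1]

    # scale to 16-bit PCM and write a WAV container
    pcm = np.round(signal * 32767).astype(np.int16)
    wavfile.write('signal.wav', rate, pcm)

    # hand the WAV file to the external FLAC encoder
    subprocess.call(['flac', '--best', '-f', 'signal.wav'])  # produces signal.flac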

+1

You might want to try blz. It can efficiently compress binary data.

    import blz

    # this stores the array compressed in memory
    blz.barray(myarray)

    # this stores the array on disk
    blz.barray(myarray, rootdir='arrays')

It stores arrays either on disk or compressed in memory. Compression is based on blosc. See the scipy video for a bit of context.

0

Saving to an HDF5 file with compression can be very fast and efficient: it all depends on the compression algorithm, and on whether you want it to be fast while saving, while reading it back, or both. And, of course, on the data itself, as explained above. GZIP tends to be somewhere in the middle, but with a low compression ratio. BZIP2 is slow on both ends, although with a better ratio. BLOSC is one of the algorithms I have found that gets quite good compression and is fast on both ends. The downside of BLOSC is that it is not implemented in all HDF5 installations, so your program may not be portable. You always need to run at least a few tests to select the best configuration for your needs.
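With pytables, switching among those algorithms is just a matter of the complib argument, so such tests are easy to script. A small sketch (the file and array names are invented, and 'blosc' is only usable where the Blosc filter is available, as noted above):

    import numpy as np
    import tables

    data = np.random.randn(1000000)

    # write the same array once per compressor to compare file sizes and timings
    for complib in ('zlib', 'bzip2', 'blosc'):
        filters = tables.Filters(complevel=5, complib=complib)
        with tables.open_file('test_%s.h5' % complib, mode='w') as f:
            f.create_carray(f.root, 'data', obj=data, filters=filters)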

0
