What is the recommended compression for HDF5 for fast read / write (in Python / pandas)?

Question

I have read several times that enabling compression in HDF5 can improve read / write performance.

I wonder what the ideal settings would be to get good read / write performance with:

data_df.to_hdf(..., format='fixed', complib=..., complevel=..., chunksize=...)

I already use the fixed format (i.e. h5py), as it is faster than table. I have strong processors and do not care much about disk space.

I often store DataFrames with float64 and str columns in files of roughly 2500 rows x 9000 columns.

pandas compression hpc hdf5 h5py

Mark horvath Jul 13 '15 at 12:13

1 answer

Ümit · Accepted Answer · 2015-07-14T08:39:36+0000

There are several compression filters that you could use. Since HDF5 version 1.8.11, you can easily register third-party compression filters.

Regarding performance:

It probably depends on your access pattern: you want to define chunk dimensions that align with how you read the data, otherwise performance will suffer a lot. For example, if you usually read one column across all rows, each chunk should span a full column, i.e. (2500, 1) for a dataset laid out as 2500 rows x 9000 columns. See here, here and here for some details.
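To make the chunk-shape argument concrete, here is a rough sketch that counts how many chunks a read has to touch. It assumes the dataset is laid out as rows x columns and that the selection starts on a chunk boundary; chunks_touched is a hypothetical helper written for this illustration, not part of HDF5, h5py or pandas.

```python
from math import ceil, prod


def chunks_touched(selection_shape, chunk_shape):
    """Count the chunks overlapped by a selection that starts at a chunk boundary."""
    return prod(ceil(s / c) for s, c in zip(selection_shape, chunk_shape))


# Reading one full column of a 2500 x 9000 dataset (rows x columns).
one_column = (2500, 1)

print(chunks_touched(one_column, (2500, 1)))   # 1
print(chunks_touched(one_column, (100, 100)))  # 25
print(chunks_touched(one_column, (1, 9000)))   # 2500
```

Since HDF5 reads and decompresses whole chunks, a column read against row-shaped chunks ends up pulling in every chunk of the dataset.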

However, AFAIK pandas will usually end up loading the entire HDF5 file into memory unless you use read_table with an iterator (see here) or do the partial IO yourself (see here), so it does not really benefit much from a well-chosen chunk size.
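For reference, a minimal sketch of what partial reads can look like in pandas. It assumes PyTables is installed and uses pd.read_hdf with start / stop and chunksize on a file written with format='table'; the tiny DataFrame and the file name are made up for the illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small made-up frame: 1000 rows x 5 float64 columns.
values = np.arange(1000 * 5, dtype="float64").reshape(1000, 5)
df = pd.DataFrame(values, columns=list("abcde"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.h5")
    # Iterating in chunks needs the table format rather than fixed.
    df.to_hdf(path, key="data", mode="w", format="table")

    # Partial IO: read only rows 100..109.
    part = pd.read_hdf(path, key="data", start=100, stop=110)

    # Iterator: stream the file 250 rows at a time instead of loading it whole.
    chunk_lengths = [len(chunk) for chunk in pd.read_hdf(path, key="data", chunksize=250)]

print(len(part), chunk_lengths)
```

Note that this trades the speed of the fixed format for the ability to read pieces of the file, so it only pays off when the whole DataFrame is not needed at once.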

Nevertheless, you might still benefit from compression, because loading compressed data into memory and decompressing it on the CPU is probably faster than loading the uncompressed data.

Regarding your original question:

I would recommend taking a look at Blosc. It is a multi-threaded meta-compressor library that supports several compressors:

BloscLZ: the internal default compressor, based on FastLZ.
LZ4: a compact, very popular and fast compressor.
LZ4HC: a tweaked version of LZ4 that produces better compression ratios at the expense of speed.
Snappy: a popular compressor used in many places.
Zlib: a classic; somewhat slower than the previous ones, but achieving better compression ratios.

They have different advantages, and it is best to try and compare them with your data and see which ones work best.
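A rough sketch of such a comparison, assuming PyTables is installed with Blosc support. The benchmark function, the complevel of 5 and the small random DataFrame are all placeholders chosen for this illustration; real data (especially with str columns) will compress and time very differently.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def benchmark(df, complibs, complevel=5):
    """Time one write and one read of df per compressor (hypothetical helper)."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for i, complib in enumerate(complibs):
            path = os.path.join(tmp, f"bench_{i}.h5")
            # complib=None means no compression, so use level 0 in that case.
            level = complevel if complib is not None else 0

            start = time.perf_counter()
            df.to_hdf(path, key="data", mode="w", format="fixed",
                      complib=complib, complevel=level)
            write_s = time.perf_counter() - start

            start = time.perf_counter()
            pd.read_hdf(path, key="data")
            read_s = time.perf_counter() - start

            results[complib or "none"] = {
                "write_s": write_s,
                "read_s": read_s,
                "bytes": os.path.getsize(path),
            }
    return results


# Made-up float64 data, much smaller than the 2500 x 9000 case in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((500, 200)).round(3))

results = benchmark(df, [None, "zlib", "blosc:blosclz", "blosc:lz4",
                         "blosc:lz4hc", "blosc:zlib"])
for name, stats in results.items():
    print(name, stats)
```

A single timed run per setting is noisy; repeating each measurement several times and running it on the actual DataFrame gives a more trustworthy ranking.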
