There are several compression filters you could use. Since HDF5 version 1.8.11, you can easily register third-party compression filters.
Regarding performance:
It probably depends on your access pattern: you want to choose the chunk shape of the dataset so that it matches how you read the data, otherwise performance will suffer a lot. For example, if you know that you typically access one column across all rows, you should define your chunk shape accordingly, e.g. (1, 9000). See here, here and here for some details.
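As a minimal sketch of how a chunk shape is declared (file name, dataset name and shapes here are made up for illustration), using h5py:

```python
import h5py
import numpy as np

# Hypothetical 9000 x 9000 dataset; pick the chunk shape to match
# your dominant access pattern.
with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset("data", shape=(9000, 9000),
                            dtype="f8", chunks=(1, 9000))
    dset[0, :] = np.random.rand(9000)  # fill one row as a demo

with h5py.File("example.h5", "r") as f:
    row = f["data"][0, :]  # this read touches exactly one chunk
```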
However, AFAIK pandas will usually load the entire HDF5 file into memory unless you use read_table with an iterator (see here) or do the partial IO yourself (see here), so it doesn't really benefit much from a well-chosen chunk size.
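A rough sketch of what the chunked/partial-IO route looks like in pandas (file name, key and chunk size are assumptions; the frame must be stored in the queryable "table" format):

```python
import pandas as pd

# Iterate over the file in chunks instead of loading it all at once.
for chunk in pd.read_hdf("data.h5", "df", chunksize=100_000):
    ...  # process each chunk here

# Or select an explicit row range with HDFStore.
with pd.HDFStore("data.h5", mode="r") as store:
    subset = store.select("df", start=0, stop=1000)
```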
However, you can still benefit from compression, because loading compressed data into memory and decompressing it on the CPU is probably faster than loading uncompressed data.
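For example, writing a compressed frame is a one-liner in pandas (names here are illustrative; complevel runs from 0, no compression, to 9):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 50))

# "table" format with Blosc compression via PyTables.
df.to_hdf("compressed.h5", key="df", format="table",
          complib="blosc", complevel=9)

df2 = pd.read_hdf("compressed.h5", "df")  # decompressed transparently
```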
Regarding your original question:
I would recommend having a look at Blosc. It is a multi-threaded meta-compressor library that supports various compression filters:
- BloscLZ: The default FastLZ-based internal compressor.
- LZ4: a compact, very popular and fast compressor.
- LZ4HC: a tweaked version of LZ4 that trades speed for better compression ratios.
- Snappy: a popular compressor used in many places.
- Zlib: a classic; slightly slower than the previous ones, but achieving better compression ratios.
They have different strengths, and the best approach is to try them on your own data and compare, as in the sketch below.
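A minimal benchmark sketch (synthetic data and made-up file names; random data compresses poorly, so substitute your real frames, and note that codec availability depends on how your PyTables/Blosc was built):

```python
import os
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 50))

# complib names as pandas/PyTables expect them.
for complib in ["blosc:blosclz", "blosc:lz4", "blosc:lz4hc",
                "blosc:snappy", "blosc:zlib"]:
    path = f"bench_{complib.replace(':', '_')}.h5"
    start = time.perf_counter()
    df.to_hdf(path, key="df", format="table",
              complib=complib, complevel=9)
    elapsed = time.perf_counter() - start
    size_mb = os.path.getsize(path) / 1e6
    print(f"{complib:14s} write={elapsed:5.2f}s size={size_mb:6.1f} MB")
```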