I would like to ask a question about compression performance as it relates to HDF5 file size.
I have two HDF5 files with the following properties. Each contains a single dataset called "data".
File A "data":
- Type: Scalar HDF5 Dataset
- No. of Dimensions: 2
- Dimension Size: 5094125 x 6
- Max. Dimension Size: Unlimited x Unlimited
- Data Type: 64-bit Floating Point
- Chunking: 10000 x 6
- Compression: gzip level = 7
File B "data":
- Type: Scalar HDF5 Dataset
- No. of Dimensions: 2
- Dimension Size: 6720 x 1000
- Max. Dimension Size: Unlimited x Unlimited
- Data Type: 64-bit Floating Point
- Chunking: 6000 x 1
- Compression: gzip level = 7
File size A: HDF5 ---- 19 MB, CSV ---- 165 MB
File size B: HDF5 ---- 60 MB, CSV ---- 165 MB
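For reference, both layouts can be reproduced with a sketch like the following (this assumes h5py; random data stands in for my real values, so the absolute file sizes will differ):

```python
import numpy as np
import h5py

rng = np.random.default_rng(0)

# File A: 5094125 x 6 dataset, chunks of 10000 x 6, gzip level 7
with h5py.File("file_a.h5", "w") as f:
    f.create_dataset(
        "data",
        data=rng.random((5094125, 6)),
        maxshape=(None, None),      # unlimited in both dimensions
        chunks=(10000, 6),
        compression="gzip",
        compression_opts=7,
    )

# File B: 6720 x 1000 dataset, chunks of 6000 x 1, gzip level 7
with h5py.File("file_b.h5", "w") as f:
    f.create_dataset(
        "data",
        data=rng.random((6720, 1000)),
        maxshape=(None, None),
        chunks=(6000, 1),
        compression="gzip",
        compression_opts=7,
    )
```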
Both compress very well compared with the CSV files. However, file A compresses to about 10% of the original CSV size, while file B only reaches about 30%.
I tried different chunk sizes to make file B as small as possible, but 30% seems to be the best achievable ratio (a sketch of the kind of chunk-size sweep I ran is included below). Why can file A achieve better compression while file B cannot?
If file B can also reach a similar ratio, what should the chunk size be?
Is there any rule for determining the optimal HDF5 chunk size for compression?
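The chunk-size experiments were along these lines (again assuming h5py; the chunk shapes and the random stand-in array are illustrative, not my exact script):

```python
import os
import numpy as np
import h5py

# Stand-in for the real 6720 x 1000 array in file B
arr_b = np.random.default_rng(0).random((6720, 1000))

# Candidate chunk shapes to compare (illustrative)
candidate_chunks = [(6000, 1), (6720, 1), (1000, 10), (100, 100), (672, 1000)]

for chunks in candidate_chunks:
    fname = f"file_b_{chunks[0]}x{chunks[1]}.h5"
    with h5py.File(fname, "w") as f:
        f.create_dataset("data", data=arr_b, chunks=chunks,
                         compression="gzip", compression_opts=7)
    size_mb = os.path.getsize(fname) / 1e6
    print(f"chunks={chunks}: {size_mb:.1f} MB")
```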
Thanks!
CT