I would like to ask a question about compression performance as it relates to HDF5 file size.
I have two HDF5 files with the following properties. Each contains a single dataset called "data".
File A "data":
- Type: Scalar HDF5 Dataset
- No. of Dimensions: 2
- Dimension Size: 5094125 x 6
- Max. Dimension Size: Unlimited x Unlimited
- Data Type: 64-bit Floating Point
- Chunking: 10000 x 6
- Compression: gzip level = 7
File B "data":
- Type: Scalar HDF5 Dataset
- No. of Dimensions: 2
- Dimension Size: 6720 x 1000
- Max. Dimension Size: Unlimited x Unlimited
- Data Type: 64-bit Floating Point
- Chunking: 6000 x 1
- Compression: gzip level = 7
File size A: HDF5 ---- 19 MB, CSV ---- 165 MB
File size B: HDF5 ---- 60 MB, CSV ---- 165 MB
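For reference, both layouts can be reproduced with a sketch like the following (this assumes h5py; random data stands in for my real values, so the absolute file sizes will differ):

```python
import numpy as np
import h5py

rng = np.random.default_rng(0)

# File A: 5094125 x 6 dataset, chunks of 10000 x 6, gzip level 7
with h5py.File("file_a.h5", "w") as f:
    f.create_dataset(
        "data",
        data=rng.random((5094125, 6)),
        maxshape=(None, None),      # unlimited in both dimensions
        chunks=(10000, 6),
        compression="gzip",
        compression_opts=7,
    )

# File B: 6720 x 1000 dataset, chunks of 6000 x 1, gzip level 7
with h5py.File("file_b.h5", "w") as f:
    f.create_dataset(
        "data",
        data=rng.random((6720, 1000)),
        maxshape=(None, None),
        chunks=(6000, 1),
        compression="gzip",
        compression_opts=7,
    )
```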
Both compress very well compared with the CSV files. However, file A compresses to about 10% of the original CSV size, while file B only reaches about 30%.
I tried different chunk sizes to make file B as small as possible, but 30% seems to be the best achievable ratio (a sketch of the kind of chunk-size sweep I ran is included below). Why can file A achieve better compression while file B cannot?
If file B can also reach a similar ratio, what should the chunk size be?
Is there any rule for determining the optimal HDF5 chunk size for compression?
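The chunk-size experiments were along these lines (again assuming h5py; the chunk shapes and the random stand-in array are illustrative, not my exact script):

```python
import os
import numpy as np
import h5py

# Stand-in for the real 6720 x 1000 array in file B
arr_b = np.random.default_rng(0).random((6720, 1000))

# Candidate chunk shapes to compare (illustrative)
candidate_chunks = [(6000, 1), (6720, 1), (1000, 10), (100, 100), (672, 1000)]

for chunks in candidate_chunks:
    fname = f"file_b_{chunks[0]}x{chunks[1]}.h5"
    with h5py.File(fname, "w") as f:
        f.create_dataset("data", data=arr_b, chunks=chunks,
                         compression="gzip", compression_opts=7)
    size_mb = os.path.getsize(fname) / 1e6
    print(f"chunks={chunks}: {size_mb:.1f} MB")
```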
Thanks!
CT