I am writing a large number of small data sets to an HDF5 file, and the resulting file size is about 10x what I would expect from a naive tabulation of the data I insert. My data is organized hierarchically as follows:
    group 0
        -> subgroup 0
            -> dataset (dimensions: 100 x 4, datatype: float)
            -> dataset (dimensions: 100, datatype: float)
        -> subgroup 1
            -> dataset (dimensions: 100 x 4, datatype: float)
            -> dataset (dimensions: 100, datatype: float)
        ...
    group 1
    ...
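For reference, this is roughly how I create the layout (a simplified h5py sketch; the file name, loop counts, and zero-filled data are placeholders, not my actual code):

    import numpy as np
    import h5py

    # Simplified sketch of the writing pattern; names and counts are placeholders.
    with h5py.File("example.h5", "w") as f:
        for i in range(10):                   # group 0, group 1, ...
            group = f.create_group("group{}".format(i))
            for j in range(100):              # subgroup 0, subgroup 1, ...
                sub = group.create_group("subgroup{}".format(j))
                sub.create_dataset("data2d", data=np.zeros((100, 4), dtype=np.float32))
                sub.create_dataset("data1d", data=np.zeros(100, dtype=np.float32))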
Each subgroup's data should occupy 500 * 4 bytes = 2000 bytes, ignoring overhead, and I do not store any attributes alongside the data. However, when testing, I found that each subgroup takes about 4 kB, roughly twice what I would expect. I understand that there is some overhead, but where does it come from, and how can I reduce it? Is it the representation of the group structure?
Additional information: if I increase the sizes of the two data sets in each subgroup to 1000 x 4 and 1000, each subgroup occupies about 22,250 bytes, rather than the 20,000 bytes of raw data I expect. That implies an overhead of about 2.2 kB per subgroup and is consistent with the results I got at the smaller data sizes. Is there a way to reduce this overhead?
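This is the kind of measurement I used to arrive at those numbers (a sketch, not my exact script; N and the file name are arbitrary):

    import os
    import numpy as np
    import h5py

    # Write N subgroups of two float32 datasets, then report bytes per subgroup.
    N = 1000
    with h5py.File("overhead_test.h5", "w") as f:
        group = f.create_group("group0")
        for j in range(N):
            sub = group.create_group("subgroup{}".format(j))
            sub.create_dataset("data2d", data=np.zeros((1000, 4), dtype=np.float32))
            sub.create_dataset("data1d", data=np.zeros(1000, dtype=np.float32))

    raw = N * (1000 * 4 + 1000) * 4             # 20,000 bytes of raw data per subgroup
    size = os.path.getsize("overhead_test.h5")
    print("bytes per subgroup:", size / N)
    print("overhead per subgroup:", (size - raw) / N)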