I am writing a large number of small data sets to an HDF5 file, and the resulting file size is about 10x what I would expect from a naive tabulation of the data I insert. My data is organized hierarchically as follows:
    group 0
        -> subgroup 0
            -> dataset (dimensions: 100 x 4, datatype: float)
            -> dataset (dimensions: 100, datatype: float)
        -> subgroup 1
            -> dataset (dimensions: 100 x 4, datatype: float)
            -> dataset (dimensions: 100, datatype: float)
        ...
    group 1
    ...
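For reference, this is roughly how I create the layout (a simplified h5py sketch; the file name, loop counts, and zero-filled data are placeholders, not my actual code):

    import numpy as np
    import h5py

    # Simplified sketch of the writing pattern; names and counts are placeholders.
    with h5py.File("example.h5", "w") as f:
        for i in range(10):                   # group 0, group 1, ...
            group = f.create_group("group{}".format(i))
            for j in range(100):              # subgroup 0, subgroup 1, ...
                sub = group.create_group("subgroup{}".format(j))
                sub.create_dataset("data2d", data=np.zeros((100, 4), dtype=np.float32))
                sub.create_dataset("data1d", data=np.zeros(100, dtype=np.float32))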
Each subgroup's data should occupy 500 * 4 bytes = 2000 bytes, ignoring overhead, and I do not store any attributes alongside the data. However, when testing, I found that each subgroup takes about 4 kB, roughly twice what I would expect. I understand that there is some overhead, but where does it come from, and how can I reduce it? Is it the representation of the group structure?
Additional information: if I increase the sizes of the two data sets in each subgroup to 1000 x 4 and 1000, each subgroup occupies about 22,250 bytes, rather than the 20,000 bytes of raw data I expect. That implies an overhead of about 2.2 kB per subgroup and is consistent with the results I got at the smaller data sizes. Is there a way to reduce this overhead?
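This is the kind of measurement I used to arrive at those numbers (a sketch, not my exact script; N and the file name are arbitrary):

    import os
    import numpy as np
    import h5py

    # Write N subgroups of two float32 datasets, then report bytes per subgroup.
    N = 1000
    with h5py.File("overhead_test.h5", "w") as f:
        group = f.create_group("group0")
        for j in range(N):
            sub = group.create_group("subgroup{}".format(j))
            sub.create_dataset("data2d", data=np.zeros((1000, 4), dtype=np.float32))
            sub.create_dataset("data1d", data=np.zeros(1000, dtype=np.float32))

    raw = N * (1000 * 4 + 1000) * 4             # 20,000 bytes of raw data per subgroup
    size = os.path.getsize("overhead_test.h5")
    print("bytes per subgroup:", size / N)
    print("overhead per subgroup:", (size - raw) / N)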