Mystery when storing a data frame containing strings in HDF with pandas

Here's something weird with pandas and HDF for Halloween:

    import pandas

    df = pandas.DataFrame([['a','b'] for i in range(1,1000)])
    store = pandas.HDFStore('test.h5')
    store['x'] = df
    store.close()

then

    ls -l test.h5
    -rw-r--r-- 1 arthur arthur 1072080 Oct 26 10:50 test.h5

1.1M? A bit large, but why not. Here is where things get really creepy:

    store = pandas.HDFStore('test.h5')  # open it again
    store['x'] = df  # do the same thing as before!
    store.close()

then

    ls -l test.h5
    -rw-r--r-- 1 arthur arthur 2122768 Oct 26 10:52 test.h5

You have now entered the Twilight Zone. Needless to say, the contents of the store are indistinguishable after the second operation, but each iteration makes the file a bit fatter.

This seems to happen only when strings are involved. Before I write a bug report, I would like to know if I am missing something here...

2 answers

This seems to be the reason: http://www.hdfgroup.org/hdf5-quest.html#del

That is one big wart of HDF5, wtf.
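If the cause is that space freed by overwriting a key is not handed back to the file, one workaround that needs no extra tools is to copy the live contents into a brand-new file and swap it in. Below is a minimal sketch of that idea, not a tested recipe. The function name compact_hdf_store and the ".compact" suffix are made up for illustration. It also assumes every object fits in memory and that the default (fixed) storage format is acceptable for the copy.

```python
import os


def compact_hdf_store(path):
    """Copy every key of an HDFStore into a fresh file, then swap it in."""
    # Imported here so the module loads even where pandas is missing;
    # HDFStore itself also needs PyTables ("tables") to be installed.
    import pandas as pd

    tmp_path = path + ".compact"  # hypothetical temporary file name
    with pd.HDFStore(path, mode="r") as src, pd.HDFStore(tmp_path, mode="w") as dst:
        for key in src.keys():
            # Each object is read fully into memory and written back
            # in the default storage format.
            dst.put(key, src[key])
    os.replace(tmp_path, path)  # replace the bloated file with the fresh copy
```

If the file only ever holds what you are about to write, opening it with pandas.HDFStore('test.h5', mode='w') instead of the default append mode starts from an empty file each time, so there is nothing left over to accumulate.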


Yes: "HDF5 is not a database." People often use ptrepack (part of PyTables) to “repack” an HDF5 file so that it no longer contains any dead bytes.
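To make that concrete, here is a rough sketch of driving ptrepack from Python. It assumes the ptrepack script is on your PATH (it is installed together with PyTables). The helper names ptrepack_command and repack_in_place and the ".repacked" suffix are invented for this example. The only ptrepack usage relied on is the plain "ptrepack source destination" form.

```python
import os
import shutil
import subprocess


def ptrepack_command(src, dst):
    """Build the command line that copies src into a new file dst."""
    return ["ptrepack", src, dst]


def repack_in_place(path):
    """Repack `path` through a temporary file and replace the original."""
    if shutil.which("ptrepack") is None:
        raise RuntimeError("ptrepack not found; it is installed with PyTables")
    tmp_path = path + ".repacked"  # hypothetical temporary file name
    # check=True raises CalledProcessError if ptrepack exits with an error,
    # in which case the original file is left untouched.
    subprocess.run(ptrepack_command(path, tmp_path), check=True)
    os.replace(tmp_path, path)
```

Calling repack_in_place('test.h5') after a batch of overwrites should leave a file holding only the live data, although how much space you get back will depend on your pandas and PyTables versions.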

