Why are CSV files smaller than HDF5 files when writing with Pandas?

import numpy as np
import pandas as pd

# One million zeros in a single float column
df = pd.DataFrame(data=np.zeros((1000000, 1)))
df.to_csv('test.csv')
df.to_hdf('test.h5', key='df')

ls -sh test*
11M test.csv  16M test.h5

If I use an even larger dataset, the effect is even greater. Using HDFStore, as shown below, does not change anything.

store = pd.HDFStore('test.h5')
# Store the DataFrame itself (a bare NumPy array is rejected), using the table format
store.put('df', df, format='table')
store.close()

Edit: Never mind, the example is bad! Using some non-trivial numbers instead of zeros changes the story.

from numpy.random import rand
import pandas as pd

# Ten million random floats in a single column
df = pd.DataFrame(data=rand(10000000, 1))
df.to_csv('test.csv')
df.to_hdf('test.h5', key='df')

ls -sh test*
260M test.csv  153M test.h5

Expressing numbers as floats should take fewer bytes than expressing them as strings of characters with one character per digit. This is generally true, with the exception of my first example, in which every number was "0.0". So few characters were needed to represent each number that the string representation came out smaller than the floating-point representation.
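To make the point above concrete, here is a minimal sketch that compares the text size and the binary size of a single value. It assumes the text writer emits Python's shortest round-trip repr and that the binary format uses 8-byte IEEE 754 doubles; the helper names text_size and binary_size are made up for this illustration and are not part of pandas.

```python
import random
import struct


def text_size(value: float) -> int:
    """Bytes needed to write the float as decimal ASCII text (one byte per character)."""
    return len(repr(value).encode("ascii"))


def binary_size(value: float) -> int:
    """Bytes needed to store the float as an IEEE 754 double."""
    return len(struct.pack("<d", value))


# Zeros are the special case: "0.0" is only three characters long.
print(text_size(0.0), binary_size(0.0))  # 3 8

# Random values in [0, 1) need many digits, so the text form is the larger one.
random.seed(0)
sample = [random.random() for _ in range(1000)]
avg_text_random = sum(text_size(v) for v in sample) / len(sample)
print(avg_text_random, binary_size(sample[0]))
```

With zeros the text form wins (3 bytes against 8), while for random values the average text length is well above 8 bytes, which matches the two file listings in the question.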


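For .csv, your method stores characters like this:

999999,0.0<CR>

That is up to 11 characters per value. With 1 million values, that comes to about 11 MB.

HDF5, it seems, stores 16 bytes per value regardless of what the value is (presumably an 8-byte float plus an 8-byte index entry). So 16 bytes * 1 000 000 values is about 16 MB.

Store something other than 0.0, such as random data, and the .csv quickly grows to 25 MB or more, while the HDF5 file stays the same size. And while the csv loses precision, HDF5 keeps it.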
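The arithmetic above can be checked with a rough back-of-the-envelope sketch. It assumes a one-byte line terminator, a three-byte header line (",0" plus newline) for the unnamed index and column 0, and 16 bytes of HDF5 payload per row (an 8-byte float plus an 8-byte integer index, which is my guess at the layout). HDF5 metadata overhead is ignored, and both function names are hypothetical.

```python
def csv_bytes_for_zeros(n_rows: int) -> int:
    """Estimated size of a CSV whose rows are 'i,0.0' for i in range(n_rows)."""
    header = len(",0\n")  # assumed header: empty index label, column name 0
    body = sum(len(f"{i},0.0\n") for i in range(n_rows))
    return header + body


def hdf5_payload_bytes(n_rows: int, bytes_per_row: int = 16) -> int:
    """Estimated raw payload: a fixed number of bytes per row, whatever the value."""
    return n_rows * bytes_per_row


n = 1_000_000
print(csv_bytes_for_zeros(n))  # 10888893
print(hdf5_payload_bytes(n))   # 16000000
```

Those two figures are roughly 10.4 MiB and 15.3 MiB, which ls -sh rounds up to the 11M and 16M shown in the question.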


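Briefly:

  • csv files are "dumb": they hold one character at a time, so if you write a (say, 8-byte) float 1.0 out to many digits, you really do use that many bytes. The good news is that csv compresses well, so consider .csv.gz.

  • hdf5 is a meta-format, and the No Free Lunch rule still applies: the entries and values have to be stored somewhere. That can make hdf5 bulkier.

But you are overlooking a bigger issue: csv is just text, with limited precision, whereas hdf5 is one of several binary (serialization) formats that store data at higher precision. In that respect it really is apples and oranges.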
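Both points can be illustrated with a small standard-library sketch: repetitive csv text shrinks a lot under gzip, and text keeps only the digits that were actually written. The row layout and the six-decimal format are assumptions chosen for the demonstration, not what pandas writes by default.

```python
import gzip

# 100,000 rows shaped like the question's first example: "i,0.0"
rows = "".join(f"{i},0.0\n" for i in range(100_000))
raw = rows.encode("ascii")
packed = gzip.compress(raw)
print(len(raw), len(packed))  # the compressed size is much smaller

# Text carries only the precision that was printed.
value = 1 / 3
short_text = f"{value:.6f}"          # "0.333333"
print(float(short_text) == value)    # False: digits were dropped
print(float(repr(value)) == value)   # True: the full repr round-trips
```

In pandas, giving to_csv a filename ending in .csv.gz should have the same effect, since the compression is inferred from the extension by default.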

