import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.zeros((1000000,1)))
df.to_csv('test.csv')
df.to_hdf('test.h5', key='df')
ls -sh test*
11M test.csv 16M test.h5
With an even larger dataset, the HDF5 file is larger than the CSV by an even wider margin. Using HDFStore directly, as shown below, makes no difference.
store = pd.HDFStore('test.h5')
# HDFStore stores pandas objects, so store the DataFrame (not a bare NumPy array) in table format
store.put('df', df, format='table')
store.close()
Edit: Never mind, my example was bad! Using non-trivial numbers instead of zeros changes the story.
from numpy.random import rand
import pandas as pd
df = pd.DataFrame(data=rand(10000000,1))
df.to_csv('test.csv')
df.to_hdf('test.h5', key='df')
ls -sh test*
260M test.csv 153M test.h5
Storing numbers as binary floats should take fewer bytes than storing them as strings with one character per digit. That is usually true; my first example was the exception because every value was "0.0". So few characters were needed per number that the string representation came out smaller than the floating-point representation.
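
The per-value arithmetic can be checked without writing any files. This is only a rough sketch: the helpers `text_bytes` and `binary_bytes` are hypothetical names made up for illustration, the text size assumes each value is written with Python's `repr` plus one newline, and it ignores the index column and any HDF5 file overhead.

```python
import numpy as np


def text_bytes(values):
    """Approximate size as text: repr() of each float plus one newline (assumption)."""
    return sum(len(repr(float(v))) + 1 for v in values)


def binary_bytes(values):
    """Size as raw float64: a fixed 8 bytes per value."""
    return np.asarray(values, dtype=np.float64).nbytes


zeros = np.zeros(1000)
randoms = np.random.default_rng(0).random(1000)

# "0.0" is 3 characters plus a newline: 4 bytes as text versus 8 bytes as binary.
print("zeros:  ", text_bytes(zeros), "bytes as text,", binary_bytes(zeros), "bytes as binary")

# Random floats need many digits to round-trip, so text is larger than binary.
print("randoms:", text_bytes(randoms), "bytes as text,", binary_bytes(randoms), "bytes as binary")
```

For the zeros, the text form takes 4 bytes per value against 8 for float64, which matches the direction of the first measurement. For the random values the text form is well over 8 bytes per value, which matches the second.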