Why are CSV files smaller than HDF5 files when recording with Pandas?

Question

import numpy as np
import pandas as pd

# One million rows of zeros in a single float64 column
df = pd.DataFrame(data=np.zeros((1000000, 1)))
df.to_csv('test.csv')
df.to_hdf('test.h5', key='df')

ls -sh test*
11M test.csv  16M test.h5

If I use an even larger dataset, the effect is even greater. Using HDFStore, as shown below, does not change anything.

store = pd.HDFStore('test.h5', table=True)
store['df'] = df  # HDFStore stores pandas objects, not raw NumPy arrays
store.close()

Edit: Never mind. The example is bad! Using some non-trivial numbers instead of zeros changes the story.

from numpy.random import rand
import pandas as pd

# Ten million random floats in [0, 1) in a single float64 column
df = pd.DataFrame(data=rand(10000000, 1))
df.to_csv('test.csv')
df.to_hdf('test.h5', key='df')

ls -sh test*
260M test.csv  153M test.h5

Storing numbers as binary floats should take fewer bytes than storing them as strings with one character per digit. That is usually true; the exception is my first example, in which every number was "0.0". So few characters were needed to write each number that the string representation came out smaller than the floating-point representation.
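To make the comparison concrete, here is a minimal sketch that measures one value both ways. It assumes the CSV text for a float is about as long as Python's `repr` of it and that the binary form is an 8-byte IEEE 754 double; the helper names `text_size` and `binary_size` are made up for this illustration and are not part of pandas.

```python
import random
import struct


def text_size(value: float) -> int:
    """Bytes needed to write the value as decimal text (one byte per ASCII character)."""
    return len(repr(value).encode("ascii"))


def binary_size(value: float) -> int:
    """Bytes needed to store the value as an 8-byte IEEE 754 double."""
    return len(struct.pack("<d", value))


zero = 0.0
sample = random.Random(0).random()  # seeded so the run is repeatable

for v in (zero, sample):
    print(f"{v!r}: text={text_size(v)} bytes, binary={binary_size(v)} bytes")
```

The zero takes fewer bytes as text than as a double, while a random float in [0, 1) usually needs well over eight characters, which is the flip seen between the two file listings.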

+4

python pandas csv hdf5 hdf

jeffalstott Mar 09 '15 at 4:11

2 Answers

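Briefly:

CSV files are "dumb": they store one character at a time, so if you print a (say, four-byte) float 1.0 to ten digits, you really use that many bytes. The good news is that CSV compresses well, so consider .csv.gz.
HDF5 is a meta-format, and the No Free Lunch theorem still applies: the metadata and entries have to be stored somewhere. That can make HDF5 files larger.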

: csv - . , hdf5 () , . .
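As a rough illustration of the compression point, the sketch below rebuilds text in the shape of the first example's CSV file (a header line followed by `index,0.0` rows) and gzips it with the standard library. The row count is reduced to keep it quick, and the exact layout of the text is an assumption about what `to_csv` writes for that DataFrame.

```python
import gzip

# Assumed layout of the first example's file: header ",0", then "index,0.0" rows.
n_rows = 100_000  # smaller than the question's 1,000,000 to keep the sketch fast
lines = [",0"] + [f"{i},0.0" for i in range(n_rows)]
csv_bytes = ("\n".join(lines) + "\n").encode("ascii")

compressed = gzip.compress(csv_bytes)

print(f"raw csv: {len(csv_bytes)} bytes")
print(f"gzipped: {len(compressed)} bytes")
```

With pandas itself, passing a filename ending in `.gz` to `to_csv` should give a gzip-compressed file, since the default `compression='infer'` picks the codec from the extension.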

+5

Dirk Eddelbuettel Mar 09 '15 at 4:17

chw21 · Accepted Answer · 2015-03-09T04:34:22+0000

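For .csv, your method stores characters like this:

999999,0.0<CR>

That is up to 11 characters per value. With 1 million values, that comes to close to 11 MB.

HDF5, on the other hand, seems to use 16 bytes for every value, regardless of what the value is. That makes 16 bytes * 1 000 000, which is roughly 16 MB.

If you store some random data instead of 0.0, the .csv quickly grows to 25 MB and more, while the HDF5 file stays the same size. And while the csv file loses precision, the HDF5 file keeps it.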
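The arithmetic above can be checked with a short sketch. It assumes a one-byte line ending, a header line of `,0`, and that the 16 bytes per row are an 8-byte integer index plus an 8-byte float value; that split is a guess, not something the answer states. The function names `csv_size` and `hdf5_payload_size` are hypothetical, and HDF5 metadata overhead is ignored.

```python
def csv_size(n_rows: int, value_text: str = "0.0") -> int:
    """Estimated CSV size in bytes: a ',0' header plus one 'index,value' line per row."""
    header = len(",0\n")
    rows = sum(len(f"{i},{value_text}\n") for i in range(n_rows))
    return header + rows


def hdf5_payload_size(n_rows: int, bytes_per_row: int = 16) -> int:
    """Estimated HDF5 data payload, assuming a fixed number of bytes per row."""
    return n_rows * bytes_per_row


n = 1_000_000
print(f"csv:  {csv_size(n) / 1e6:.1f} MB")
print(f"hdf5: {hdf5_payload_size(n) / 1e6:.1f} MB")
```

Passing a longer `value_text`, such as an 18-character random float, shows the CSV estimate climbing while the HDF5 estimate stays fixed.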
