Read an HDF5 file into a pandas DataFrame with conditions

I have a huge HDF5 file and want to load part of it into a pandas DataFrame to perform some operations, keeping only the rows that match a condition.

I can better explain with an example:

The original HDF5 file will look something like this:

 A  B   C   D
 1  0  34  11
 2  0  32  15
 3  1  35  22
 4  1  34  15
 5  1  31   9
 1  0  34  15
 2  1  29  11
 3  0  34  15
 4  1  12  14
 5  0  34  15
 1  0  32  13
 2  1  34  15
 etc.

What I'm trying to do is load this into a pandas DataFrame exactly as it is, but only the rows where A is 1, 3, or 4.

So far, I can only load the whole HDF5 file using:

 store = pd.HDFStore('Resutls2015_10_21.h5')
 df = pd.DataFrame(store['results_table'])

I do not see how to include the where clause here.

2 answers

The HDF5 file must be written in the table format (as opposed to the fixed format) in order to be queried using the where argument of pd.read_hdf.
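To make the difference between the two formats concrete, here is a minimal sketch. The file names, the temporary directory and the small DataFrame are made up for illustration, and the exact exception type raised for a fixed-format file may differ between pandas versions, so the code catches both TypeError and ValueError.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data and file locations, for illustration only.
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [0, 0, 1, 1]})
tmpdir = tempfile.mkdtemp()
fixed_path = os.path.join(tmpdir, 'fixed.h5')
table_path = os.path.join(tmpdir, 'table.h5')

# Write the same data once in each format.
df.to_hdf(fixed_path, key='results_table', mode='w', format='fixed')
# For the table file, A is declared as a data column so it can be filtered on.
df.to_hdf(table_path, key='results_table', mode='w', format='table',
          data_columns=['A'])

# A fixed-format file has to be read in its entirety, so a where clause fails.
try:
    pd.read_hdf(fixed_path, key='results_table', where='A in [1, 3]')
    fixed_supports_where = True
except (TypeError, ValueError) as exc:
    fixed_supports_where = False
    print('fixed format rejected the query:', exc)

# The table-format file accepts the same where clause.
subset = pd.read_hdf(table_path, key='results_table', where='A in [1, 3]')
print(subset)
```

If the existing file turns out to be in the fixed format, it would have to be rewritten in the table format before it can be filtered on read.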

In addition, A must be declared as a data column via data_columns:

 df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=['A'], format='table') 

or, to make all columns queryable data columns:

 df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=True, format='table') 

Then you can use

 pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]') 

to select rows where the value in column A is 1, 3, or 4. For example,

 import numpy as np
 import pandas as pd

 df = pd.DataFrame({
     'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
     'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
     'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
     'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

 df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=['A'], format='table')

 print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))

gives

     A  B   C   D
 0   1  0  34  11
 2   3  1  35  22
 3   4  1  34  15
 5   1  0  34  15
 7   3  0  34  15
 8   4  1  12  14
 10  1  0  32  13

If you have a very long list of values, vals, you can use string formatting to build the where argument:

 where='A in {}'.format(vals) 
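As a rough sketch of how that formatted where string can be used end to end (the sample data, the temporary file path and the contents of vals are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data mirroring the table in the question.
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]})
path = os.path.join(tempfile.mkdtemp(), 'out.h5')
df.to_hdf(path, key='results_table', mode='w', format='table',
          data_columns=['A'])

# vals stands in for a long list of wanted values.
vals = [1, 3, 4]
# Precaution: convert to plain Python ints so the formatted list contains
# simple number literals even if vals originally held NumPy scalars.
vals = [int(v) for v in vals]

where = 'A in {}'.format(vals)
print(where)

result = pd.read_hdf(path, key='results_table', where=where)
print(result)
```

The original row labels are kept in the result, so the index shows which rows of the stored table matched.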

You can do this using pandas.read_hdf, with the optional where parameter.
For example: read_hdf('store_tl.h5', 'table', where=['index>2'])
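Since the question opens the file with pd.HDFStore, the same kind of where clause can also be passed to the store's select method. The following is only a sketch: the file path and sample data are invented, and the file is assumed to have been written in the table format with A as a data column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data and file location, for illustration only.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 1], 'B': [0, 0, 1, 1, 1, 0]})
path = os.path.join(tempfile.mkdtemp(), 'store_demo.h5')
df.to_hdf(path, key='results_table', mode='w', format='table',
          data_columns=['A'])

# Open the store read-only; the context manager closes it afterwards.
with pd.HDFStore(path, mode='r') as store:
    # Filter on the index, as in the read_hdf example above.
    by_index = store.select('results_table', where='index > 2')
    # Filter on a data column, as asked in the question.
    by_column = store.select('results_table', where='A in [1, 3, 4]')

print(by_index)
print(by_column)
```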
