H5py: The proper way to slice arrays of arrays

I'm a little confused here:

As I understand it, the h5py .value method reads the entire dataset into memory as a numpy array; it is slow and discouraged (and should usually be replaced with [()]). The correct way is to use a numpy-style slice.

However, I get annoying results (with h5py 2.2.1):

    >>> import h5py
    >>> import numpy as np
    >>> file = h5py.File("test.hdf5", 'w')

    # Just fill a test file with a numpy array as the test dataset
    >>> file["test"] = np.arange(0, 300000)

    # This is TERRIBLY slow?!
    >>> file["test"][range(0, 300000)]
    array([     0,      1,      2, ..., 299997, 299998, 299999])

    # This is fast
    >>> file["test"].value[range(0, 300000)]
    array([     0,      1,      2, ..., 299997, 299998, 299999])

    # This is also fast
    >>> file["test"].value[np.arange(0, 300000)]
    array([     0,      1,      2, ..., 299997, 299998, 299999])

    # This crashes
    >>> file["test"][np.arange(0, 300000)]

I assume that my dataset is so small that .value does not hurt performance significantly, but how can the first option be so slow? Which version is preferred here?

Thanks!

UPDATE It seems I was not clear enough, sorry. I know that .value copies the entire dataset into memory, while slicing extracts only the appropriate sub-part. What I am wondering is why slicing within the file is slower than copying the whole array and then slicing it in memory. I always thought hdf5/h5py was implemented specifically so that reading sliced sub-parts would always be the fastest option.
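For reference, a minimal timing sketch of the three access patterns (the file name, dataset name, and sizes are taken from the snippet above; [()] is used in place of the deprecated .value, and the fancy-index list is kept deliberately small because it gets dramatically slower as it grows):

    import timeit

    import h5py
    import numpy as np

    # Recreate the test file from the question.
    with h5py.File("test.hdf5", "w") as f:
        f["test"] = np.arange(0, 300000)

    f = h5py.File("test.hdf5", "r")
    dset = f["test"]
    idx = np.arange(0, 10000)  # deliberately small index list

    # Plain slice over the whole dataset: one native HDF5 read, fast.
    print("full slice    :", timeit.timeit(lambda: dset[0:300000], number=10))

    # Read everything with [()], then fancy-index the numpy array in memory: also fast.
    print("read + index  :", timeit.timeit(lambda: dset[()][idx], number=10))

    # Fancy indexing directly on the dataset: done in Python by h5py, much slower.
    print("fancy on dset :", timeit.timeit(lambda: dset[idx], number=10))

    f.close()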

+8
python numpy h5py
3 answers

For quick slices with h5py, stick with the plain-vanilla notation:

 file['test'][0:300000] 

or, for example, reading every other element:

 file['test'][0:300000:2] 

Simple slicing (slice objects and single integer indices) should be very fast, as it translates directly into an HDF5 hyperslab selection.

The expression file['test'][range(300000)] invokes h5py's version of "fancy indexing", namely indexing via an explicit list of indices. There is no native way to do this in HDF5, so h5py implements a (slower) method in Python, which unfortunately has abysmal performance when the list has more than about 1000 elements. The same goes for file['test'][np.arange(300000)], which is interpreted the same way.
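When you really do need an arbitrary (sorted) list of indices, one common workaround, not stated in the answer itself but a sketch of the usual pattern, is to read a single contiguous slice that covers the indices and then fancy-index the resulting numpy array in memory:

    import h5py
    import numpy as np

    with h5py.File("test.hdf5", "r") as f:
        dset = f["test"]
        idx = np.array([5, 17, 123456, 299999])  # arbitrary example indices

        # Slow for large index lists: fancy indexing straight on the dataset.
        # subset = dset[idx]

        # Usually faster: one native hyperslab read, then in-memory fancy indexing.
        lo, hi = idx.min(), idx.max() + 1
        subset = dset[lo:hi][idx - lo]
        print(subset)  # -> [     5     17 123456 299999]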

See also:

[1] http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

[2] https://github.com/h5py/h5py/issues/293

+15

Based on the title of your post, the "proper" way to slice array datasets is to use the built-in slice notation.

All of your examples above are equivalent to file["test"][:]

[:] selects all elements of the array

More information on slicing notation can be found here: http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

I use hdf5 + Python often, and I have never had to use the .value method. When you access a dataset as in myarr = file["test"], Python handles the hdf5 dataset like a numpy array for you.
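For illustration, a small sketch of that slicing notation (file and dataset names assumed from the question):

    import h5py

    with h5py.File("test.hdf5", "r") as f:
        dset = f["test"]            # an h5py Dataset you can index like a numpy array
        everything = dset[:]        # [:] reads every element into a numpy array
        first_half = dset[:150000]  # a slice reads only the selected sub-range
        every_other = dset[::2]     # strided slices are also handled natively
        print(everything.shape, first_half.shape, every_other.shape)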

+2

The .value method copies the data into memory as a numpy array. Try comparing type(file["test"]) with type(file["test"].value): the former should be an HDF5 dataset, the latter a numpy array.

I am not familiar enough with the internals of h5py or HDF5 to say exactly why some dataset operations are slow; but the reason these two behave differently is that in one case you are slicing a numpy array in memory, and in the other you are slicing an HDF5 dataset on disk.
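A minimal sketch of that type comparison (using [()] alongside .value, since .value is deprecated and removed in newer h5py releases):

    import h5py

    with h5py.File("test.hdf5", "r") as f:
        dset = f["test"]
        print(type(dset))      # <class 'h5py._hl.dataset.Dataset'> -- data stays on disk
        print(type(dset[()]))  # <class 'numpy.ndarray'> -- whole dataset read into memory
        # On h5py versions that still provide it, dset.value returns the same numpy array.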

+2
