I'm a little confused here:
As I understand it, the h5py .value property reads the entire dataset and loads it into a numpy array, which is slow and discouraged (and should usually be replaced by [()]). The correct way is to use a numpy-style slice.
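For reference, here is a minimal sketch of the idioms I mean (the file name and sizes are just examples):

import h5py
import numpy as np

with h5py.File("test.hdf5", "w") as f:
    f["test"] = np.arange(300000)

    # Discouraged: reads the whole dataset into a numpy array
    whole_old = f["test"].value

    # Equivalent replacement for .value: also reads the whole dataset
    whole_new = f["test"][()]

    # numpy-style slice: reads only the requested elements from the file
    part = f["test"][10:20]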
However, I get surprising results (with h5py 2.2.1):
>>> import h5py
>>> import numpy as np
>>> file = h5py.File("test.hdf5", 'w')

# Just fill a test file with a numpy array as a test dataset
>>> file["test"] = np.arange(0, 300000)

# This is TERRIBLY slow?!
>>> file["test"][range(0, 300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])

# This is fast
>>> file["test"].value[range(0, 300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])

# This is also fast
>>> file["test"].value[np.arange(0, 300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])

# This crashes
>>> file["test"][np.arange(0, 300000)]
I assume that my dataset is so small that .value does not hurt performance significantly, but how can the first option be so slow? And what is the preferred version here?
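In case it helps, this is roughly how I am comparing the variants (using the standard timeit module; the absolute numbers will of course vary by machine):

import timeit

import h5py
import numpy as np

# Create the test file once
with h5py.File("test.hdf5", "w") as f:
    f["test"] = np.arange(300000)

setup = 'import h5py; f = h5py.File("test.hdf5", "r"); dset = f["test"]'

# Contiguous slice: can be served as a single read from the file
print(timeit.timeit("dset[0:300000]", setup=setup, number=10))

# Whole-dataset read, then index in memory (the fast variant above)
print(timeit.timeit("dset[()][list(range(300000))]", setup=setup, number=10))

# Fancy indexing with a list of indices (the slow variant above)
print(timeit.timeit("dset[list(range(300000))]", setup=setup, number=1))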
Thanks!
UPDATE: It seems I was not clear enough, sorry. I know that .value copies the entire dataset into memory, while slicing only extracts the appropriate subset. What I am wondering is why slicing in the file is slower than copying the whole array and then slicing in memory. I always thought hdf5/h5py was implemented specifically so that reading a slice would always be fastest.
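To make concrete what I mean by "slicing in the file" versus "slicing in memory", the two access patterns I am comparing are roughly these (the dataset is the one from the example above):

import h5py
import numpy as np

with h5py.File("test.hdf5", "r") as f:
    dset = f["test"]
    idx = np.arange(0, 300000)

    # "Slicing in memory": read everything, then index the numpy array (fast)
    in_memory = dset[()][idx]

    # "Slicing in the file": hand the index array to h5py, which (as far as
    # I can tell) turns it into a per-element HDF5 selection
    # (slow, and crashes for me on 2.2.1)
    in_file = dset[idx]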
python numpy h5py
Jiayow