Among the things I've tried so far, the PyTables solution is currently the best, followed by a solution that uses numpy's support for memmapped arrays. The PyTables solution is not straightforward, though: if you use a shuffled array of integers to index the PyTables array directly, it's very slow. The following two-step process is much faster:
- Select a random subset of the array using a boolean index array (one way to build such an index is sketched after this list). This should be done chunkwise, reading the chunks in order, because passing the index array directly to the PyTables array is slow:
  - Preallocate a numpy array and create a list of slices that break the PyTables array into chunks.
  - Read each chunk entirely into memory, then use the corresponding block of the index array to select the correct values for that chunk.
  - Store the selected values in the preallocated array.
- Then shuffle the preallocated array.
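For reference, building such a boolean index is straightforward; here is a minimal sketch (the helper name `random_bool_index` and the example sizes are just placeholders):

```python
import numpy

def random_bool_index(n_rows, n_select):
    '''Build a boolean index that marks `n_select` of `n_rows` rows,
    chosen uniformly at random without replacement.'''
    ix = numpy.zeros(n_rows, dtype=bool)
    ix[numpy.random.choice(n_rows, size=n_select, replace=False)] = True
    return ix

# Example: select 100000 rows out of 1 million.
ix = random_bool_index(1000000, 100000)
assert ix.sum() == 100000
```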
This process produces a permutation that is just as random as a normal shuffle. If that doesn't seem obvious, consider this: (n choose x) * x! = x! * n! / (x! * (n - x)!) = n! / (n - x)!. This method is fast enough to do a shuffle-on-load for every training cycle. It's also able to compress the data down to ~650M, nearly a 90% reduction in size.
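If the identity alone doesn't convince you, a tiny brute-force check makes it concrete (the values of `n` and `x` here are arbitrary):

```python
import math
from itertools import permutations

# Selecting x of n rows and then shuffling those x rows produces every
# ordered selection of x items from n exactly once, so the number of
# possible outcomes is C(n, x) * x! = n! / (n - x)!.
n, x = 6, 3
brute_force = sum(1 for _ in permutations(range(n), x))   # count ordered selections
closed_form = math.factorial(n) // math.factorial(n - x)  # n! / (n - x)!
via_choose = math.comb(n, x) * math.factorial(x)          # C(n, x) * x!
print(brute_force, closed_form, via_choose)               # 120 120 120
```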
Here is my current implementation; this is called once for every training chunk. (The returned arrays are shuffled elsewhere.)
```python
def _h5_fast_bool_ix(self, h5_array, ix, read_chunksize=100000):
    '''Iterate over an h5 array chunkwise to select a random subset
    of the array. `h5_array` should be the array itself; `ix` should
    be a boolean index array with as many values as `h5_array` has
    rows; and you can optionally set the number of rows to read per
    chunk with `read_chunksize` (default is 100000). For some reason
    this is much faster than using `ix` to index the array directly.'''
    # Ceiling division, so a trailing partial chunk is not dropped when
    # the number of rows is not a multiple of read_chunksize.
    n_chunks = -(-h5_array.shape[0] // read_chunksize)
    slices = [slice(i * read_chunksize, (i + 1) * read_chunksize)
              for i in range(n_chunks)]
    # Preallocate the output: one row per True value in the index.
    a = numpy.empty((ix.sum(), h5_array.shape[1]), dtype=float)
    a_start = 0
    for sl in slices:
        # Read the chunk entirely into memory, then apply the matching
        # block of the boolean index to it.
        chunk = h5_array[sl][ix[sl]]
        a_end = a_start + chunk.shape[0]
        a[a_start:a_end] = chunk
        a_start = a_end
    return a
```
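For context, a typical call looks roughly like this. It is only a sketch: the file name, node path, subset size, and the `loader` object that the method is defined on are all placeholders.

```python
import numpy
import tables

h5_file = tables.open_file('training_data.h5', mode='r')  # placeholder file name
h5_array = h5_file.root.X                                  # placeholder node path

n_rows = h5_array.shape[0]
n_select = n_rows // 10                                    # e.g. take 10% of the rows

# Step 1: boolean index with exactly n_select True entries.
ix = numpy.zeros(n_rows, dtype=bool)
ix[numpy.random.choice(n_rows, size=n_select, replace=False)] = True

# Step 2: chunkwise selection, then an in-memory shuffle of the result.
# `loader` stands in for whatever object _h5_fast_bool_ix is defined on.
subset = loader._h5_fast_bool_ix(h5_array, ix)
numpy.random.shuffle(subset)                               # shuffles rows in place

h5_file.close()
```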
It still seems odd to me that an O(n^2) approach (iterating over the entire PyTables array for every chunk) is faster in this case than an O(n) approach (randomly selecting each row in one pass). But hey, it works. With a bit more indirection, this could be adapted to load arbitrary non-random permutations, but that adds more complexity than it's worth here.
The mmap solution is here for reference, for those who need a pure numpy solution for whatever reason. It shuffles all the data in about 25 minutes, while the PyTables solution above manages the same in less than half that time. It should scale linearly as well, because mmap allows (relatively) efficient random access.
```python
import numpy
import os
import random

# Memory-map every input/output chunk file so nothing is loaded eagerly.
X = []
Y = []
for filename in os.listdir('input'):
    X.append(numpy.load(os.path.join('input', filename), mmap_mode='r'))
for filename in os.listdir('output'):
    Y.append(numpy.load(os.path.join('output', filename), mmap_mode='r'))

# Build a global list of (chunk, row) coordinates and shuffle it.
indices = [(chunk, row) for chunk, rows in enumerate(X)
           for row in range(rows.shape[0])]
random.shuffle(indices)

# Write the rows back out in shuffled order, split into new chunks.
newchunks = 50
newchunksize = len(indices) // newchunks
for i in range(0, len(indices), newchunksize):
    print(i)
    rows = [X[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('X_shuffled_' + str(i), numpy.array(rows))
    rows = [Y[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('Y_shuffled_' + str(i), numpy.array(rows))
```
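Once the shuffled chunks are written out, they can be memmapped again for training. A rough sketch (the glob patterns assume the naming scheme above; the inner loop is a stand-in for the real training step):

```python
import glob
import numpy

# Reopen the shuffled chunks memory-mapped so rows are read lazily.
# Lexicographic order is fine here because the data is already shuffled.
X_shuffled = [numpy.load(f, mmap_mode='r') for f in sorted(glob.glob('X_shuffled_*.npy'))]
Y_shuffled = [numpy.load(f, mmap_mode='r') for f in sorted(glob.glob('Y_shuffled_*.npy'))]

for x_chunk, y_chunk in zip(X_shuffled, Y_shuffled):
    for x_row, y_row in zip(x_chunk, y_chunk):
        pass  # feed (x_row, y_row) to the training loop
```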