Among the things I've tried so far, the PyTables solution is currently the best, followed by a solution that uses numpy's support for memmapped arrays. The PyTables solution is not straightforward, though: if you use a shuffled array of integers to index the PyTables array directly, it's very slow. The following two-step process is much faster:
- Select a random subset of the array using a boolean index array (one way to build such an index is sketched after this list). This should be done chunkwise, reading the chunks in order, because passing the index array directly to the PyTables array is slow:
  - Preallocate a numpy array and create a list of slices that break the PyTables array into chunks.
  - Read each chunk entirely into memory, then use the corresponding block of the index array to select the correct values for that chunk.
  - Store the selected values in the preallocated array.
- Then shuffle the preallocated array.
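For reference, building such a boolean index is straightforward; here is a minimal sketch (the helper name `random_bool_index` and the example sizes are just placeholders):

```python
import numpy

def random_bool_index(n_rows, n_select):
    '''Build a boolean index that marks `n_select` of `n_rows` rows,
    chosen uniformly at random without replacement.'''
    ix = numpy.zeros(n_rows, dtype=bool)
    ix[numpy.random.choice(n_rows, size=n_select, replace=False)] = True
    return ix

# Example: select 100000 rows out of 1 million.
ix = random_bool_index(1000000, 100000)
assert ix.sum() == 100000
```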
This process produces a permutation that is just as random as a normal shuffle. If that doesn't seem obvious, consider this: (n choose x) * x! = x! * n! / (x! * (n - x)!) = n! / (n - x)!. This method is fast enough to do a shuffle-on-load for every training cycle. It's also able to compress the data down to ~650M, nearly a 90% reduction in size.
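If the identity alone doesn't convince you, a tiny brute-force check makes it concrete (the values of `n` and `x` here are arbitrary):

```python
import math
from itertools import permutations

# Selecting x of n rows and then shuffling those x rows produces every
# ordered selection of x items from n exactly once, so the number of
# possible outcomes is C(n, x) * x! = n! / (n - x)!.
n, x = 6, 3
brute_force = sum(1 for _ in permutations(range(n), x))   # count ordered selections
closed_form = math.factorial(n) // math.factorial(n - x)  # n! / (n - x)!
via_choose = math.comb(n, x) * math.factorial(x)          # C(n, x) * x!
print(brute_force, closed_form, via_choose)               # 120 120 120
```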
Here is my current implementation; this is called once for every training chunk. (The returned arrays are shuffled elsewhere.)
```python
def _h5_fast_bool_ix(self, h5_array, ix, read_chunksize=100000):
    '''Iterate over an h5 array chunkwise to select a random subset
    of the array. `h5_array` should be the array itself; `ix` should
    be a boolean index array with as many values as `h5_array` has
    rows; and you can optionally set the number of rows to read per
    chunk with `read_chunksize` (default is 100000). For some reason
    this is much faster than using `ix` to index the array directly.'''
    # Ceiling division, so a trailing partial chunk is not dropped when
    # the number of rows is not a multiple of read_chunksize.
    n_chunks = -(-h5_array.shape[0] // read_chunksize)
    slices = [slice(i * read_chunksize, (i + 1) * read_chunksize)
              for i in range(n_chunks)]
    # Preallocate the output: one row per True value in the index.
    a = numpy.empty((ix.sum(), h5_array.shape[1]), dtype=float)
    a_start = 0
    for sl in slices:
        # Read the chunk entirely into memory, then apply the matching
        # block of the boolean index to it.
        chunk = h5_array[sl][ix[sl]]
        a_end = a_start + chunk.shape[0]
        a[a_start:a_end] = chunk
        a_start = a_end
    return a
```
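For context, a typical call looks roughly like this. It is only a sketch: the file name, node path, subset size, and the `loader` object that the method is defined on are all placeholders.

```python
import numpy
import tables

h5_file = tables.open_file('training_data.h5', mode='r')  # placeholder file name
h5_array = h5_file.root.X                                  # placeholder node path

n_rows = h5_array.shape[0]
n_select = n_rows // 10                                    # e.g. take 10% of the rows

# Step 1: boolean index with exactly n_select True entries.
ix = numpy.zeros(n_rows, dtype=bool)
ix[numpy.random.choice(n_rows, size=n_select, replace=False)] = True

# Step 2: chunkwise selection, then an in-memory shuffle of the result.
# `loader` stands in for whatever object _h5_fast_bool_ix is defined on.
subset = loader._h5_fast_bool_ix(h5_array, ix)
numpy.random.shuffle(subset)                               # shuffles rows in place

h5_file.close()
```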
It still seems odd to me that an O(n^2) approach (iterating over the entire PyTables array for every chunk) is faster in this case than an O(n) approach (randomly selecting each row in one pass). But hey, it works. With a bit more indirection, this could be adapted to load arbitrary non-random permutations, but that adds more complexity than it's worth here.
The mmap solution is here for reference, for those who need a pure numpy solution for whatever reason. It shuffles all the data in about 25 minutes, while the PyTables solution above manages the same in less than half that time. It should scale linearly as well, because mmap allows (relatively) efficient random access.
```python
import numpy
import os
import random

# Memory-map every input/output chunk file so nothing is loaded eagerly.
X = []
Y = []
for filename in os.listdir('input'):
    X.append(numpy.load(os.path.join('input', filename), mmap_mode='r'))
for filename in os.listdir('output'):
    Y.append(numpy.load(os.path.join('output', filename), mmap_mode='r'))

# Build a global list of (chunk, row) coordinates and shuffle it.
indices = [(chunk, row) for chunk, rows in enumerate(X)
           for row in range(rows.shape[0])]
random.shuffle(indices)

# Write the rows back out in shuffled order, split into new chunks.
newchunks = 50
newchunksize = len(indices) // newchunks
for i in range(0, len(indices), newchunksize):
    print(i)
    rows = [X[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('X_shuffled_' + str(i), numpy.array(rows))
    rows = [Y[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('Y_shuffled_' + str(i), numpy.array(rows))
```
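Once the shuffled chunks are written out, they can be memmapped again for training. A rough sketch (the glob patterns assume the naming scheme above; the inner loop is a stand-in for the real training step):

```python
import glob
import numpy

# Reopen the shuffled chunks memory-mapped so rows are read lazily.
# Lexicographic order is fine here because the data is already shuffled.
X_shuffled = [numpy.load(f, mmap_mode='r') for f in sorted(glob.glob('X_shuffled_*.npy'))]
Y_shuffled = [numpy.load(f, mmap_mode='r') for f in sorted(glob.glob('Y_shuffled_*.npy'))]

for x_chunk, y_chunk in zip(X_shuffled, Y_shuffled):
    for x_row, y_row in zip(x_chunk, y_chunk):
        pass  # feed (x_row, y_row) to the training loop
```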