Keras: Batch Training on Multiple Large Datasets

This question addresses the general problem of training in Keras on several large files that are together too large to fit in GPU memory. I am using Keras 1.0.5 and I need a solution that does not require 1.0.6. One way to do this was described by fchollet here and here:

import pickle

# Create generator that yields (current features X, current labels y)
def BatchGenerator(files):
    for file in files:
        current_data = pickle.load(open(file, "rb"))   # load one dataset at a time
        X_train = current_data[:, :-1]                 # all columns except the last are features
        y_train = current_data[:, -1]                  # the last column holds the labels
        yield (X_train, y_train)

# Train the model on each dataset in turn
for epoch in range(n_epochs):
    for (X_train, y_train) in BatchGenerator(files):
        model.fit(X_train, y_train, batch_size=32, nb_epoch=1)

However, I am afraid that the state of the model is not preserved; rather, it looks as if the model is reinitialized not only between epochs, but also between datasets. Each "Epoch 1/1" below represents training on a different dataset:

~~~~~ Epoch 0 ~~~~~~

Epoch 1/1
295806/295806 [==============================] - 13s - loss: 15.7517
Epoch 1/1
407890/407890 [==============================] - 19s - loss: 15.8036
Epoch 1/1
383188/383188 [==============================] - 19s - loss: 15.8130

~~~~~ Epoch 1 ~~~~~~

Epoch 1/1
295806/295806 [==============================] - 14s - loss: 15.7517
Epoch 1/1
407890/407890
Epoch 1/1
383188/383188
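A minimal sanity check of this suspicion, assuming a compiled model and two small in-memory arrays X_small and y_small (hypothetical names, not part of the pipeline above), would be to fit twice on the same data and compare the reported losses: if the second call starts from and ends at a lower loss, the weights carried over; roughly identical losses would point to the model being reset.

# Sanity-check sketch (hypothetical X_small / y_small): fit twice on the
# same data. If state carries over between fit() calls, the second call
# should report a lower loss than the first; similar losses would suggest
# the model is being reinitialized.
h1 = model.fit(X_small, y_small, batch_size=32, nb_epoch=1, verbose=0)
h2 = model.fit(X_small, y_small, batch_size=32, nb_epoch=1, verbose=0)
print(h1.history['loss'], h2.history['loss'])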

I know that one can use model.fit_generator, but since the method above has been repeatedly suggested as a way of training in batches, I would like to know what I am doing wrong.

Thanks for your help,

Max

1 answer

It has been a while since I ran into this problem, but as far as I remember I used Keras's ability to feed data through Python generators, i.e. model.fit_generator(...) on a compiled Sequential model.

Sample code snippet (should be self-explanatory):

import pickle
import numpy as np
from keras.models import Sequential

def generate_batches(files, batch_size):
    counter = 0
    while True:                      # fit_generator expects a generator that never ends
        fname = files[counter]
        print(fname)
        counter = (counter + 1) % len(files)
        data_bundle = pickle.load(open(fname, "rb"))
        X_train = data_bundle[0].astype(np.float32)
        y_train = data_bundle[1].astype(np.float32)
        y_train = y_train.flatten()
        # Slice the current file into mini-batches and yield them one by one
        for cbatch in range(0, X_train.shape[0], batch_size):
            yield (X_train[cbatch:(cbatch + batch_size), :, :],
                   y_train[cbatch:(cbatch + batch_size)])

model = Sequential()
# ... add layers here ...
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

train_files = [train_bundle_loc + "bundle_" + str(cb) for cb in range(nb_train_bundles)]
gen = generate_batches(files=train_files, batch_size=batch_size)
history = model.fit_generator(gen, samples_per_epoch=samples_per_epoch,
                              nb_epoch=num_epoch, verbose=1, class_weight=class_weights)
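Note that the generator loops forever (while True); in Keras 1.x, fit_generator draws samples_per_epoch samples from it per epoch rather than waiting for it to stop. The snippet above leaves train_bundle_loc, nb_train_bundles, batch_size, num_epoch, samples_per_epoch and class_weights undefined; a minimal way to fill them in, with purely illustrative paths and values, would be:

# Illustrative values only -- adjust to your own bundles and problem.
train_bundle_loc = "./bundles/"   # assumed directory containing the pickled bundles
nb_train_bundles = 3              # assumed number of bundle files
batch_size = 32
num_epoch = 10
class_weights = None              # or e.g. {0: 1.0, 1: 5.0} for imbalanced classes

# One full pass over every bundle per epoch: count the samples once up front.
train_files = [train_bundle_loc + "bundle_" + str(cb) for cb in range(nb_train_bundles)]
samples_per_epoch = sum(pickle.load(open(f, "rb"))[0].shape[0] for f in train_files)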
