Run the same model on multiple GPUs, but send different user data to each GPU

Has anyone had success with efficient data parallelism, where you send the same model definition to multiple GPUs but different user data to each GPU?

dist-keras seems promising, but I would like to hear feedback on any approaches taken in this direction.

We have user behavioral data: about 100,000 users, 200 fields (single vectors), and 30,000 records per user. We built an RNN using Keras on top of TensorFlow to predict the next action (out of 20+ possible actions) for a single user. Training takes about 30 minutes on 1 GPU (there are 8 GPUs in my box). Now we would like to build models for all 100k users.
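To make the setup concrete, here is a sketch of how one user's 30,000 records might be windowed into RNN training pairs. This is a hypothetical illustration (the function name `make_sequences` and the window size are my assumptions, not from the question), shown at toy scale:

```python
import numpy as np

def make_sequences(features, actions, window):
    """Build next-action training pairs for one user (hypothetical helper).

    features: (n_records, n_fields) behavioral vectors
    actions:  (n_records,) integer action ids, e.g. 0..19 for 20 actions
    window:   number of past records the RNN sees per sample
    """
    # Each sample is a contiguous window of records...
    X = np.stack([features[i:i + window]
                  for i in range(len(features) - window)])
    # ...and the target is the action that follows that window.
    y = actions[window:]
    return X, y

# Toy scale: 100 records, 200 fields, 20 possible actions.
feats = np.random.rand(100, 200).astype(np.float32)
acts = np.random.randint(0, 20, size=100)
X, y = make_sequences(feats, acts, window=10)
# X has shape (90, 10, 200); y has shape (90,)
```

At the real scale (30,000 records per user), the same shaping would be done per user before fitting the per-user model.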

We were able to do data parallelism with a multi-GPU approach for a single user's data.

But since the model takes 30 minutes per user and there are 100,000 users, we want to split the data by user, train the same model on each user's data in a distributed way on a cluster, and generate a model output per user.
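Since each user's model is independent, the simplest form of this is process-level parallelism on one box: pin one worker process per GPU and assign users round-robin. A minimal sketch (the names `train_user` and `assign` are hypothetical; the real worker body would build and fit the per-user Keras model and save its weights):

```python
import os
from multiprocessing import Pool

N_GPUS = 8  # the box size mentioned in the question

def train_user(args):
    user_id, gpu_id = args
    # Pin this worker to one GPU; must be set before TensorFlow is imported
    # in the worker process, so each process sees exactly one device.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    # ... build the RNN, fit it on this user's data, save weights ...
    return user_id, gpu_id

def assign(user_ids, n_gpus=N_GPUS):
    # Round-robin: user i goes to GPU i % n_gpus.
    return [(u, i % n_gpus) for i, u in enumerate(user_ids)]

if __name__ == '__main__':
    jobs = assign(range(16))          # 16 dummy users for illustration
    with Pool(processes=N_GPUS) as pool:
        results = pool.map(train_user, jobs)
```

The same assignment idea extends to a cluster: a scheduler (e.g. Spark, since pyspark is tagged) hands each executor a shard of user ids and each executor runs this per-GPU loop locally.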

I am currently using Keras 2.1.x with TensorFlow 1.4.

Tags: python, tensorflow, distributed, keras, pyspark
1 answer

This is not quite what you are describing, but something that could work is to split each batch into fragments and train them on different GPUs in parallel, by taking your model and building a wrapper model that does this automatically.

So, say we want to parallelize the model by dividing each batch among the hardware during training.

```python
import tensorflow as tf
from keras.layers import Lambda, concatenate
from keras.models import Model


def make_parallel(model, gpu_count):
    """Make a parallelized model from the input model on the given GPU
    count that splits each input batch amongst the hardware.

    :param model: the model you want to make parallel
    :param gpu_count: the number of GPUs
    :return: the parallelized model
    """
    def get_slice(data, idx, parts):
        # Take the idx-th of `parts` equal slices of the batch.
        shape = tf.shape(data)
        size = tf.concat([shape[:1] // parts, shape[1:]], axis=0)
        stride = tf.concat([shape[:1] // parts, shape[1:] * 0], axis=0)
        start = stride * idx
        return tf.slice(data, start, size)

    outputs_all = [[] for _ in range(len(model.outputs))]

    # Place a copy of the model on each GPU, each getting a slice of the batch.
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i):
                inputs = []
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'idx': i,
                                                'parts': gpu_count})(x)
                    inputs.append(slice_n)

                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save all outputs so they can be joined afterwards.
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])

    # Merge the per-GPU outputs on the CPU along the batch axis.
    with tf.device('/cpu:0'):
        merged = [concatenate(output, axis=0) for output in outputs_all]

    return Model(inputs=model.inputs, outputs=merged)
```

(The answer originally used the Keras 1 `merge(..., mode='concat', concat_axis=0)` call and `Model(input=..., output=...)`; since you are on Keras 2.1, this version uses `concatenate` and the `inputs=`/`outputs=` keyword arguments instead.)
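Note that the slicing arithmetic in `get_slice` assumes the batch size is evenly divisible by `gpu_count`. Its behavior can be checked in plain NumPy with a hypothetical stand-in (this is not the TF code itself, just the same indexing logic):

```python
import numpy as np

def get_slice_np(data, idx, parts):
    # NumPy equivalent of get_slice: take the idx-th of `parts` equal,
    # contiguous chunks along the batch (first) axis.
    size = data.shape[0] // parts
    return data[idx * size:(idx + 1) * size]

batch = np.arange(32 * 4).reshape(32, 4)   # batch of 32 samples, 4 features
chunks = [get_slice_np(batch, i, 4) for i in range(4)]
# Concatenating the per-"GPU" chunks restores the original batch,
# which is exactly what the cpu:0 merge step does.
recombined = np.concatenate(chunks, axis=0)
```

If the batch size is not divisible by the GPU count, the trailing samples are silently dropped, so pick `batch_size` as a multiple of `gpu_count` when fitting.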

Can you report back the training speed results when training with this model?

