This is not quite what you are describing, but one approach that could work is data parallelism: split each batch into fragments, train them on different GPUs in parallel, and build a wrapper model that does this automatically.
In other words, we take the model, place a replica of it on each GPU, and divide each input batch among the devices during training.
import tensorflow as tf
from keras.layers import Lambda, merge
from keras.models import Model


def make_parallel(model, gpu_count):
    """
    Make a parallelized model from the input model on the given GPU count
    by splitting each input batch amongst the hardware.

    :param model: The model you want to make parallel
    :param gpu_count: The number of GPUs to use
    :return: The parallelized model
    """
    def get_slice(data, idx, parts):
        # Take the idx-th slice of the batch (along axis 0)
        shape = tf.shape(data)
        size = tf.concat([shape[:1] // parts, shape[1:]], axis=0)
        stride = tf.concat([shape[:1] // parts, shape[1:] * 0], axis=0)
        start = stride * idx
        return tf.slice(data, start, size)

    outputs_all = [[] for i in range(len(model.outputs))]

    # Place a copy of the model on each GPU, each getting a slice of the batch
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i) as scope:
                inputs = []
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'idx': i, 'parts': gpu_count})(x)
                    inputs.append(slice_n)

                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save all outputs so they can be merged again later
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])

    # Merge the per-GPU outputs back into full batches on the CPU
    with tf.device('/cpu:0'):
        merged = [merge(output, mode='concat', concat_axis=0)
                  for output in outputs_all]

    return Model(input=model.inputs, output=merged)
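For reference, here is a minimal usage sketch. It assumes you already have a single-GPU Keras model called model, two GPUs available, and placeholder data x_train / y_train; the optimizer, loss, and hyperparameters are only examples.

    # Hypothetical usage: wrap an existing Keras model so each batch is
    # split across 2 GPUs. Keep batch_size divisible by gpu_count, since
    # get_slice silently drops any remainder of the batch.
    parallel_model = make_parallel(model, gpu_count=2)
    parallel_model.compile(optimizer='adam', loss='categorical_crossentropy')
    parallel_model.fit(x_train, y_train, batch_size=128, epochs=10)

Note that the code above targets the old Keras 1.x API; in Keras 2 the merge function was replaced by concatenate, and Model takes inputs=/outputs= keyword arguments instead of input=/output=.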
It would be interesting to hear back on what training speedup you get with this approach.