Tensorflow ResourceExhaustedError after the first batch

Question

Summary and test cases

The main problem is that TensorFlow throws an OOM error on a batch that is not the first, which is not what I expected. I therefore believe there is a memory leak, since memory is clearly not being freed after each batch.

num_units: 50, batch_size: 1000; fails OOM (gpu) before 1st batch as expected
num_units: 50, batch_size: 800; fails OOM (gpu) before 1st batch as expected
num_units: 50, batch_size: 750; fails OOM (gpu) after 10th batch (???)
num_units: 50, batch_size: 500; fails OOM (gpu) after 90th batch (???)
num_units: 50, batch_size: 300; fails OOM (gpu) after 540th batch (???)
num_units: 50, batch_size: 200; computer freezes after around 900 batches with 100% ram use
num_units: 50, batch_size: 100; passes 1 epoch -- may fail later (unknown)
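
As a rough way to read the test cases above, the sketch below multiplies each batch size by the number of batches completed before the failure. The `runs` list is copied from the list above; the function name is only illustrative, and the batch_size 200 entry is approximate ("around 900") and was a host freeze rather than a GPU OOM.

```python
# (batch_size, batches completed before failure), copied from the test cases above.
# The batch_size 200 figure is approximate and was a RAM freeze, not a GPU OOM.
runs = [(750, 10), (500, 90), (300, 540), (200, 900)]


def samples_before_failure(batch_size, batches):
    """Total number of samples processed before the run failed."""
    return batch_size * batches


totals = {bs: samples_before_failure(bs, n) for bs, n in runs}
for bs, total in totals.items():
    print(f"batch_size={bs}: about {total} samples before failure")
```

Neither the number of batches nor the number of samples before failure is constant across runs, so a leak of a fixed amount per batch or per sample does not obviously fit these figures. Since sequence lengths vary per batch, this is only a coarse check.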

Explanation:

Essentially, it runs 144 batches with a batch size of 500 before crashing on the 145th batch, which seems strange. If it cannot allocate enough memory for the 145th batch, why does it work for the first 144? The behavior can be reproduced.

Please note that each batch varies in size: each one has dimensions [BATCH_SIZE, MAX_SEQUENCE_LENGTH], and the maximum sequence length changes depending on which sequences are sampled. However, the program does not crash on the largest batch; it fails later on a smaller one. I therefore concluded that a single oversized batch is not causing the memory error; this appears to be a memory leak.

With a larger batch size, the program fails earlier; with a smaller batch size, it fails later.

Full error:

Traceback (most recent call last):
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[500,80]
     [[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
     [[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nave01314/IdeaProjects/tf-nmt/main.py", line 89, in <module>
    _ = sess.run([update_step])
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[500,80]
     [[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
     [[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]

Caused by op 'decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul', defined at:
  File "/home/nave01314/IdeaProjects/tf-nmt/main.py", line 49, in <module>
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 309, in dynamic_decode
    swap_memory=swap_memory)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2819, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2643, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2593, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 254, in body
    decoder_finished) = decoder.step(time, inputs, state)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py", line 138, in step
    cell_outputs, cell_state = self._cell(inputs, state)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 290, in __call__
    return base_layer.Layer.__call__(self, inputs, state, scope=scope)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 618, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 567, in call
    array_ops.concat([inputs, h], 1), self._kernel)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1993, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2532, in _mat_mul
    name=name)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3081, in create_op
    op_def=op_def)
  File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1528, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[500,80]
     [[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
     [[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]
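
To put the failing allocation in perspective, here is a small sketch that pulls the shape out of the OOM message and estimates the tensor's size. The helper name and the dtype table are assumptions made for illustration; the message text and the `DT_FLOAT` type are taken from the traceback above.

```python
import re

# Bytes per element for a few dtype names as they appear in node definitions.
# Illustrative table, not an exhaustive list of TensorFlow dtypes.
DTYPE_BYTES = {"DT_FLOAT": 4, "DT_DOUBLE": 8, "DT_INT32": 4, "DT_INT64": 8}


def oom_tensor_bytes(message, dtype="DT_FLOAT"):
    """Return (dims, size_in_bytes) for the tensor named in an OOM message."""
    match = re.search(r"OOM when allocating tensor with shape\[([\d,]+)\]", message)
    if match is None:
        raise ValueError("no OOM shape found in message")
    dims = [int(d) for d in match.group(1).split(",")]
    count = 1
    for d in dims:
        count *= d
    return dims, count * DTYPE_BYTES[dtype]


message = "ResourceExhaustedError: OOM when allocating tensor with shape[500,80]"
dims, size = oom_tensor_bytes(message)
print(dims, size)  # [500, 80] 160000
```

The tensor that triggered the error is only about 156 KiB, which suggests the device was already nearly full when this allocation was attempted; the reported tensor is where memory ran out, not necessarily what consumed it.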

Code snippet (from models.py)

import tensorflow as tf
from tensorflow.python.layers import core as layers_core


class NMTModel:
    def __init__(self, hparams, iterator, mode):
        source, target_in, target_out, source_lengths, target_lengths = iterator.get_next()
        true_batch_size = tf.size(source_lengths)

        # Lookup embeddings
        embedding_encoder = tf.get_variable("embedding_encoder", [hparams.src_vsize, hparams.src_emsize])
        encoder_emb_inp = tf.nn.embedding_lookup(embedding_encoder, source)
        embedding_decoder = tf.get_variable("embedding_decoder", [hparams.tgt_vsize, hparams.tgt_emsize])
        decoder_emb_inp = tf.nn.embedding_lookup(embedding_decoder, target_in)

        # Build and run Encoder LSTM
        encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(hparams.num_units)
        encoder_outputs, encoder_state = tf.nn.dynamic_rnn(encoder_cell, encoder_emb_inp, sequence_length=source_lengths, dtype=tf.float32)

        # Build and run Decoder LSTM with Helper and output projection layer
        decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(hparams.num_units)
        projection_layer = layers_core.Dense(hparams.tgt_vsize, use_bias=False)
        # if mode == 'TRAIN' or mode == 'EVAL':  # then decode using TrainingHelper
        #     helper = tf.contrib.seq2seq.TrainingHelper(decoder_emb_inp, sequence_length=target_lengths)
        # elif mode == 'INFER':  # then decode greedily using GreedyEmbeddingHelper
        #     helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embedding_decoder, tf.fill([true_batch_size], hparams.sos), hparams.eos)
        helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embedding_decoder, tf.fill([true_batch_size], hparams.sos), hparams.eos)
        decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, encoder_state, output_layer=projection_layer)
        outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=tf.reduce_max(target_lengths))
        logits = outputs.rnn_output

        if mode == 'TRAIN' or mode == 'EVAL':  # then calculate loss
            crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target_out, logits=logits)
            target_weights = tf.sequence_mask(target_lengths, maxlen=tf.shape(target_out)[1], dtype=logits.dtype)
            self.loss = tf.reduce_sum((crossent * target_weights)) / tf.cast(true_batch_size, tf.float32)

        if mode == 'TRAIN':  # then calculate/clip gradients, then optimize model
            params = tf.trainable_variables()
            gradients = tf.gradients(self.loss, params)
            clipped_gradients, _ = tf.clip_by_global_norm(gradients, hparams.max_gradient_norm)

            optimizer = tf.train.AdamOptimizer(hparams.l_rate)
            self.update_step = optimizer.apply_gradients(zip(clipped_gradients, params))

        if mode == 'EVAL' or mode == 'INFER':  # then allow access to input/output tensors to printout
            self.src = source
            self.tgt = target_out
            self.preds = tf.argmax(logits, axis=2)

        # Designate a saver operation
        self.saver = tf.train.Saver(tf.global_variables())

    def train(self, sess):
        return sess.run([self.update_step, self.loss])

    def eval(self, sess):
        return sess.run([self.loss, self.src, self.tgt, self.preds])

    def infer(self, sess):
        return sess.run([self.src, self.tgt, self.preds])  # tgt should not exist (temporary debugging only)
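
String mode checks such as those in `__init__` depend on the difference between `==` (value equality) and `is` (object identity). The sketch below is a side note unrelated to the OOM itself; the names `expected`, `mode` and `is_training` are illustrative only.

```python
expected = "TRAIN"

# Build an equal string at runtime, as might happen when the mode
# comes from a command-line flag or a config file.
mode = "".join(["TR", "AIN"])


def is_training(mode):
    """Compare by value, which does not depend on string interning."""
    return mode == "TRAIN"


print(mode == expected)   # value comparison
print(mode is expected)   # identity comparison; depends on interpreter internals
print(is_training(mode))
```

The identity check may or may not succeed depending on how the interpreter happens to store the two strings, so `==` is the reliable choice for selecting a branch.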

Full code (a simplified version, very similar to the NMT tutorial):

The model code is in models.py, the iterator code is in data_pipeline.py, and the entry point is in main.py.

https://github.com/nave01314/tf-nmt

python python-3.x tensorflow

Evan Weissburg 10 . '17 22:01

tf.GraphDef 2 , OOM.

[BATCH_SIZE, MAX_SEQUENCE_LENGTH], , . .

J.E.K 14 . '17 21:39

Evan Weissburg · Accepted Answer · 2018-01-19T00:08:52+0000

, OOM, .

( , ) , .

, .

.
