Summary and test cases
The main problem is that Tensorflow allocates OOM allocation to a package that is not the first as I expected. Therefore, I believe that there is a memory leak, since all memory is clearly not freed after each batch.
num_units: 50, batch_size: 1000; fails OOM (gpu) before 1st batch as expected
num_units: 50, batch_size: 800, fails OOM (gpu) before 1st batch as expected
num_units: 50, batch_size: 750; fails OOM (gpu) after 10th batch (???)
num_units: 50, batch_size: 500; fails OOM (gpu) after 90th batch (???)
num_units: 50, batch_size: 300; fails OOM (gpu) after 540th batch (???)
num_units: 50, batch_size: 200; computer freezes after around 900 batches with 100% ram use
num_units: 50, batch_size: 100; passes 1 epoch -- may fail later (unknown)
Explanation:
Essentially, it launches a batch 144size package 500before crashing on the 145th batch, which seems strange. If he cannot allocate sufficient memory for the 145th party, why should she work for the first 144? Behavior can be replicated.
Please note that each batch has a size in size, since each of them has dimensions [BATCH_SIZE, MAX_SEQUENCE_LENGTH], and depending on the sample sequences, the length of the sequence changes, but the program does not crash on the largest batch; he fails later on the smaller one. Therefore, I came to the conclusion that one large batch does not cause a memory error; this is apparently a memory leak.
With a larger batch, the program does not work earlier; with a smaller batch size, it does not work later.
Full error:
Traceback (most recent call last):
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[500,80]
[[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
[[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/nave01314/IdeaProjects/tf-nmt/main.py", line 89, in <module>
_ = sess.run([update_step])
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[500,80]
[[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
[[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]
Caused by op 'decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul', defined at:
File "/home/nave01314/IdeaProjects/tf-nmt/main.py", line 49, in <module>
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 309, in dynamic_decode
swap_memory=swap_memory)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2819, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2643, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2593, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 254, in body
decoder_finished) = decoder.step(time, inputs, state)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py", line 138, in step
cell_outputs, cell_state = self._cell(inputs, state)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 290, in __call__
return base_layer.Layer.__call__(self, inputs, state, scope=scope)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 618, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 567, in call
array_ops.concat([inputs, h], 1), self._kernel)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1993, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2532, in _mat_mul
name=name)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3081, in create_op
op_def=op_def)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1528, in __init__
self._traceback = self._graph._extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[500,80]
[[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
[[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]
Code snippet (from models.py)
import tensorflow as tf
from tensorflow.python.layers import core as layers_core
class NMTModel:
def __init__(self, hparams, iterator, mode):
source, target_in, target_out, source_lengths, target_lengths = iterator.get_next()
true_batch_size = tf.size(source_lengths)
embedding_encoder = tf.get_variable("embedding_encoder", [hparams.src_vsize, hparams.src_emsize])
encoder_emb_inp = tf.nn.embedding_lookup(embedding_encoder, source)
embedding_decoder = tf.get_variable("embedding_decoder", [hparams.tgt_vsize, hparams.tgt_emsize])
decoder_emb_inp = tf.nn.embedding_lookup(embedding_decoder, target_in)
encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(hparams.num_units)
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(encoder_cell, encoder_emb_inp, sequence_length=source_lengths, dtype=tf.float32)
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(hparams.num_units)
projection_layer = layers_core.Dense(hparams.tgt_vsize, use_bias=False)
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embedding_decoder, tf.fill([true_batch_size], hparams.sos), hparams.eos)
decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, encoder_state, output_layer=projection_layer)
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=tf.reduce_max(target_lengths))
logits = outputs.rnn_output
if mode is 'TRAIN' or mode is 'EVAL':
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target_out, logits=logits)
target_weights = tf.sequence_mask(target_lengths, maxlen=tf.shape(target_out)[1], dtype=logits.dtype)
self.loss = tf.reduce_sum((crossent * target_weights)) / tf.cast(true_batch_size, tf.float32)
if mode is 'TRAIN':
params = tf.trainable_variables()
gradients = tf.gradients(self.loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(gradients, hparams.max_gradient_norm)
optimizer = tf.train.AdamOptimizer(hparams.l_rate)
self.update_step = optimizer.apply_gradients(zip(clipped_gradients, params))
if mode is 'EVAL' or mode is 'INFER':
self.src = source
self.tgt = target_out
self.preds = tf.argmax(logits, axis=2)
self.saver = tf.train.Saver(tf.global_variables())
def train(self, sess):
return sess.run([self.update_step, self.loss])
def eval(self, sess):
return sess.run([self.loss, self.src, self.tgt, self.preds])
def infer(self, sess):
return sess.run([self.src, self.tgt, self.preds])
Full code (very similar to NMT tutorial, simplified).
The model code is in models.py, the iterator code is in data_pipeline.py, main - main.py.
https://github.com/nave01314/tf-nmt