How do you upload an LMDB file to TensorFlow?

I have a large (1 TB) dataset, split across approximately 3000 CSV files. My plan is to convert it to one large LMDB file so that it can be read quickly while training a neural network. However, I could not find any documentation on how to load an LMDB file into TensorFlow. Does anyone know how to do this? I know that TensorFlow can read CSV files directly, but I expect that to be too slow.


According to this, there are several ways to read data into TensorFlow.

The easiest is to feed your data through placeholders. With placeholders, shuffling and batching are your responsibility.
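A minimal sketch of the placeholder approach (the shapes, the next_batch helper, and train_op are assumptions for illustration, not part of the original answer):

import tensorflow as tf

# One placeholder per input; the shapes here are illustrative.
x = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
y = tf.placeholder(tf.float32, shape=[None])
# ... build the model, a loss, and train_op on top of x and y ...

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        # next_batch is a hypothetical helper that shuffles your
        # data and returns (examples, labels) as numpy arrays.
        examples, labels = next_batch(64)
        sess.run(train_op, feed_dict={x: examples, y: labels})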

If you want to delegate shuffling and batching to the framework, you need to build an input pipeline. The question is how to get LMDB data into the symbolic input pipeline. A possible solution is the tf.py_func op. Here is an example:

import tensorflow as tf

def create_input_pipeline(lmdb_env, keys, num_epochs=10, batch_size=64):
    # A shuffled queue of LMDB keys; it raises OutOfRangeError
    # once num_epochs is exhausted.
    key_producer = tf.train.string_input_producer(keys,
                                                  num_epochs=num_epochs,
                                                  shuffle=True)
    single_key = key_producer.dequeue()

    def get_bytes_from_lmdb(key):
        # Plain Python: look up the value for this key in LMDB.
        with lmdb_env.begin() as txn:
            lmdb_val = txn.get(key)
        example = get_example_from_val(lmdb_val)  # A single example (numpy array)
        label = get_label_from_val(lmdb_val)      # The label, could be a scalar
        return example, label

    # Wrap the Python lookup as a graph op; the output types
    # must be declared explicitly.
    single_example, single_label = tf.py_func(get_bytes_from_lmdb,
                                              [single_key],
                                              [tf.float32, tf.float32])
    # tf.train.batch needs static shapes; if you know them, set them here:
    # single_example.set_shape([224, 224, 3])

    # A second queue assembles single examples into batches.
    batch_examples, batch_labels = tf.train.batch([single_example, single_label],
                                                  batch_size)
    return batch_examples, batch_labels
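The helpers get_example_from_val and get_label_from_val are not defined in the answer; they depend entirely on how you serialized your CSV rows into LMDB. A hypothetical sketch, assuming each value is a flat float32 buffer whose last element is the label:

import numpy as np

def get_example_from_val(lmdb_val):
    # All but the last float32 element form the example
    # (an assumption about the storage format).
    return np.frombuffer(lmdb_val, dtype=np.float32)[:-1]

def get_label_from_val(lmdb_val):
    # The trailing float32 element is assumed to be the label.
    return np.frombuffer(lmdb_val, dtype=np.float32)[-1:]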

The tf.py_func op inserts a call to regular Python code inside the TensorFlow graph; we need to specify its inputs and the number and types of its outputs. tf.train.string_input_producer creates a shuffled queue with the given keys. tf.train.batch creates another queue that holds batches of data. During training, each evaluation of batch_examples or batch_labels dequeues the next batch from that queue.
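For completeness, a sketch of how the pipeline might be constructed, assuming the lmdb Python package and a placeholder database path:

import lmdb

lmdb_env = lmdb.open('/path/to/data.lmdb', readonly=True)
# Collect all keys up front; they seed the shuffled key queue.
with lmdb_env.begin() as txn:
    keys = [key for key, _ in txn.cursor()]

batch_examples, batch_labels = create_input_pipeline(lmdb_env, keys)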

Finally, the queues have to be filled by QueueRunner threads before anything can be dequeued. Here is the usual boilerplate for creating and starting them (example from the TensorFlow documentation):

# Create the graph, etc.
# Note: string_input_producer with num_epochs creates a local
# variable, so local variables must be initialized as well.
init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())

# Create a session for running operations in the Graph.
sess = tf.Session()

# Initialize the variables (like the epoch counter).
sess.run(init_op)

# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

try:
    while not coord.should_stop():
        # Run training steps or whatever
        sess.run(train_op)

except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    # When done, ask the threads to stop.
    coord.request_stop()

# Wait for threads to finish.
coord.join(threads)
sess.close()
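To connect the two pieces: the batch tensors returned by create_input_pipeline feed the model directly, so the training loop above needs no feed_dict. A minimal sketch (my_model and the loss choice are assumptions, not part of the original answer):

# Build the model directly on the dequeued batch tensors.
logits = my_model(batch_examples)  # hypothetical model function
loss = tf.losses.mean_squared_error(batch_labels, logits)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
# Each sess.run(train_op) in the loop above pulls a fresh batch
# from the queue until the epoch limit raises OutOfRangeError.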