Writing data to LMDB with Python is very slow

While creating training datasets for Caffe, I tried both HDF5 and LMDB. However, creating the LMDB is very slow, even slower than HDF5. I am trying to write ~20,000 images.

Am I doing something terribly wrong? Is there something I don't know about?

This is my code for creating LMDB:

    import lmdb
    import caffe

    DB_KEY_FORMAT = "{:0>10d}"

    db = lmdb.open(path, map_size=int(1e12))
    curr_idx = 0
    commit_size = 1000
    for curr_commit_idx in range(0, num_data, commit_size):
        # One write transaction per chunk of 1000 images.
        with db.begin(write=True) as in_txn:
            for i in range(curr_commit_idx, min(curr_commit_idx + commit_size, num_data)):
                d, l = data[i], labels[i]
                im_dat = caffe.io.array_to_datum(d.astype(float), label=int(l))
                key = DB_KEY_FORMAT.format(curr_idx)
                in_txn.put(key, im_dat.SerializeToString())
                curr_idx += 1
    db.close()

As you can see, I create one transaction per 1000 images, because I thought committing each image individually would add overhead, but batching does not seem to improve performance much.

+6
3 answers

In my experience, writing Caffe data to LMDB from Python took 50-100 ms per write on an ext4 hard drive on Ubuntu. That is why I now write to tmpfs (the RAM disk built into Linux), where the same writes take about 0.07 ms. You can create smaller databases on the RAM disk, copy them to your hard drive, and then train on all of them. I make mine 20-40 GB, since I have 64 GB of RAM.
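As a rough sanity check, here is a minimal timing sketch, assuming /mnt/ramdisk is a tmpfs mount and /data sits on the ext4 disk (both paths are assumptions, adjust them to your system):

    import time
    import lmdb

    # Assumed mounts: /mnt/ramdisk is tmpfs (e.g. created with
    # `mount -t tmpfs -o size=1g tmpfs /mnt/ramdisk`), /data is on ext4.

    def time_writes(path, n=1000):
        # Time n small puts, each in its own write transaction,
        # and return the average latency in milliseconds.
        env = lmdb.open(path, map_size=int(1e9))
        value = 'x' * 4096  # dummy 4 KB payload
        start = time.time()
        for i in range(n):
            with env.begin(write=True) as txn:
                txn.put('%010d' % i, value)
        env.close()
        return (time.time() - start) / n * 1000

    print 'tmpfs: %.3f ms/write' % time_writes('/mnt/ramdisk/test_lmdb')
    print 'ext4:  %.3f ms/write' % time_writes('/data/test_lmdb')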

Here are some code snippets to help you dynamically create an LMDB in memory, fill it, and move it to storage. Feel free to adapt them to your case. They should save you some time figuring out how LMDB and file handling work in Python.

    import os
    import random
    import shutil
    import string

    import caffe
    import lmdb

    # `fold`, `transformed_image`, `label` and `itr` are defined elsewhere
    # in the surrounding training script.

    def move_db():
        # Close the LMDB on the RAM disk, move it to permanent storage
        # under a random 5-character name, then reopen a fresh one.
        global image_db
        image_db.close()
        rnd = ''.join(random.choice(string.ascii_uppercase + string.digits)
                      for _ in range(5))
        shutil.move(fold + 'ram/train_images', '/storage/lmdb/' + rnd)
        open_db()

    def open_db():
        global image_db
        image_db = lmdb.open(os.path.join(fold, 'ram/train_images'),
                             map_async=True, max_dbs=0)

    def write_to_lmdb(db, key, value):
        """Write (key, value) to db, growing the map as needed."""
        success = False
        while not success:
            txn = db.begin(write=True)
            try:
                txn.put(key, value)
                txn.commit()
                success = True
            except lmdb.MapFullError:
                txn.abort()
                # Double the map_size and retry the write.
                curr_limit = db.info()['map_size']
                new_limit = curr_limit * 2
                print '>>> Doubling LMDB map size to %sMB ...' % (new_limit >> 20,)
                db.set_mapsize(new_limit)

    ...

    image_datum = caffe.io.array_to_datum(transformed_image, label)
    write_to_lmdb(image_db, str(itr), image_datum.SerializeToString())
+6

Try the following:

    DB_KEY_FORMAT = "{:0>10d}"

    db = lmdb.open(path, map_size=int(1e12))
    curr_idx = 0
    commit_size = 1000
    # A single write transaction for the whole dataset.
    with db.begin(write=True) as in_txn:
        for curr_commit_idx in range(0, num_data, commit_size):
            for i in range(curr_commit_idx, min(curr_commit_idx + commit_size, num_data)):
                d, l = data[i], labels[i]
                im_dat = caffe.io.array_to_datum(d.astype(float), label=int(l))
                key = DB_KEY_FORMAT.format(curr_idx)
                in_txn.put(key, im_dat.SerializeToString())
                curr_idx += 1
    db.close()

Entering a write transaction with

    with db.begin(write=True) as in_txn:

takes a lot of time, so this version does it once for the whole dataset instead of once per 1000 images.
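To see what the repeated begin(write=True) actually costs, here is a minimal benchmark sketch (the 2000 records with 4 KB dummy payloads are assumptions, not the question's data) comparing one transaction per record with a single transaction for everything:

    import time
    import lmdb

    items = [('%010d' % i, 'x' * 4096) for i in range(2000)]  # dummy records

    def write_per_item(env):
        # One write transaction per record: pays the begin/commit
        # (and a disk sync) cost for every single put.
        for key, value in items:
            with env.begin(write=True) as txn:
                txn.put(key, value)

    def write_single_txn(env):
        # One write transaction for all records: a single commit at the end.
        with env.begin(write=True) as txn:
            for key, value in items:
                txn.put(key, value)

    for name, fn in [('per-item', write_per_item), ('single-txn', write_single_txn)]:
        env = lmdb.open('/tmp/bench_' + name, map_size=int(1e9))
        start = time.time()
        fn(env)
        print '%s: %.2f s' % (name, time.time() - start)
        env.close()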

+3

LMDB is very sensitive to insertion order: if you can sort your data by key before inserting, write speed improves significantly.
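For example, here is a minimal sketch reusing the variables from the question (path, data, labels, num_data). The zero-padded keys are already written in increasing order, so they are sorted, and you can additionally pass append=True to put(), which tells LMDB to append at the end of the B-tree instead of searching for each insert position:

    import lmdb
    import caffe

    DB_KEY_FORMAT = "{:0>10d}"

    db = lmdb.open(path, map_size=int(1e12))
    with db.begin(write=True) as txn:
        # append=True (MDB_APPEND) is only valid when keys arrive in
        # sorted order; zero-padded increasing indices satisfy that.
        for i in range(num_data):
            datum = caffe.io.array_to_datum(data[i].astype(float),
                                            label=int(labels[i]))
            txn.put(DB_KEY_FORMAT.format(i), datum.SerializeToString(),
                    append=True)
    db.close()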

0
