BerkeleyDB Performance Issues

I need a disk-based key-value store that can sustain high write and read performance for large data sets. Tall order, I know.

I am trying to use the BerkeleyDB C library (5.1.25) from Java and I am seeing serious performance problems.

I get a solid 14K docs/s for a short while, but as soon as I reach a few hundred thousand documents, performance drops like a rock, then recovers for a while, then drops again, and so on. This happens more and more often, until after 10 million documents I can't get more than 60 docs/s most of the time, with a few isolated peaks of 12K docs/s. My db type of choice is HASH, but I also tried BTREE and it is the same.

I tried using a pool of 10 DBs and hashing documents across them to smooth out the performance drops; this increased the write throughput to 50K docs/s, but it did not help with the drops: all 10 DBs slowed down at roughly the same time.
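
For reference, the sharding described above amounts to something like the following sketch against the com.sleepycat.db Java bindings; the shard count, file names and hash function are illustrative, not my exact code:

```java
import com.sleepycat.db.*;

public class ShardedStore {
    private final Database[] shards;

    public ShardedStore(Environment env, int numShards) throws Exception {
        shards = new Database[numShards];
        DatabaseConfig cfg = new DatabaseConfig();
        cfg.setAllowCreate(true);
        cfg.setType(DatabaseType.HASH);   // same HASH access method as above
        for (int i = 0; i < numShards; i++) {
            shards[i] = env.openDatabase(null, "shard" + i + ".db", null, cfg);
        }
    }

    // Route each document to a shard by hashing its key.
    public void put(byte[] key, byte[] value) throws Exception {
        int shard = Math.floorMod(java.util.Arrays.hashCode(key), shards.length);
        shards[shard].put(null, new DatabaseEntry(key), new DatabaseEntry(value));
    }
}
```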

My guess is that the files are being reorganized, and I tried to find a configuration parameter that affects when this reorganization takes place, so that each DB in the pool would reorganize at a different time, but I couldn't find anything that worked. I tried different cache sizes, and I tried reserving space with the setHashNumElements configuration option so that it wouldn't spend time growing the file, but every tweak made it much worse.
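
The tuning I experimented with looked roughly like this sketch; the specific numbers are placeholders for the values I varied, not a recommendation:

```java
import com.sleepycat.db.*;
import java.io.File;

public class TuningSketch {
    public static Database open(File home) throws Exception {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        envCfg.setInitializeCache(true);
        envCfg.setCacheSize(512L * 1024 * 1024);   // cache size varied between runs
        Environment env = new Environment(home, envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        dbCfg.setType(DatabaseType.HASH);
        dbCfg.setHashNumElements(10_000_000);      // pre-size the hash so the file isn't grown on the fly
        return env.openDatabase(null, "docs.db", null, dbCfg);
    }
}
```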

I am about to give up on BerkeleyDB and try more complex solutions like Cassandra, but I want to make sure I am not doing something wrong with BerkeleyDB before writing it off.

Anyone here with experience achieving sustained write performance with BerkeleyDB?

Edit 1:

I have tried several things already:

  • Throttling writes to 500/s (less than the average I got after writing 30 million documents in 15 hours, which suggests the hardware is capable of 550 docs/s). It didn't work: once a certain number of documents have been written, performance drops regardless.
  • Writing incoming items to a queue. This has two problems: a) it defeats the purpose of freeing up RAM; b) the queue eventually backs up, because the periods during which BerkeleyDB freezes get longer and more frequent. (A rough sketch of this queue-based setup follows the list.)
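
For concreteness, the queue-based variant was roughly shaped like this (simplified; the queue capacity and the single writer thread are assumptions, not my exact code):

```java
import com.sleepycat.db.*;
import java.util.concurrent.*;

// Producer threads offer documents; a single writer drains the queue into BDB.
public class QueuedWriter implements Runnable {
    private final BlockingQueue<DatabaseEntry[]> queue = new ArrayBlockingQueue<>(100_000);
    private final Database db;

    public QueuedWriter(Database db) { this.db = db; }

    public boolean offer(byte[] key, byte[] value) {
        // Returns false once BDB stalls long enough for the buffer to fill,
        // which is exactly the "queue backs up" problem described above.
        return queue.offer(new DatabaseEntry[] { new DatabaseEntry(key), new DatabaseEntry(value) });
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                DatabaseEntry[] kv = queue.take();
                db.put(null, kv[0], kv[1]);
            }
        } catch (InterruptedException | DatabaseException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```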

In other words, even if I throttle the incoming data to stay below the hardware's capabilities and use RAM to hold items while BerkeleyDB takes time to adapt to the growth, performance still approaches zero as those pauses get longer and longer.

This surprises me, because I have seen claims that it can handle terabytes of data, but my tests show otherwise. I still hope I'm doing something wrong...

Edit 2:

After thinking it over some more with Peter's input, I now understand that as the file grows larger, a batch of writes gets spread farther apart, and the chance that they land on the same disk cylinder drops, until the disk's accesses-per-second limit is reached.

But BerkeleyDB's periodic reorganization kills performance much earlier than that, and much worse: it simply stops responding for longer and longer periods of time while it shuffles things around. Using faster disks or spreading the database files across different disks does not help. I need to find a way around these stalls.

5 answers

What I have seen with high rates of disk writes is that the system cache fills up (giving lightning-fast performance up to that point), but once it fills, the application and even the whole system can slow down dramatically, even stop.

Your underlying physical disk should sustain at least 100 writes per second; anything beyond that is an illusion supported by clever caching. ;) When the caching system is exhausted, though, you will see very bad behaviour.

I suggest you consider a disk controller cache. Its battery-backed memory would need to be about the size of your data.

Another option is to use SSDs if the updates are bursty (they can do 10K+ writes per second as they have no moving parts). Combined with caching, this should give you more than you need, but SSDs have a limited number of writes.


BerkeleyDB does not reorganize files unless you manually invoke the compaction utility. There are several reasons for the slowdown:

  • Keys are written in random order, which leads to significantly higher disk I/O load.
  • Writes are durable by default, which forces a lot of extra disk flushes.
  • The environment is transactional, in which case checkpoints cause a slowdown when they flush changed pages to disk. (A sketch for relaxing the last two points follows this list.)
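
If the second and third points apply, the Java bindings expose environment flags for relaxing durability. A rough sketch, assuming you can tolerate losing the most recent transactions after a crash (the particular flag chosen here is an illustration, not the only option):

```java
import com.sleepycat.db.*;
import java.io.File;

// Sketch: trade durability for write throughput. TXN_WRITE_NOSYNC keeps the
// transaction log but skips the fsync on commit, so a crash can lose the
// most recently committed transactions.
public class RelaxedDurability {
    public static Environment open(File home) throws Exception {
        EnvironmentConfig cfg = new EnvironmentConfig();
        cfg.setAllowCreate(true);
        cfg.setInitializeCache(true);
        cfg.setInitializeLogging(true);
        cfg.setInitializeLocking(true);
        cfg.setTransactional(true);
        cfg.setTxnWriteNoSync(true);   // commit without forcing the log to disk
        return new Environment(home, cfg);
    }
}
```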

When you say "documents", do you mean you are using BDB to store records larger than a few kilobytes? BDB overflow pages have more overhead, so you should consider using a larger page size.
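
Setting a larger page size through the Java bindings looks roughly like this (a sketch; it must be done before the database file is created, and 32 KB is only an illustrative value, with 64 KB being the maximum):

```java
import com.sleepycat.db.*;

public class LargePages {
    // A larger page size lets multi-KB records stay on regular pages
    // instead of spilling onto overflow pages. Valid sizes are powers
    // of two up to 64 KB, set before the database file is created.
    public static Database open(Environment env) throws Exception {
        DatabaseConfig cfg = new DatabaseConfig();
        cfg.setAllowCreate(true);
        cfg.setType(DatabaseType.HASH);
        cfg.setPageSize(32 * 1024);
        return env.openDatabase(null, "docs.db", null, cfg);
    }
}
```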


We used BerkeleyDB (BDB) at work and saw similar performance trends. BerkeleyDB uses a B-tree to store its key/value pairs. As the number of entries grows, the depth of the tree increases. BerkeleyDB's caching works by loading tree pages into RAM, so that traversing the tree does not require file I/O (reads from disk).


This is an old question and the problem has probably gone away, but I recently had similar problems (insertion speed dropping sharply after a few hundred thousand records), and they were solved by giving the database more cache (DB->set_cachesize). With 2 GB of cache, insertion speed was very good and more or less constant up to 10 million records (I haven't tested beyond that).
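
For the Java bindings the question uses, the equivalent of DB->set_cachesize is roughly this sketch (2 GB matches the figure above; everything else is illustrative):

```java
import com.sleepycat.db.*;
import java.io.File;

public class BigCache {
    public static Environment open(File home) throws Exception {
        EnvironmentConfig cfg = new EnvironmentConfig();
        cfg.setAllowCreate(true);
        cfg.setInitializeCache(true);
        cfg.setCacheSize(2L * 1024 * 1024 * 1024);   // 2 GB shared cache, as above
        return new Environment(home, cfg);
    }
}
```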


I need a disk-based key-value store that can sustain high write and read performance for large data sets.

Chronicle Map is a modern solution for this task. It is much faster than BerkeleyDB on both reads and writes, and much more scalable in terms of concurrent access from multiple threads/processes.
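
A minimal sketch of a persisted Chronicle Map, assuming String keys, byte[] values and placeholder sizing hints (the builder methods can vary between versions):

```java
import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;

public class ChronicleExample {
    public static void main(String[] args) throws Exception {
        // Sizing hints are required up front; the numbers here are placeholders.
        ChronicleMap<String, byte[]> docs = ChronicleMap
                .of(String.class, byte[].class)
                .name("docs")
                .entries(10_000_000)
                .averageKeySize(32)
                .averageValueSize(2_048)
                .createPersistedTo(new File("docs.dat"));

        docs.put("doc-1", new byte[] { 1, 2, 3 });
        byte[] value = docs.get("doc-1");
        System.out.println(value.length);
        docs.close();
    }
}
```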

