I need a disk-based key-value store that can sustain high write and read throughput for large data sets. Tall order, I know.
I'm trying to use the BerkeleyDB C library (5.1.25) from Java, and I'm seeing serious performance problems.
I get a steady 14K docs/s for a short while, but as soon as I reach a few hundred thousand documents, performance drops like a rock, then recovers for a while, then drops again, and so on. This happens more and more often, until by around 10 million documents I can't get more than 60 docs/s most of the time, with a few isolated peaks of 12K docs/s. My db type of choice is HASH, but I also tried BTREE and it behaves the same way.
I tried using a pool of 10 databases and hashing the documents among them to smooth out the performance drops; this raised the write throughput to 50K docs/s, but it didn't help with the drops: all 10 databases slowed down at the same time.
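For reference, here is a minimal sketch of what I mean by the pool approach, using the com.sleepycat.db Java bindings to the C library. The class and file names, the cache size, and the raw byte[] keys/values are just placeholders, not my exact code:

```java
import java.io.File;
import com.sleepycat.db.Database;
import com.sleepycat.db.DatabaseConfig;
import com.sleepycat.db.DatabaseEntry;
import com.sleepycat.db.DatabaseException;
import com.sleepycat.db.DatabaseType;
import com.sleepycat.db.Environment;
import com.sleepycat.db.EnvironmentConfig;

public class DbPool {
    private final Database[] dbs;

    public DbPool(File home, int n) throws Exception {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        envCfg.setInitializeCache(true);          // DB_INIT_MPOOL
        envCfg.setCacheSize(512L * 1024 * 1024);  // shared cache; size is a guess
        Environment env = new Environment(home, envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        dbCfg.setType(DatabaseType.HASH);         // same slowdown pattern with BTREE

        dbs = new Database[n];
        for (int i = 0; i < n; i++) {
            dbs[i] = env.openDatabase(null, "docs-" + i + ".db", null, dbCfg);
        }
    }

    // Route each document to one database in the pool by key hash.
    public void put(byte[] key, byte[] value) throws DatabaseException {
        int slot = (java.util.Arrays.hashCode(key) & 0x7fffffff) % dbs.length;
        dbs[slot].put(null, new DatabaseEntry(key), new DatabaseEntry(value));
    }
}
```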
I assume the files are being reorganized, and I tried to find a configuration parameter that affects when this reorganization takes place, so that each db in the pool would reorganize at a different time, but I couldn't find anything that worked. I tried different cache sizes, and reserving space with the setHashNumElements configuration option so it wouldn't spend time growing the file, but every tweak made it much worse.
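This is roughly where I set those pre-sizing options; a sketch with made-up numbers, only to show which DatabaseConfig knobs I've been experimenting with (the database is opened standalone here, outside an environment, just to keep the example short):

```java
import java.io.File;
import com.sleepycat.db.Database;
import com.sleepycat.db.DatabaseConfig;
import com.sleepycat.db.DatabaseType;

public class PreSizedDb {
    // Open a HASH database with the pre-sizing settings mentioned above.
    // The numbers are illustrative, not recommendations.
    public static Database open(File file) throws Exception {
        DatabaseConfig cfg = new DatabaseConfig();
        cfg.setAllowCreate(true);
        cfg.setType(DatabaseType.HASH);
        cfg.setHashNumElements(10000000);       // expected key count, so the hash table
                                                // is allocated up front instead of growing
        cfg.setCacheSize(256L * 1024 * 1024);   // one of the cache sizes I tried
        return new Database(file.getPath(), null, cfg);
    }
}
```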
I'm about to give up on BerkeleyDB and try more complex solutions like Cassandra, but I want to make sure I'm not doing something wrong with BerkeleyDB before writing it off.
Anyone here with experience achieving sustained write performance with BerkeleyDB?
Edit 1:
I have already tried several things:
- Throttling the writes to 500/s (less than the average I got after writing 30 million documents in 15 hours, which suggests the hardware is capable of writing 550 docs/s). It didn't work: once a certain number of documents have been written, performance drops regardless.
- Writing incoming items to a queue. This has two problems: (a) it defeats the purpose of freeing up RAM, and (b) the queue eventually backs up because the periods during which BerkeleyDB freezes get longer and more frequent.
In other words, even if I throttle the incoming data to stay below the hardware's capability and use RAM to hold items while BerkeleyDB takes its time to adapt to the growth, performance approaches 0 as that time gets longer and longer.
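For completeness, this is roughly how the throttle-plus-queue combination was wired up (reusing the DbPool sketch from above; the Doc type, rate, and queue capacity are placeholders):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ThrottledWriter implements Runnable {
    static class Doc { byte[] key; byte[] value; }

    private final BlockingQueue<Doc> queue = new ArrayBlockingQueue<Doc>(100000);
    private final DbPool pool;               // the hashed pool sketched earlier
    private final int maxPerSecond = 500;    // just under the ~550 docs/s the hardware sustained

    public ThrottledWriter(DbPool pool) { this.pool = pool; }

    // Producers block here once the queue fills up -- which is exactly what happens
    // when BerkeleyDB's stalls grow longer than the queue can absorb.
    public void submit(Doc d) throws InterruptedException { queue.put(d); }

    @Override
    public void run() {
        try {
            while (true) {
                long start = System.currentTimeMillis();
                // Drain at most maxPerSecond documents, then sleep out the rest of the second.
                for (int i = 0; i < maxPerSecond; i++) {
                    Doc d = queue.take();
                    pool.put(d.key, d.value);
                }
                long elapsed = System.currentTimeMillis() - start;
                if (elapsed < 1000) {
                    Thread.sleep(1000 - elapsed);
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```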
This surprises me because I've seen claims that it can handle terabytes of data, yet my tests show otherwise. I still hope I'm doing something wrong...
Edit 2:
After giving it some more thought with Peter's input, I now understand that as the file grows larger, a batch of writes gets spread farther apart, and the chance that they land in the same disk cylinder drops, until the disk's accesses-per-second limit is reached.
But BerkeleyDB's periodic reorganizations kill performance much earlier than that, and in a much worse way: it simply stops responding for longer and longer stretches while it shuffles things around. Using faster disks or spreading the database files across different disks doesn't help. I need to find a way around these throughput holes.