Kyoto / Berkeley DB: hash table size limits

I'm having trouble storing hundreds of millions of 16/32-byte key/value pairs in a hash database on my SSD.

With Kyoto Cabinet: when it works well, it inserts about 70,000 records per second. Once it falls over, this drops to 10-500 records per second. With the default settings, the drop happens after about one million records. Looking at the documentation, that is the default number of buckets in the hash array, so it makes sense. I increased this number to 25 million, and indeed it works fine up to about 25 million records. The problem is that as soon as I push the bucket count to 30 million or more, the insertion rate drops to 10-500 records per second from the very beginning. Kyoto Cabinet is not designed to have its bucket count increased after the database is created, so I cannot insert more than 25 million records.
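For context, the bucket count has to be set at creation time. A minimal sketch of the kind of setup in question, assuming the Kyoto Cabinet Java binding (kyotocabinet.DB); the file name and bnum value are illustrative:

```java
import kyotocabinet.DB;

public class KcBucketTuning {
    public static void main(String[] args) {
        DB db = new DB();
        // bnum (the bucket count) is passed as a tuning parameter in the path
        // and cannot be changed after the .kch file has been created.
        if (!db.open("data.kch#bnum=25000000", DB.OWRITER | DB.OCREATE)) {
            System.err.println("open error: " + db.error());
            return;
        }
        db.set("some-key", "some-value");
        db.close();
    }
}
```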

1/ Why does the KC insertion rate become very low as soon as the bucket count exceeds 25M?

With Berkeley DB: the best speed I got is slightly lower than with KC, closer to 50,000 records per second, but still fine. With the default settings, just as with KC, the speed drops suddenly after about one million records. I know that BDB is designed to grow its bucket count gradually. Regardless, I tried to increase the initial number by playing with HashNumElements and FillFactor, but every one of these attempts made things worse. So I still cannot insert more than 1-2 million records with BDB. I tried enabling non-synchronized transactions, tried different checkpoint frequencies, and increased the cache size. Nothing fixes the drop.
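For reference, a minimal sketch of the kind of tuning attempted, assuming the C-based Berkeley DB Java binding (com.sleepycat.db, not BDB Java Edition); the element count and fill factor are placeholder values:

```java
import com.sleepycat.db.Database;
import com.sleepycat.db.DatabaseConfig;
import com.sleepycat.db.DatabaseType;

public class BdbHashTuning {
    public static void main(String[] args) throws Exception {
        DatabaseConfig config = new DatabaseConfig();
        config.setAllowCreate(true);
        config.setType(DatabaseType.HASH);
        // Pre-size the hash table for the expected record count
        // instead of letting it grow incrementally.
        config.setHashNumElements(100_000_000);
        config.setHashFillFactor(16);

        Database db = new Database("data.bdb", null, config);
        // ... puts / gets ...
        db.close();
    }
}
```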

2/ What can cause the BDB insertion rate to collapse after 1-2 million inserts?

Note: I work in Java. When the speed drops, CPU usage falls to 0-30%, whereas it stays at 100% when everything works correctly.
Note: Stopping the process and resuming the inserts changes nothing, so I don't think this is due to memory limits or garbage collection.

thanks.

1 answer

Here is how I managed to store billions of records despite the write limitation I ran into with KC.

Despite much effort, I never solved the underlying problem for either Kyoto Cabinet or Berkeley DB. However, I came up with an interesting workaround using Kyoto Cabinet.

I noticed that while I cannot write more than 25 million records into a single KC file, reads have no such limit - they stay fast regardless of the database size. The solution I found is to create a new KC file (a new database) for every 25 million new records. That way reads hit many KC files but remain fast, while writes go only to the most recently created file and are also fast. The only remaining problem is allowing updates/deletes of records that live in the earlier files. For that, I copied the SSTables approach, which works as follows (a rough Java sketch is given after the list):

  • All files from 0 to N-1 are read-only; file N is read + write.
  • Every insert / update / delete goes into file N.
  • A read checks the files from N down to 0 and returns the first version of the record it finds, i.e. the most recently inserted / updated / deleted one.
  • A Bloom filter is attached to each file to avoid hitting a file that does not contain the requested record.
  • Once file N reaches 25M records, it becomes read-only and file N+1 is created.
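A minimal Java sketch of this scheme, assuming the Kyoto Cabinet Java binding (kyotocabinet.DB) and Guava's BloomFilter are on the classpath; the class name, segment file names and bnum value are invented for illustration, and error handling is stripped to the essentials:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import kyotocabinet.DB;

public class ShardedKcStore {
    private static final int MAX_RECORDS_PER_FILE = 25_000_000;

    // segments.get(i) corresponds to file i; the last one is the write target.
    private final List<DB> segments = new ArrayList<>();
    private final List<BloomFilter<String>> filters = new ArrayList<>();
    private long recordsInLastSegment = 0;

    public ShardedKcStore() {
        rollToNewSegment();
    }

    /** Open file N+1 and make it the new write target. */
    private void rollToNewSegment() {
        DB db = new DB();
        String path = String.format("segment-%03d.kch#bnum=50000000", segments.size());
        if (!db.open(path, DB.OWRITER | DB.OCREATE)) {
            throw new RuntimeException("open error: " + db.error());
        }
        segments.add(db);
        filters.add(BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), MAX_RECORDS_PER_FILE));
        recordsInLastSegment = 0;
    }

    /** Inserts and updates always go into the newest file. */
    public void put(String key, String value) {
        if (recordsInLastSegment >= MAX_RECORDS_PER_FILE) {
            rollToNewSegment();
        }
        int last = segments.size() - 1;
        segments.get(last).set(key, value);
        filters.get(last).put(key);
        recordsInLastSegment++;
    }

    /** Reads walk the files from newest to oldest; the Bloom filter
     *  lets us skip files that certainly do not contain the key. */
    public String get(String key) {
        for (int i = segments.size() - 1; i >= 0; i--) {
            if (!filters.get(i).mightContain(key)) {
                continue;
            }
            String value = segments.get(i).get(key);
            if (value != null) {
                return value;  // newest version wins
            }
        }
        return null;
    }

    public void close() {
        for (DB db : segments) {
            db.close();
        }
    }
}
```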

Notes:

  • As with SSTables, if many updates / deletes are performed, compaction may eventually be needed. However, unlike with SSTables, there is no need to rewrite the whole file to compact it: obsolete entries are simply removed from the KC files, and if a KC file ends up nearly empty, it can either be deleted (moving its remaining entries into file N) or reopened for new inserts, provided the files after it have already been compacted.
  • A delete does not remove the record; it writes a special tombstone value that marks the record as deleted. Deleted records are physically removed only during compaction (a sketch follows these notes).
  • Checking whether a record exists would normally require a database lookup. Thanks to the Bloom filters, most negative answers can be given without touching the disk.
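A hedged sketch of the tombstone handling mentioned above, continuing the ShardedKcStore example; the marker value is an arbitrary choice for illustration:

```java
public final class Tombstones {
    // Arbitrary marker; a real implementation would reserve a value
    // that ordinary data can never contain.
    static final String TOMBSTONE = "\u0000__deleted__";

    /** "Delete" a key by writing the tombstone into the newest file. */
    static void delete(ShardedKcStore store, String key) {
        store.put(key, TOMBSTONE);
    }

    /** A tombstone found in a newer file hides any older value, so the key
     *  reads as absent. Compaction later removes both versions for real. */
    static String resolve(String rawValue) {
        return TOMBSTONE.equals(rawValue) ? null : rawValue;
    }
}
```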