Efficient way to check for existence in a large rowset

Question

Efficient way to check for existence in a large rowset

I have a set of 100 + million lines, each of which is up to 63 characters long. I have a lot of disk space and very little memory (512 MB). I need to request existence alone and not store any additional metadata.

My de facto solution is BDB btree. Are there any preferred alternatives? I know that leveldb and Kyoto Cabinet are not familiar enough to identify the benefits.

+6

python berkeley-db leveldb kyotocabinet

Drew sears Nov 15 '12 at 20:19

source share

2 answers

You said you have a lot of discs, huh? One option would be to save strings in file names in nested subdirectories. You can use strings directly:

Save "pulled sears" to d/r/e/w/ sears

or by taking a hash of a string and following a similar process:

MD5 ('drew sears') = 'f010fe6e20d12ed895c10b93b2f81c6e'
Create an empty file named f0/10/fe/6e/20d12ed895c10b93b2f81c6e .

Think of it as an OS-optimized, NoSQL database based on a hash table.

Side benefits:

You can change your mind later and save the data in a file.
You can replicate your database to another system using rsync.

+1

Kirk Strauser Nov 15 '12 at 21:00

source share

senderle · Accepted Answer · 2012-11-15T20:47:02+0000

If false positives are acceptable, then one possible solution would be to use a bloom filter . Bloom filters are similar to hash tables, but instead of using a single hash value to index the bucket table, it uses multiple hashes to index the bitmap. Bits corresponding to these indices are set. Then, to check if there is a string in the filter, the string is hashed again, and if the corresponding indexes are set, then the string "is" in the filter.

It does not store any string information, so it uses very little memory - but if there is a collision between two lines, collision resolution is not possible. This means that there can be false positives (because a string that does not have a filter can have a hash with the same indices as a string that is in the filter). However, there can be no false negatives; any string that is actually in the set will be found in the color filter.

There are several Python implementations . It is also not difficult to collapse on your own; I remember how to code a fast-dirty bloom filter using bitarray , which worked very well.

Efficient way to check for existence in a large rowset

More articles: