C: Storing up to a million entries in a hash table

I am working on a project where efficiency is critical, and a hash table looks like a good fit, since I need to look up a node's memory address from its key. The only problem I foresee is that the table may have to hold up to 1 million records. As I understand it, hash table buckets are usually linked lists, so that multiple entries hashing to the same bucket can be stored. It seems to me that with a million entries these lists would become too slow. What is the usual way to implement something like this? Perhaps swapping the standard linked list for a skip list?

+4
source
6 answers

If you want a hash table with a million records, you should usually have at least 2 million buckets. I don't remember all the statistics (the key term is "birthday paradox"), but the vast majority of buckets will have zero or one element. In principle you can be very unlucky and get all the items in one bucket, but you would have to be unluckier than someone who gets struck by lightning twice in the same day.

For hash tables that grow, the normal trick is to grow by a constant factor - the usual textbook case being to double the size of the bucket array. You do this whenever the number of items in the table reaches a certain fraction of the number of buckets, regardless of how many buckets are actually occupied. This gives amortized expected O(1) performance for insert, delete, and lookup.

The linked list in each bucket of the hash table is just a way of handling collisions - unlikely for any single operation, but over the lifetime of a sizeable hash table they do occur, especially once the table gets more than about half full.
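To make the two paragraphs above concrete, here is a minimal sketch of a chained hash table in C, with made-up names (ht_init, ht_insert, ht_grow); it doubles its power-of-two bucket array once the load factor reaches 0.75. It is an illustration of the doubling scheme described above, not any particular library's implementation:

```c
#include <stdlib.h>
#include <string.h>

/* One entry in a bucket's collision chain. */
typedef struct entry {
    char *key;
    void *value;                     /* e.g. the node's memory address */
    struct entry *next;
} entry;

typedef struct {
    entry **buckets;
    size_t nbuckets;                 /* always a power of two */
    size_t nitems;
} hashtable;

/* Any reasonable string hash will do; this is 32-bit FNV-1a. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

int ht_init(hashtable *ht, size_t initial_buckets)  /* power of two */
{
    ht->nbuckets = initial_buckets;
    ht->nitems = 0;
    ht->buckets = calloc(initial_buckets, sizeof *ht->buckets);
    return ht->buckets ? 0 : -1;
}

/* Double the bucket array and rehash every entry into it. */
static void ht_grow(hashtable *ht)
{
    size_t newsize = ht->nbuckets * 2;
    entry **newbuckets = calloc(newsize, sizeof *newbuckets);
    if (!newbuckets)
        return;                      /* keep the old table if allocation fails */

    for (size_t i = 0; i < ht->nbuckets; i++) {
        for (entry *e = ht->buckets[i]; e; ) {
            entry *next = e->next;
            size_t idx = hash_str(e->key) & (newsize - 1);
            e->next = newbuckets[idx];
            newbuckets[idx] = e;
            e = next;
        }
    }
    free(ht->buckets);
    ht->buckets = newbuckets;
    ht->nbuckets = newsize;
}

/* Insert, growing when items reach 3/4 of the bucket count. */
int ht_insert(hashtable *ht, const char *key, void *value)
{
    if (ht->nitems * 4 >= ht->nbuckets * 3)
        ht_grow(ht);

    size_t idx = hash_str(key) & (ht->nbuckets - 1);
    entry *e = malloc(sizeof *e);
    if (!e)
        return -1;
    e->key = strdup(key);            /* strdup is POSIX */
    e->value = value;
    e->next = ht->buckets[idx];
    ht->buckets[idx] = e;
    ht->nitems++;
    return 0;
}

/* Lookup walks the (normally very short) chain in one bucket. */
void *ht_lookup(const hashtable *ht, const char *key)
{
    size_t idx = hash_str(key) & (ht->nbuckets - 1);
    for (entry *e = ht->buckets[idx]; e; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}
```

Starting with something like ht_init(&ht, 1u << 20) for an expected million entries keeps the chains at roughly one entry each on average, and the doubling on insert covers the case where that estimate turns out to be low.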

Linked lists are not the only way to handle collisions - there is a huge body of knowledge on the topic. Walter Bright (the developer of the D programming language) has advocated using binary trees rather than linked lists, claiming that his DMDScript got a significant performance boost over other JavaScript implementations from that design.

He was using plain (unbalanced) binary trees when I asked, so the worst-case performance is the same as for linked lists, but the key point, I assume, is that the binary-tree handling code is simple, and the hashing itself makes the chances of building large unbalanced trees very small.

In principle, you could just as easily use treaps, red-black trees or AVL trees. An interesting option might be splay trees for collision handling. But overall, this is a minor issue for a few library designers, and the truly obsessive, to worry about.
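For comparison, here is a hypothetical sketch of the tree-in-a-bucket idea (my own illustration, not Walter Bright's code): each bucket holds the root of a plain, unbalanced binary search tree ordered by the full hash value, with the key string as a tiebreaker.

```c
#include <stdlib.h>
#include <string.h>

/* A bucket is simply the root pointer of an unbalanced BST.  Ordering
 * by the already-scrambled hash value keeps comparisons cheap and makes
 * badly unbalanced trees unlikely in practice. */
typedef struct tnode {
    unsigned long hash;
    char *key;
    void *value;
    struct tnode *left, *right;
} tnode;

static int tnode_cmp(unsigned long h, const char *key, const tnode *n)
{
    if (h != n->hash)
        return h < n->hash ? -1 : 1;
    return strcmp(key, n->key);
}

void *tree_bucket_lookup(const tnode *root, unsigned long h, const char *key)
{
    while (root) {
        int c = tnode_cmp(h, key, root);
        if (c == 0)
            return root->value;
        root = (c < 0) ? root->left : root->right;
    }
    return NULL;
}

/* Insert into (or update) the bucket's tree; returns 0 on success. */
int tree_bucket_insert(tnode **root, unsigned long h, char *key, void *value)
{
    while (*root) {
        int c = tnode_cmp(h, key, *root);
        if (c == 0) {
            (*root)->value = value;  /* key already present: update */
            return 0;
        }
        root = (c < 0) ? &(*root)->left : &(*root)->right;
    }
    tnode *n = malloc(sizeof *n);
    if (!n)
        return -1;
    n->hash = h;
    n->key = key;
    n->value = value;
    n->left = n->right = NULL;
    *root = n;
    return 0;
}
```

The worst case is still linear, as noted above, but it only happens if many keys land in one bucket and arrive in sorted hash order, which a decent hash function makes very unlikely.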

+3
source

You lose all the benefits of a hash table if each bucket's list contains more than a few entries. The usual way to make a hash table scale to millions of records is to make the underlying bucket array resizable, so that even with millions of records the bucket lists stay short.

+3
source

You can use a tree instead of a list in the individual buckets (AVL or similar).

EDIT: a skip list would work too (and seems to be faster) - O(log n) is what you are aiming for.

+1
source

The total number of entries does not matter, only the average number of entries per bucket (N / number of buckets). Use a hash function with a larger range (e.g. 20 bits or more) to keep that average low.

Of course, it will take more memory, but that is the usual memory-versus-speed trade-off.
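As a concrete (assumed) example of that trade-off: a 32-bit hash masked down to a 20-bit bucket index gives 2^20 buckets, so a million entries works out to roughly one entry per bucket on average. The hash here is standard FNV-1a; any decent hash would do.

```c
#include <stdio.h>
#include <stdint.h>

#define HASH_BITS 20                        /* 2^20 = 1,048,576 buckets */
#define NBUCKETS  (1u << HASH_BITS)

/* 32-bit FNV-1a, reduced to a 20-bit bucket index by masking. */
static uint32_t bucket_index(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h & (NBUCKETS - 1);
}

int main(void)
{
    /* With a million entries spread over 2^20 buckets, the average
     * chain length stays below one entry per bucket. */
    printf("bucket of \"example\": %u\n", (unsigned)bucket_index("example"));
    printf("average entries per bucket: %.2f\n", 1000000.0 / NBUCKETS);
    return 0;
}
```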

+1
source

Not sure if this will help you or not, but maybe: http://memcached.org/

0
source

If your hashed keys are uniformly distributed over the buckets (this is a very big IF), then the expected number of insertions into the hash table before every bucket has been hit at least once is M * ln(M) (natural log, base e), where M is the number of buckets.

I was surprised that I could not find this result easily online!

I posted the derivation on my blog and verified it with code using rand(). It seems to be a pretty good estimate.
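If you want to check the figure without the derivation, here is a small simulation I sketched myself (not the blog's code): it throws uniformly random insertions at M buckets until every bucket has been hit at least once, then compares the count with M*ln(M). Compile with -lm.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(void)
{
    /* Keep M well below RAND_MAX so rand() % M can reach every bucket;
     * the slight modulo bias does not matter for a rough check. */
    const int M = 10000;                 /* number of buckets */
    char *hit = calloc(M, 1);
    int remaining = M;
    long inserts = 0;

    srand((unsigned)time(NULL));
    while (remaining > 0) {
        int b = rand() % M;              /* stand-in for a uniform hash */
        inserts++;
        if (!hit[b]) {
            hit[b] = 1;
            remaining--;
        }
    }

    printf("observed: %ld inserts, predicted M*ln(M): %.0f\n",
           inserts, M * log((double)M));
    free(hit);
    return 0;
}
```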

0
source
