When to resize a hash table?

In various hash table implementations, I have seen "magic numbers" for when a mutable hash table should resize (grow). Usually this number is somewhere between 65% and 80% of the allocated slots being filled. My guess is that the trade-off is that a higher threshold allows more collisions, while a lower one uses more memory.

My question is: how is this number arrived at?

Is it arbitrary? Based on testing? Based on some other logic?
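For concreteness, here is a rough sketch of the kind of check I mean (my own illustration; the 0.75 threshold, class name, and linear probing are just assumptions for the example):

```python
# Hypothetical sketch: resize once the load factor
# (stored entries / available slots) crosses a fixed threshold.
LOAD_FACTOR_LIMIT = 0.75  # the "magic number" in question

class Table:
    def __init__(self, capacity=8):
        self.slots = [None] * capacity
        self.count = 0

    def put(self, key, value):
        # Linear probing: scan forward from the hashed slot to a free
        # slot or to an existing entry with the same key.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        if self.slots[i] is None:
            self.count += 1
        self.slots[i] = (key, value)
        self._maybe_grow()

    def _maybe_grow(self):
        # Double the table when the load factor exceeds the magic number.
        if self.count / len(self.slots) > LOAD_FACTOR_LIMIT:
            old = [e for e in self.slots if e is not None]
            self.slots = [None] * (len(self.slots) * 2)
            self.count = 0
            for key, value in old:
                self.put(key, value)
```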

+8
hashtable algorithm
5 answers

My guess is that most people at least start from the numbers in a book (e.g. Knuth, Volume 3), which were obtained through testing. Depending on the situation, some may carry out testing afterwards and make adjustments accordingly, but from what I've seen, those are probably in the minority.

As I stated in a previous answer, the "right" number also depends heavily on how you resolve collisions. For better or worse, this fact seems to be widely ignored; people often don't pick numbers that are particularly suitable for the collision resolution they use.

OTOH, another point I found in my testing is that it only rarely makes a huge difference. You can pick numbers across a fairly wide range and get pretty similar overall speed. The main thing is to be careful not to push the number too high, especially if you're using something like linear probing for collision resolution.
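As a rough illustration of why pushing the number too high hurts linear probing in particular, here is a small simulation sketch (my own, with an arbitrary table size and random integer keys, not a benchmark from any real implementation) that measures average probes per successful lookup at different load factors:

```python
import random

def probes_linear(load_factor, capacity=1 << 14, samples=20000):
    """Average probes per successful lookup under linear probing."""
    slots = [None] * capacity
    keys = random.sample(range(10 ** 9), int(capacity * load_factor))
    # Insert with linear probing (random keys stand in for a decent hash).
    for k in keys:
        i = k % capacity
        while slots[i] is not None:
            i = (i + 1) % capacity
        slots[i] = k
    # Look up a sample of the inserted keys and count probes.
    lookups = random.sample(keys, min(samples, len(keys)))
    total = 0
    for k in lookups:
        i, n = k % capacity, 1
        while slots[i] != k:
            i, n = (i + 1) % capacity, n + 1
        total += n
    return total / len(lookups)

for lf in (0.5, 0.65, 0.8, 0.9, 0.95):
    print(f"load factor {lf:.2f}: ~{probes_linear(lf):.1f} probes per lookup")
```

The cost stays nearly flat across the moderate range and only climbs sharply near the top, which is why a broad band of thresholds performs about the same.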

+5

I believe you don't want to consider so much "how full" the table is (how many buckets out of the total have values) but rather the number of collisions it may take to find a spot for a new item.

I read a couple of compiler books years ago (I can't remember the names or authors) that suggested just using linked lists until you have more than 10 to 12 items. That would seem to support the idea that more than 10 collisions means it's time to resize.

The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon suggests that an average hash chain length of 5 (which in that algorithm is the average number of collisions) is enough to trigger a rehash. That seems to be supported by testing, though I'm not sure I'm reading the paper correctly.

It seems that the resizing condition is mainly the result of testing.
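Purely as an illustrative sketch of that kind of trigger (this is not the Icon implementation; the class name and doubling policy are my own assumptions), a chained table that rebuilds when the average chain length passes a threshold might look like this:

```python
# Illustrative only: rebuild when the average chain length exceeds a threshold.
MAX_AVG_CHAIN = 5  # the trigger value the paper reportedly uses

class ChainedTable:
    def __init__(self, buckets=8):
        self.buckets = [[] for _ in range(buckets)]
        self.count = 0

    def put(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # update existing entry
                return
        chain.append((key, value))
        self.count += 1
        # Average chain length == entries / buckets for separate chaining.
        if self.count / len(self.buckets) > MAX_AVG_CHAIN:
            self._rebuild()

    def _rebuild(self):
        # Double the bucket count and reinsert everything.
        old = [entry for chain in self.buckets for entry in chain]
        self.buckets = [[] for _ in range(len(self.buckets) * 2)]
        self.count = 0
        for key, value in old:
            self.put(key, value)
```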

+5

It depends on the keys. If you know that your hash function is perfect for all possible keys (for example, using gperf), then you know that you'll have only a few collisions, so the number can be higher.

But most of the time you know little about the keys except that they are text. In this case you have to guess, since you don't even have test data to figure out in advance how your hash function behaves.

So you hope for the best. If your hash function is very bad for your keys, then you will have many collisions and the grow point will never be reached. In that case, the chosen figure doesn't matter.

If your hash function is adequate, then it should create only a few collisions (less than 50%), so a number between 65% and 80% seems reasonable.

That said: if your hash table doesn't have to be perfect (i.e. huge or heavily accessed), don't bother. If you have, say, ten elements, pondering these issues is a waste of time.
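If you do happen to have sample keys, a quick way to sanity-check a hash function before settling on a threshold is to measure how often an insert lands on an already-used slot. A rough sketch (my own; it only counts first-probe collisions, and the sample keys, capacity, and the deliberately bad length-based hash are made up for illustration):

```python
def collision_rate(keys, hash_fn, capacity):
    """Fraction of keys whose home slot is already taken by an earlier key."""
    used = set()
    collisions = 0
    for k in keys:
        slot = hash_fn(k) % capacity
        if slot in used:
            collisions += 1
        else:
            used.add(slot)
    return collisions / len(keys)

sample = [f"user-{i}" for i in range(3000)]
print(collision_rate(sample, hash, capacity=4096))              # decent hash: low rate
print(collision_rate(sample, lambda s: len(s), capacity=4096))  # terrible hash: nearly 1.0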

+2

As far as I know, the figure is a heuristic based on empirical testing.

Given a reasonably good distribution of hash values, it seems that the magic load factor is, as you say, usually around 70%. A lower load factor means you're wasting space for no real benefit; a higher load factor means you'll use less space but spend more time dealing with hash collisions.

(Of course, if you know that your hash values are perfectly distributed, your load factor can be 100% and you'll still have no wasted space and no hash collisions.)
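For some intuition on why roughly 70% is the usual sweet spot, the classic textbook estimates of expected probes per successful search stay low up to about that point and then climb steeply. A small sketch tabulating those standard formulas (under the assumption of well-distributed hash values; the exact numbers are estimates, not measurements of any particular implementation):

```python
import math

# Expected probes per successful search as a function of load factor a
# (classic Knuth-style estimates, assuming uniformly distributed hashes):
#   separate chaining:            1 + a/2
#   uniform open addressing:      (1/a) * ln(1/(1 - a))
#   linear probing:               (1 + 1/(1 - a)) / 2
for a in (0.5, 0.7, 0.8, 0.9):
    print(f"a={a:.1f}  chaining={1 + a / 2:.2f}  "
          f"uniform={(1 / a) * math.log(1 / (1 - a)):.2f}  "
          f"linear={(1 + 1 / (1 - a)) / 2:.2f}")
```

Around a load factor of 0.7 the open-addressing costs are still close to their minimum, while by 0.9 the linear-probing estimate has more than doubled, which matches the usual advice.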

+1

Collisions depend heavily on the data and on the hash function used.

Most numbers are based on heuristics or on the assumption of a normal distribution of hash values. (AFAIK values around 70% are typical for extensible hash tables, but you can always construct a data stream that yields far more or far fewer collisions.)
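For example (a contrived sketch of my own, with a deliberately weak modulo hash), here is a data stream that defeats the load-factor heuristic entirely by sending every key to the same bucket:

```python
# With 1024 buckets and a plain modulo hash, any keys that are congruent
# mod 1024 pile into a single chain, regardless of load factor.
BUCKETS = 1024
hash_fn = lambda k: k % BUCKETS   # stand-in for a weak integer hash

adversarial_keys = [BUCKETS * i + 7 for i in range(10_000)]
occupied = {hash_fn(k) for k in adversarial_keys}
print(len(occupied))   # 1 -- every key collides into bucket 7
```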

+1
