Best hash function for mixed numeric and literal identifiers

For performance reasons I need to split a set of objects, each identified by a string, into groups. An object can be identified either by a number or by a string in prefixed (qualified) form, with dots separating the parts of the identifier:

12 323 12343 2345233 123123131 ns1:my.label.one ns1:my.label.two ns1:my.label.three ns1:system.text.one ns2:edit.box.grey ns2:edit.box.black ns2:edit.box.mixed 

Numeric identifiers range from 1 to several million. Many of the text identifiers are likely to start with the same namespace prefix (ns1:) and the same path prefix (edit.box.).

What is the best hash function for this purpose? It would be nice if I could somehow predict bucket sizes from the statistics of the object identifiers. Are there any good articles on building a good hash function from such statistical information?

There are several million such identifiers, and the goal is to divide them into groups of 1-2 thousand using a hash function.

+6
algorithm hash-function
3 answers

Two good hash functions can be mapped into the same value space and, as a rule, combining them will not cause any new problems.

So your hash function might look like this:

 if it is an integer value:
     return int_hash(integer value)
 return string_hash(string value)

If your integers do not clump around particular values modulo N, where N is the possible number of buckets, then int_hash can simply return its input.
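
A minimal sketch of that dispatch in C (the names id_hash, int_hash and bucket_of are made up for illustration; string_hash is assumed to be any decent string hash, such as the djb2 sketch further down):

 #include <ctype.h>
 #include <stddef.h>
 #include <stdint.h>

 uint32_t string_hash(const char *s);   /* e.g. the djb2 sketch below */

 static uint32_t int_hash(uint32_t value) {
     /* Identity is fine as long as the integers are not clumped
      * modulo the bucket count N. */
     return value;
 }

 uint32_t id_hash(const char *id) {
     /* Treat an all-digit identifier as numeric, anything else as text. */
     uint32_t value = 0;
     int numeric = (id[0] != '\0');
     for (const char *p = id; *p; ++p) {
         if (!isdigit((unsigned char)*p)) { numeric = 0; break; }
         value = value * 10 + (uint32_t)(*p - '0');
     }
     return numeric ? int_hash(value) : string_hash(id);
 }

 size_t bucket_of(const char *id, size_t nbuckets) {
     /* nbuckets chosen so a few million identifiers give ~1-2k per bucket. */
     return id_hash(id) % nbuckets;
 }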

Choosing a string hash is not a new problem. Try "djb2" ( http://www.cse.yorku.ca/~oz/hash.html ) or something similar if you don't have extreme performance requirements.
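
For reference, djb2 as described on that page, adapted here to return a 32-bit value (the original uses unsigned long):

 #include <stdint.h>

 /* djb2 (Bernstein): hash = hash * 33 + c, starting from 5381. */
 uint32_t string_hash(const char *s) {
     uint32_t hash = 5381;
     int c;
     while ((c = (unsigned char)*s++) != 0)
         hash = ((hash << 5) + hash) + (uint32_t)c;   /* hash * 33 + c */
     return hash;
 }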

I don't think there is much to be gained by adapting the hash function to account for common prefixes. If your hash function is good to begin with, it is unlikely that common prefixes will create any clumping of hash values.

If you do this, and your hash does not fail unexpectedly, and you throw several million hash values into several thousand buckets, then the bucket populations will typically be distributed with a mean of (several million / several thousand) and a variance of 1/12 * (several thousand)^2.

With an average of 1,500 records per bucket, that makes the standard deviation about 430. 95% of a normal distribution lies within 2 standard deviations of the mean, so 95% of your buckets would contain 640-2360 records, unless I have my sums wrong. Is that good enough, or do you need the bucket sizes to be closer together?

+3

You could probably get away with using SHA-1 and truncating it to whatever size you want.

That would not be terribly efficient, but perhaps the hash function won't be a bottleneck anyway?
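
If the cost is acceptable, a sketch of that idea using OpenSSL's SHA1() and keeping only the first four bytes of the digest (this assumes libcrypto is available; SHA1() is deprecated in recent OpenSSL releases but still works):

 #include <stdint.h>
 #include <string.h>
 #include <openssl/sha.h>   /* link with -lcrypto */

 /* Hash an identifier by computing SHA-1 and truncating the 20-byte
  * digest to 32 bits; which bytes are kept does not really matter. */
 uint32_t sha1_truncated_hash(const char *id) {
     unsigned char digest[SHA_DIGEST_LENGTH];
     SHA1((const unsigned char *)id, strlen(id), digest);

     uint32_t h;
     memcpy(&h, digest, sizeof h);   /* keep the first 4 bytes */
     return h;
 }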

0

I believe CRC16 would be a reasonable hash to use on these strings, and the groups should not grow beyond 1-2 thousand.

That should make the hash table about 1 MB, plus however many items you have times 4 bytes, so we are talking about 50 MB, and then you also have all the actual data being stored, which should be quite small.
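
A bitwise sketch of a CRC-16 along those lines (the ARC variant with reflected polynomial 0xA001 is assumed; any CRC-16 variant, or a table-driven implementation, would do just as well for bucketing):

 #include <stddef.h>
 #include <stdint.h>
 #include <string.h>

 /* Bit-by-bit CRC-16 (ARC variant, reflected polynomial 0xA001). */
 uint16_t crc16(const unsigned char *data, size_t len) {
     uint16_t crc = 0x0000;
     for (size_t i = 0; i < len; ++i) {
         crc ^= data[i];
         for (int bit = 0; bit < 8; ++bit)
             crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : (crc >> 1);
     }
     return crc;
 }

 /* A 16-bit hash gives at most 65536 buckets, which with a few
  * million identifiers already keeps groups well under 1-2 thousand. */
 uint16_t crc16_str(const char *id) {
     return crc16((const unsigned char *)id, strlen(id));
 }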

0
