Two good hash functions can be mapped into the same value space and, as a rule, combining them will not cause any new problems.
So your hash function might look like this:
    if the key is an integer value:
        return int_hash(integer value)
    else:
        return string_hash(string value)
If your integers don't clump around particular values modulo N, where N is the possible number of buckets, then int_hash can simply return its input.
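A quick sketch of why that caveat matters: if the integer keys do clump modulo the bucket count, the identity hash piles everything into a few buckets. The bucket count and key patterns below are made up for illustration.

```python
N = 1000  # hypothetical number of buckets

# Pathological case: every key is a multiple of N, so every key
# lands in the same bucket under the identity hash.
clumped_keys = [i * N for i in range(10_000)]
print(len({k % N for k in clumped_keys}))  # 1 bucket used

# Well-spread case: keys have no clumps modulo N, and the identity
# hash fills all N buckets evenly.
spread_keys = range(10_000)
print(len({k % N for k in spread_keys}))   # 1000 buckets used
```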
Choosing a string hash is not a new problem. Try "djb2" ( http://www.cse.yorku.ca/~oz/hash.html ) or similar, unless you have obscene performance requirements.
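For reference, djb2 is only a few lines: start from 5381 and fold each byte in with hash = hash * 33 + byte. A minimal Python version, truncated to 32 bits the way the C original wraps:

```python
def djb2(s):
    """djb2 string hash (Dan Bernstein): h = h * 33 + byte, seed 5381."""
    h = 5381
    for b in s.encode("utf-8"):
        h = (h * 33 + b) & 0xFFFFFFFF  # emulate C unsigned 32-bit overflow
    return h

print(djb2("a"))  # 5381 * 33 + 97 = 177670
```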
I don't think there are many ways to adjust the hash function to account for common prefixes. If your hash function is good to begin with, it is unlikely that common prefixes will create any clumping of hash values.
If you do this, and your hash doesn't fail unexpectedly, and you hash several million values into several thousand buckets, then typically the bucket populations will be normally distributed, with mean (several million / several thousand) and variance 1/12 * (several million / several thousand)^2.
With a mean of 1,500 records per bucket, that makes the standard deviation about 430. 95% of a normal distribution lies within 2 standard deviations of the mean, so 95% of your buckets will hold 640-2360 records, unless I've got my sums wrong. Is that good enough, or do you need the bucket sizes to be closer to equal than that?
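Checking those sums under the answer's own model (mean m, variance m^2/12), taking "several million / several thousand" as the illustrative figures 3,000,000 and 2,000:

```python
import math

mean = 3_000_000 / 2_000        # 1500 records per bucket
sd = mean / math.sqrt(12)       # sqrt(mean^2 / 12) ≈ 433, "about 430"
low = mean - 2 * sd             # 95% lower bound
high = mean + 2 * sd            # 95% upper bound
print(round(sd), round(low), round(high))  # 433 634 2366
```

Rounded loosely, that is the 640-2360 range quoted above.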
Steve Jessop