Hashtable / Dictionary Collisions

Using only standard English letters and the underscore, what is the maximum number of characters I can use without risking a potential collision in a hash table / dictionary?

So, strings like:

blur Blur b Blur_The_Shades_Slightly_With_A_Tint_Of_Blue 

...

+4
string dictionary hashtable math c#
5 answers

There is no guarantee that you will not encounter collisions even with single characters.

You probably won't, but the algorithm used in string.GetHashCode is unspecified and may change. (In particular, it changed between .NET 1.1 and .NET 2.0, which burned people who had assumed it would not change.)

Note that hash code collisions will not stop a well-designed hash table from working - you can still get at the values you want, the table just has to check more than one key using equality when keys share a hash code.

Any dictionary that relies on hash codes being unique is missing important information about hash codes, IMO :) (Unless it is working under very special conditions where it absolutely knows the codes will be unique, i.e. it is using a perfect hash function.)
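
To illustrate that last point, here is a minimal sketch (my own, not from the answer): a key type whose GetHashCode always returns the same value still works correctly as a Dictionary key, because the dictionary falls back to Equals to resolve entries within a bucket.

    using System;
    using System.Collections.Generic;

    // A key type whose GetHashCode deliberately collides for every value.
    sealed class CollidingKey : IEquatable<CollidingKey>
    {
        public string Value { get; }
        public CollidingKey(string value) { Value = value; }

        public override int GetHashCode() { return 42; }  // every key hashes identically
        public bool Equals(CollidingKey other) { return other != null && Value == other.Value; }
        public override bool Equals(object obj) { return Equals(obj as CollidingKey); }
    }

    class Program
    {
        static void Main()
        {
            var map = new Dictionary<CollidingKey, int>();
            map[new CollidingKey("blur")] = 1;
            map[new CollidingKey("Blur")] = 2;
            map[new CollidingKey("b")] = 3;

            // Despite 100% hash collisions the lookups still return the right
            // values; each one just scans the single shared bucket using Equals,
            // making it O(n) per lookup instead of O(1). Correctness holds.
            Console.WriteLine(map[new CollidingKey("Blur")]);  // prints 2
        }
    }

The cost of collisions is therefore performance, not correctness: with a constant hash code every lookup degenerates into a linear scan of one bucket.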

+15

Assuming an ideal hash function (which you generally will not have, as others have pointed out), you can find the maximum possible number of characters that guarantees no two strings will cause a collision, as follows:


No. of unique hash codes available = 2^32 = 4,294,967,296 (assuming a 32-bit integer is used for hash codes)

Character set size = 2 × 26 + 1 = 53 (26 lowercase plus 26 uppercase letters of the Latin alphabet, plus the underscore character)

Then consider that a string of length l (or less) has a total of 54^l representations. Note that the base is 54 rather than 53 because a string can 'end' after any character, effectively adding one extra character possibility - not that this affects the result much.

Taking the number of unique hash codes as the maximum number of string representations, you get the following simple equation:

    54^l = 2^32

And solving this:

    log2(54^l) = 32
    l * log2(54) = 32
    l = 32 / log2(54) ≈ 5.56

(Where log2 is the base-2 logarithm function.)

Since string lengths clearly cannot be fractional, you take the integer part, giving a maximum length of just 5. Very short indeed, but note that this restriction would rule out even the remotest chance of a collision, given an ideal hash function.
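
A quick check of the arithmetic (a throwaway sketch; Math.Log2 needs .NET Core 3.0 or later, use Math.Log(54, 2) on older frameworks):

    using System;

    class MaxCollisionFreeLength
    {
        static void Main()
        {
            // l = 32 / log2(54): the largest l for which 54^l <= 2^32.
            double l = 32 / Math.Log2(54);
            Console.WriteLine(l);              // ~5.56
            Console.WriteLine(Math.Floor(l));  // 5 -- the usable maximum length
        }
    }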


This is mostly theoretical, as mentioned, and I am not sure how useful it is in design considerations. That said, hopefully it helps you understand the issue from a theoretical standpoint, on top of which you can add practical considerations (e.g. imperfect hash functions, non-uniform distributions).

+3

Universal hashing

To calculate the probability of collisions with S strings of length L, with W bits per character, hashed to H bits, assuming an optimal universal hash (1), you can work out the collision probability based on a hash table of size (number of buckets) N.

First, we can assume an ideal hash table implementation (2) that splits the H bits of the hash perfectly across the available N buckets (3). This means H becomes meaningless except as a limit on N. W and L are simply the basis for an upper bound on S. For simpler maths, assume that strings of length < L are simply padded to L with a special null character. If we were interested in the worst case it would be 54^L (26 × 2 + '_' + null), but that is plainly a ridiculous number; the actual number of entries is more useful than the character set and length, so we will simply treat S as a variable in its own right.

We are left trying to put S items into N buckets. This then becomes a very well-known problem, the birthday paradox.

Solving this for various probabilities and numbers of buckets is instructive, but assuming we have 1 billion buckets (roughly 4 GB of memory on a 32-bit system), we would need only about 37 thousand entries before reaching a 50% chance of at least one collision. Given that, trying to avoid any collision at all in a hash table becomes plainly absurd.
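
A sketch of that estimate (my own illustration), using the standard birthday-paradox approximation P ≈ 1 − e^(−S² / 2N):

    using System;

    class BirthdayBound
    {
        // Approximate probability of at least one collision when s items
        // are hashed uniformly into n buckets: 1 - e^(-s^2 / 2n).
        static double CollisionProbability(double s, double n)
        {
            return 1.0 - Math.Exp(-(s * s) / (2.0 * n));
        }

        static void Main()
        {
            double buckets = 1e9;  // ~1 billion buckets (~4 GB on a 32-bit system)
            Console.WriteLine(CollisionProbability(37000, buckets));   // ≈ 0.50
            Console.WriteLine(CollisionProbability(100000, buckets));  // ≈ 0.99
        }
    }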

None of this means we should not care how our hash functions behave. Clearly, these numbers assume ideal implementations; they are an upper bound on how well we can do. A poor hash function can make collisions far more likely in some areas, throwing away parts of the possible "space" by never or rarely using them, all of which can make the hashing less than optimal and even degrade to performance that looks like a list's, but with much worse constant factors.

The .NET framework's implementation of the string hash function is not great (in that it could be better), but it is probably acceptable for the vast majority of users and is reasonably efficient to compute.

Alternative Approach: Perfect Hashing

If you wish, you can generate what are known as perfect hashes. This requires full knowledge of the input values in advance, however, so it is not often useful. In a similar vein to the maths above, we can show that even perfect hashing has its limits:

Recall the limit of 54^L strings of length L. However, we only have H bits (assume 32), which gives about 4 billion distinct numbers. So if you can have truly any string, and any number of them, you have to satisfy:

    54^L <= 2^32

And solving this:

    log2(54^L) <= 32
    L * log2(54) <= 32
    L <= 32 / log2(54) ≈ 5.56

Since string lengths clearly cannot be fractional, you are left with a maximum length of only 5. Very short.

If you know you will only ever have a set of strings well below 4 billion in size, then perfect hashing lets you handle any value of L, but restricting the set of values can be very hard in practice: you must know them all in advance, or degrade to what amounts to a database of string → hash that you add to as new strings are encountered (sketched below).
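
A minimal sketch of that degenerate case (the class is my own illustration, not from the answer): assign each new string the next sequential id, growing the string → hash database as strings are encountered. Note the irony that it needs an ordinary dictionary, and hence ordinary hashing, for its own bookkeeping.

    using System;
    using System.Collections.Generic;

    class GrowingPerfectHash
    {
        private readonly Dictionary<string, int> _ids = new Dictionary<string, int>();

        // Returns a collision-free id for s, assigning the next unused
        // integer the first time a string is seen.
        public int GetId(string s)
        {
            int id;
            if (!_ids.TryGetValue(s, out id))
            {
                id = _ids.Count;  // sequential ids are unique by construction
                _ids[s] = id;
            }
            return id;
        }
    }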


1. A universal hash is optimal for this exercise, since we want to minimise the chance of any collision, i.e. for any input, the probability of it producing output x from the set of possibilities R is 1/R.

2. Note that achieving an optimal split of the hash (and the internal bucketing) is quite hard, but you should expect the built-in types to be reasonable, if not always ideal.

3. This glosses over the question of closed vs. open addressing. It does have some bearing on the probabilities, but not significantly.

+3

A hash algorithm is not supposed to guarantee uniqueness. Given that there are vastly more potential strings (26^n for length n, even ignoring special characters, spaces, uppercase letters, non-English characters, etc.) than there are slots in your hash table, no such guarantee is possible. It is only supposed to guarantee a good distribution.
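
A quick pigeonhole check of that claim (a throwaway sketch): even lowercase-only strings of length 7 already outnumber the 2^32 possible hash codes.

    using System;

    class Pigeonhole
    {
        static void Main()
        {
            Console.WriteLine(Math.Pow(26, 7));  // ~8.03e9 possible strings
            Console.WriteLine(Math.Pow(2, 32));  // ~4.29e9 possible hash codes
        }
    }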

+1

If your key is a string (e.g. in a Dictionary), then GetHashCode() will be used; this is a 32-bit integer. Hashtable defaults to a load factor of 1.0 and grows the number of buckets to maintain that load factor. So if you do see collisions, they should tend to occur around rehashing boundaries (and should decrease shortly after a rehash).
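
A small sketch of those defaults (using the standard System.Collections.Hashtable constructor; the specific capacity here is illustrative):

    using System;
    using System.Collections;

    class LoadFactorDemo
    {
        static void Main()
        {
            // String keys are hashed via GetHashCode(), a 32-bit integer:
            Console.WriteLine("blur".GetHashCode());

            // Hashtable takes an explicit load factor (1.0 is the default and
            // the maximum); it grows its bucket count to keep entries/buckets
            // at or below that ratio, so collisions cluster just before those
            // rehashing boundaries.
            var table = new Hashtable(16, 1.0f);
            table["blur"] = 1;
            table["Blur"] = 2;
            Console.WriteLine(table["Blur"]);  // 2
        }
    }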

0
