Understanding hash code

A hash function is important when implementing a hash table. I know that in Java an Object has its own hash code, which may be produced by a weak hash function.

Below is a snippet that is described as an "additional hash function":

static int hash(Object x) {
    int h = x.hashCode();
    h += ~(h << 9);
    h ^= (h >>> 14);
    h += (h << 4);
    h ^= (h >>> 10);
    return h;
}

Can someone help explain the main idea of this hash algorithm? Is it to generate non-duplicate integers? If so, how do these bitwise operations achieve that?

+6
java hash
6 answers

A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, checksums, or simply hashes. (Wikipedia)

In more "human" language, an object's hash is a short, compact value derived from the object's properties. That is, if two objects differ in some way, you can expect their hash values to differ. A good hashing algorithm produces different values for different objects as far as possible.

+5

What you are usually trying to do with a hash algorithm is convert a large search key into a small non-negative number, so that you can look up the associated record in a table somewhere, and do it faster than the M log2 N (where M is the cost of a "comparison" and N is the number of items in the "table") typical of a binary search (or tree search).

If you are lucky enough to have a perfect hash, you know that every element of your (known!) key set will hash to a unique value. Perfect hashes are primarily of interest for things like compilers that need to look up language keywords.

In the real world, you have imperfect hashes, where several keys hash to the same value. That is fine: you now only have to compare the key against a small set of candidate matches (those that hash to that value) rather than a large set (the full table). The small sets are traditionally called "buckets." You use the hash algorithm to select a bucket, then you use some other searchable data structure for the bucket itself. (If the number of elements in a bucket is known, or safely expected, to be really small, a linear search is not unreasonable. Binary search trees are also reasonable.)
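The two steps described above (select a bucket by hash, then search inside it) can be sketched roughly as follows. This is a minimal illustration, not how java.util.HashMap is implemented; the class name BucketSetSketch and its methods are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal chained hash set of strings; names are illustrative only.
class BucketSetSketch {
    private final List<List<String>> buckets = new ArrayList<>();

    BucketSetSketch(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // Step 1: the hash selects a bucket. floorMod keeps the index
    // non-negative even when hashCode() is negative.
    private List<String> bucketFor(String key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    // Step 2: a linear search inside the (hopefully small) bucket.
    boolean contains(String key) {
        return bucketFor(key).contains(key);
    }

    // Adds the key unless it is already present; returns true if it was added.
    boolean add(String key) {
        List<String> bucket = bucketFor(key);
        if (bucket.contains(key)) {
            return false;
        }
        bucket.add(key);
        return true;
    }
}
```

Only the keys that land in the same bucket are ever compared with equals, which is where the speed-up over scanning the whole table comes from.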

The bitwise operations in your example look a lot like a signature-analysis shift register, which tries to compress a long unique bit pattern into a short pattern that is still unique.

+1

Basically, what you are trying to achieve with a hash function is to give every bit of the hash code a roughly 50% chance of being on or off for any particular item being hashed. That way, no matter how many buckets your hash table has (or, put another way, how many of the low bits you take to determine the bucket number), if each bit is as random as possible, an item will always be assigned to an essentially random bucket.

Now, in real life, many people use hash functions that are not that good. They have some randomness in some of the bits, but not in all of them. For example, imagine you have a hash function whose bits 6-7 are biased: say, in the typical hash code of an object, they have a 75% chance of being set. If our hash table has 256 buckets (i.e., the bucket number comes from bits 0-7 of the hash code), then we throw away the randomness that exists in bits 8-31, and a small portion of the buckets will tend to fill up (i.e., those whose numbers have bits 6 and 7 set).

An additional hash function basically tries to spread whatever randomness there is in the hash codes over a larger number of bits. So in our hypothetical example, the idea would be that some of the randomness from bits 8-31 gets mixed into the low bits and dilutes the bias of bits 6-7. It still will not be perfect, but it will be better than before.
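To make the mixing effect concrete, here is a small experiment using the function from the question. The setup is an assumption chosen for illustration: 256 hash codes that differ only in bits 8-15, placed in a 256-bucket table indexed by the low 8 bits. The class and method names are invented.

```java
import java.util.HashSet;
import java.util.Set;

class HashSpreadDemo {
    // The supplemental hash from the question, applied to a raw int hash code.
    static int spread(int h) {
        h += ~(h << 9);
        h ^= (h >>> 14);
        h += (h << 4);
        h ^= (h >>> 10);
        return h;
    }

    // Counts how many of 256 buckets are used by 256 hash codes that
    // differ only in bits 8-15 (their low 8 bits are all zero).
    static int bucketsUsed(boolean useSpread) {
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < 256; i++) {
            int rawHash = i << 8;
            int h = useSpread ? spread(rawHash) : rawHash;
            used.add(h & 0xFF); // bucket number = low 8 bits
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println("without mixing: " + bucketsUsed(false));
        System.out.println("with mixing:    " + bucketsUsed(true));
    }
}
```

Without mixing, every key lands in bucket 0, because the low 8 bits never change. With mixing, the higher bits influence the low bits, so the keys are spread over more than one bucket.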

+1

If you are building a hash table, the main thing you want from your hash function is uniformity; it does not have to produce completely unique values.

For example, if you have a hash table of size 10, you do not want a hash function that returns a hash of 3 over and over again. Otherwise, that particular bucket will force an O(n) lookup time. You want the hash function to return, for example, 1, 9, 4, 6, 8 ... so that none of your buckets is much heavier than the others.
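The difference can be checked with a few lines of code. This sketch uses made-up hash values for a table of size 10 and reports the size of the fullest bucket; the names are illustrative.

```java
class BucketLoadDemo {
    // Returns the size of the fullest bucket after placing the given
    // hash values into a table with bucketCount buckets.
    static int longestChain(int[] hashes, int bucketCount) {
        int[] load = new int[bucketCount];
        int longest = 0;
        for (int h : hashes) {
            int bucket = Math.floorMod(h, bucketCount);
            load[bucket]++;
            longest = Math.max(longest, load[bucket]);
        }
        return longest;
    }

    public static void main(String[] args) {
        int[] skewed = {3, 3, 3, 3, 3, 3, 3, 3, 3, 3};    // always bucket 3
        int[] spreadOut = {1, 9, 4, 6, 8, 0, 2, 7, 5, 3}; // one key per bucket
        System.out.println(longestChain(skewed, 10));     // prints 10
        System.out.println(longestChain(spreadOut, 10));  // prints 1
    }
}
```

The longest chain is the worst-case number of comparisons for a lookup, so the skewed function degrades to a linear scan while the uniform one needs a single comparison.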

For your projects, I would recommend using a well-known hash algorithm such as MD5 or, even better, SHA, taking the first k bits that you need and discarding the rest. These are time-tested functions, and as a programmer you would be smart to use them.
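One possible reading of "use the first k bits" is sketched below with the standard java.security.MessageDigest API. SHA-256 is picked here as the SHA variant, and the helper name firstBits, the UTF-8 encoding of the key, and the 1 to 31 limit on k are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestBucketSketch {
    // Hypothetical helper: hashes the key with SHA-256 and keeps the first
    // k bits (1 <= k <= 31) of the digest as a non-negative bucket index.
    static int firstBits(String key, int k) {
        if (k < 1 || k > 31) {
            throw new IllegalArgumentException("k must be between 1 and 31");
        }
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] d = sha.digest(key.getBytes(StandardCharsets.UTF_8));
            // Assemble the first four digest bytes into one 32-bit value.
            int top = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                    | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
            return top >>> (32 - k); // keep the k most significant bits
        } catch (NoSuchAlgorithmException e) {
            // Treated as a setup error in this sketch.
            throw new IllegalStateException(e);
        }
    }
}
```

A table with 2^k buckets could then use firstBits(key, k) directly as the bucket number. Note that a cryptographic digest costs considerably more to compute than a few shifts and XORs.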

+1

This code tries to improve the quality of the hash value by shuffling the bits around.

The overall effect is that, for a given x.hashCode(), you hopefully get a better distribution of hash values across the full range of integers. Some algorithms perform better if you start with a poor hashCode implementation but then improve the hash codes in this way.

For example, hashCode() for a humble Integer in Java simply returns the integer's value. While this is fine for many purposes, in some cases you want a much better hash code, so passing the hashCode through this kind of function can improve it significantly.

0

It can be anything you want, as long as you stick to the general contract described in the documentation, which, in my own words, is:

  • If you call hashCode 100 (N) times on an object, it must return the same value every time, at least during that execution of the program (a subsequent execution may return a different one)
  • If o1.equals(o2) is true, then o1.hashCode() == o2.hashCode() must be true as well
  • If o1.equals(o2) is false, then o1.hashCode() == o2.hashCode() may still be true, although that does not help performance.

That's it.
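A class that follows this contract might look like the sketch below. GridPoint is a hypothetical value class invented for the example; java.util.Objects.hash is a standard library helper.

```java
import java.util.Objects;

// Hypothetical value class used only to illustrate the contract.
final class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof GridPoint)) {
            return false;
        }
        GridPoint p = (GridPoint) other;
        return x == p.x && y == p.y;
    }

    // Built from exactly the fields that equals() compares, so equal
    // points always report equal hash codes. The fields are final, so
    // repeated calls on the same object return the same value.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

Two unequal points may still share a hash code; the contract allows that, and a hash table handles it by comparing with equals inside the bucket.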

Depending on the nature of your class, hashCode() can be very complex or very simple. For example, the String class, which may have millions of instances, needs a good hashCode and uses a prime multiplier (31) to reduce the chance of collisions.
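The String case can be sketched in a few lines. The Javadoc for String.hashCode() documents the formula s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]; the loop below is a re-implementation of that formula for illustration, not the JDK source, and the class name is made up.

```java
class StringHashSketch {
    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    // with int arithmetic, one character at a time.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // int overflow simply wraps around
        }
        return h;
    }
}
```

Because every character is multiplied by a different power of 31, strings with the same characters in a different order, such as "ab" and "ba", get different hash codes.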

If it makes sense for your class to use a sequential number as its hash code, that is fine too; there is no reason to make it complicated every time.

0
