What is a good 64-bit hash function in Java for text strings?

I am looking for a hash function that:

  • Hashes of text strings (e.g. multiple collisions)
  • written in Java and widely used
  • Bonus: works in several fields (instead of me, concatenating them and applying a hash on the concatenated string)
  • Bonus: has a 128-bit option.
  • Bonus: Not processor intensity.
+52
java string 64bit collision hash
Nov 02 '09 at 10:35
source share
8 answers

Why don't you use the default long option String.hashCode() (where some really smart guys are definitely working hard to make it effective - not to mention the thousands of eyes of developers who have already looked at this code)?

 // adapted from String.hashCode() public static long hash(String string) { long h = 1125899906842597L; // prime int len = string.length(); for (int i = 0; i < len; i++) { h = 31*h + string.charAt(i); } return h; } 

If you are looking for even more bits, perhaps you can use BigInteger Change:

As I mentioned in a comment on @brianegge's answer, there are not many abbreviations for hashes with more than 32 bits and most likely not one for hashes with more than 64 bits:

I could imagine a huge hash table distributed on dozens of servers, possibly storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bits allow the use of 2 x 32 (about 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite a few collisions. With 64-bit (18,446,744,073,000 different keys) you will definitely save, no matter what crazy scenario you need. The idea of ​​using 128-bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is largely impossible.

To combine the hash for multiple fields, just make XOR multiply one by simple and add them:

 long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2); 

A small simple place is there to avoid equal hash code for dial-up values, i.e. {'foo', 'bar'} and {'bar', 'foo'} are not equal and must have a different hash code. XOR is bad because it returns 0 if both values ​​are equal. Therefore, {'foo', 'foo'} and {'bar', 'bar'} will have the same hash code.

+54
Nov 02 '09 at 11:00
source share

Create a SHA-1 hash , and then mask the least significant 64 bits.

+5
Nov 02 '09 at 10:49
source share
 long hash = string.hashCode(); 

Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before encountering problems with hash collisions. The hash code in String is pretty efficient and well tested.

Update I think the above satisfies the simplest thing that could work, however I agree with @sfussenegger's idea of ​​extending the existing String hash.

Besides having a good hash code for your string, you might want to rename the hash code in your implementation. If your store is used by other developers or used with other types, this can help distribute your keys. For example, Java HashMap is based on hash tables with force characteristics of the length, so it adds this function to ensure that the low bits are sufficiently distributed.

  h ^= (h >>> 20) ^ (h >>> 12); return h ^ (h >>> 7) ^ (h >>> 4); 
+4
Nov 02 '09 at 11:42
source share

Why not use the CRC64 polynomial. They are efficient and optimized enough to ensure that all bits are counted and distributed across the result space.

There are many implementations on the net if you use Google CRC64 Java

+2
Jun 03 '10 at 10:15
source share

Do something like this:

 import java.io.ByteArrayOutputStream; import java.io.DataOutputStream; import java.io.IOException; import java.math.BigInteger; import java.security.MessageDigest; import java.security.NoSuchAlgorithmException; public class Test { public static void main(String[] args) throws NoSuchAlgorithmException, IOException { ByteArrayOutputStream baos = new ByteArrayOutputStream(); DataOutputStream dos = new DataOutputStream(baos); try { MessageDigest md = MessageDigest.getInstance("MD5"); SomeObject testObject = new SomeObject(); dos.writeInt(testObject.count); dos.writeLong(testObject.product); dos.writeDouble(testObject.stdDev); dos.writeUTF(testObject.name); dos.writeChar(testObject.delimiter); dos.flush(); byte[] hashBytes = md.digest(baos.toByteArray()); BigInteger testObjectHash = new BigInteger(hashBytes); System.out.println("Hash " + testObjectHash); } finally { dos.close(); } } private static class SomeObject { private int count = 200; private long product = 1235134123l; private double stdDev = 12343521.456d; private String name = "Test Name"; private char delimiter = '\n'; } } 

DataOutputStream allows you to write primitives and strings and output them as bytes. Wrapping a ByteArrayOutputStream in it will allow you to write to an array of bytes, which goes well with MessageDigest . You can choose any of the listed algorithms here .

Finally, BigInteger allows you to turn output bytes into an easier to use number. The MD5 and SHA1 algorithms generate 128-bit hashes, so if you need 64, you can just trim.

SHA1 should hash almost nothing good, and with infrequent collisions (this is 128-bit). This works with Java, but I'm not sure how it is implemented. It can be pretty fast. It works with several fields in my implementation: just click them on the DataOutputStream and you are good to go. You could even do this with reflection and annotations (maybe @HashComponent(order=1) to show which fields get into the hash and in what order). He got a 128-bit version, and I think you will find that he does not use as much CPU as you think.

I used this code to get hashes for huge data sets (now probably billions of objects) to wrap them in many backend stores. It should work for everything you need. Note that I think you can only call MessageDigest.getInstance() once, and then clone() from now on: IIRC cloning is much faster.

+1
Jun 03 '10 at
source share

Flip the line to get another 32-bit hash code, and then combine the two:

 String s = "astring"; long upper = ( (long) s.hashCode() ) << 32; long lower = ( (long) s.reverse().hashCode() ) - ( (long) Integer.MIN_VALUE ); long hash64 = upper + lower; 

This is pseudo code; The String.reverse() method does not exist and must be implemented in some other way.

+1
Aug 6 '14 at 7:17
source share

Are you looking at Apache commons lang ?

But for 64-bit (and 128) you need some tricks: the rules outlined in Joshua Bloch's book Effective Java help you create a 64-bit hash easily (just use long instead of int). For 128 bits you need extra hacks ...

0
Nov 02 '09 at 11:12
source share

DISCLAIMER: This solution is applicable if you want to effectively use individual words in a natural language. This is ineffective for hashing longer text or text containing non-alphabetic characters.

I don't know about the function, but here is an idea that might help:

  • Count 52 out of 64 bits to represent which letters are present in String. For example, if “a” is present, you must set bit [0], for “b” set bit 1 , for “A” bit [26]. Thus, only text containing exactly the same set of letters will have the same “signature”.

Then you could use the remaining 12 bits to encode the length of the string (or its modulo) to further reduce collisions or create a 12-bit hash code using a traditional hash function.

Assuming your input is textual, I can imagine that this would lead to very few collisions and would be inexpensive to compute (O (n)). Unlike other solutions, this approach still takes into account the problem area for reducing conflicts . It is based on the Anagram detector described in Pearl Programming (see here ).

-2
Nov 02 '09 at 10:47
source share