Same implementation of consistent hashing for Java and Python

We have an application in which the Python module will write data to redis shards, and the Java module will read data from redis shards, so I need to implement the exact same consistent hash algorithm for Java and Python to make sure the data can be found.



I googled around and tried several implementations, but found that the Java and Python implementations are always different, they cannot be used by togather. You need your help.

Edit, online implementation I tried:
Java: http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
Python: http://techspot.zzzeek.org/2012/07/07/the-absolutely-simplest-consistent-hashing-example/
http://amix.dk/blog/post/19367

Change, enable Java (using Google Guava lib) and the Python code I wrote. The code is based on the above articles.

import java.util.Collection; import java.util.SortedMap; import java.util.TreeMap; import com.google.common.hash.HashFunction; public class ConsistentHash<T> { private final HashFunction hashFunction; private final int numberOfReplicas; private final SortedMap<Long, T> circle = new TreeMap<Long, T>(); public ConsistentHash(HashFunction hashFunction, int numberOfReplicas, Collection<T> nodes) { this.hashFunction = hashFunction; this.numberOfReplicas = numberOfReplicas; for (T node : nodes) { add(node); } } public void add(T node) { for (int i = 0; i < numberOfReplicas; i++) { circle.put(hashFunction.hashString(node.toString() + i).asLong(), node); } } public void remove(T node) { for (int i = 0; i < numberOfReplicas; i++) { circle.remove(hashFunction.hashString(node.toString() + i).asLong()); } } public T get(Object key) { if (circle.isEmpty()) { return null; } long hash = hashFunction.hashString(key.toString()).asLong(); if (!circle.containsKey(hash)) { SortedMap<Long, T> tailMap = circle.tailMap(hash); hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey(); } return circle.get(hash); } } 

Test code:

  ArrayList<String> al = new ArrayList<String>(); al.add("redis1"); al.add("redis2"); al.add("redis3"); al.add("redis4"); String[] userIds = {"-84942321036308", "-76029520310209", "-68343931116147", "-54921760962352" }; HashFunction hf = Hashing.md5(); ConsistentHash<String> consistentHash = new ConsistentHash<String>(hf, 100, al); for (String userId : userIds) { System.out.println(consistentHash.get(userId)); } 

Python Code:

 import bisect import md5 class ConsistentHashRing(object): """Implement a consistent hashing ring.""" def __init__(self, replicas=100): """Create a new ConsistentHashRing. :param replicas: number of replicas. """ self.replicas = replicas self._keys = [] self._nodes = {} def _hash(self, key): """Given a string key, return a hash value.""" return long(md5.md5(key).hexdigest(), 16) def _repl_iterator(self, nodename): """Given a node name, return an iterable of replica hashes.""" return (self._hash("%s%s" % (nodename, i)) for i in xrange(self.replicas)) def __setitem__(self, nodename, node): """Add a node, given its name. The given nodename is hashed among the number of replicas. """ for hash_ in self._repl_iterator(nodename): if hash_ in self._nodes: raise ValueError("Node name %r is " "already present" % nodename) self._nodes[hash_] = node bisect.insort(self._keys, hash_) def __delitem__(self, nodename): """Remove a node, given its name.""" for hash_ in self._repl_iterator(nodename): # will raise KeyError for nonexistent node name del self._nodes[hash_] index = bisect.bisect_left(self._keys, hash_) del self._keys[index] def __getitem__(self, key): """Return a node, given a key. The node replica with a hash value nearest but not less than that of the given name is returned. If the hash of the given name is greater than the greatest hash, returns the lowest hashed node. """ hash_ = self._hash(key) start = bisect.bisect(self._keys, hash_) if start == len(self._keys): start = 0 return self._nodes[self._keys[start]] 

Test code:

 import ConsistentHashRing if __name__ == '__main__': server_infos = ["redis1", "redis2", "redis3", "redis4"]; hash_ring = ConsistentHashRing() test_keys = ["-84942321036308", "-76029520310209", "-68343931116147", "-54921760962352", "-53401599829545" ]; for server in server_infos: hash_ring[server] = server for key in test_keys: print str(hash_ring[key]) 
+7
source share
7 answers

It seems that you are facing two problems at the same time: encoding problems and presentation problems.

Encoding problems arise, especially if you use Python 2 - the Python 2 str type is not at all similar to the Java String type and actually looks more like a Java byte array. But Java String.getBytes() not guaranteed to give you an array of bytes with the same contents as Python str (they probably use compatible encodings, but are not guaranteed - even if this fix does not change things, it’s a good idea in general to avoid problems in the future).

So, the way around this is to use the Python type, which behaves like a Java String , and convert the corresponding objects from both languages ​​into bytes with the same encoding. On the Python side, this means that you want to use the unicode type, which is the default string literal type if you are using Python 3, or put it at the top of your .py file:

 from __future__ import unicode_literals 

If none of them is an option, specify your string literals as follows:

 u'text' 

u at the front makes him work in unicode. This can then be converted to bytes using its encode method, which (surprisingly) encodes:

 u'text'.encode('utf-8') 

On the Java side, there is an overloaded version of String.getBytes that accepts the encoding, but it takes it as java.nio.Charset and not a string - so you want to do:

 "text".getBytes(java.nio.charset.Charset.forName("UTF-8")) 

This will give you equivalent byte sequences in both languages, so the hashes will have the same input and give you the same answer.

Another problem that may arise is the presentation, depending on the hash function used. Python hashlib (which is the preferred implementation of md5 and other cryptographic hashes with Python 2.5) is exactly compatible with Java MessageDigest in this - they both give bytes, so their output should be equivalent.

Python zlib.crc32 and Java java.util.zip.CRC32 , on the other hand, both give numerical results, but Java is always an unsigned 64-bit number, while Python (in Python 2) is a 32-bit signed number (in Python 3, now it's a 32-bit unsigned number, so this problem goes away). To convert a signed result to unsigned, run: result & 0xffffffff , and the result should be comparable to Java.

+7
source

According to this hash analysis :

Murmur2, Meiyan, SBox and CRC32 provide good performance for all kinds of keys. They can be recommended as universal hash functions on x86.

Hardware-accelerated CRC (indicated by iSCSI CRC in the table) is the fastest hash function on the latest Core i5 / i7 processors. However, the CRC32 instruction is not supported by AMD and earlier Intel processors.

Python has zlib.crc32 , and Java has a CRC32 class . Since this is a standard algorithm, you should get the same result in both languages.

MurmurHash 3 is available on Google Guava (a very useful Java library) and pyfasthash for Python.

Please note that these are not cryptographic hash functions , therefore they are fast, but do not give the same guarantees. If these hashes are important for security, use a cryptographic hash.

+3
source

Different language implementations of the hash algorithm do not make the hash value different. The SHA-1 hash generated in java or python will be the same.

+2
source

I am not familiar with Redis, but the Python example looks like hashing keys, so I assume we are talking about some kind of HashMap implementation.

The example in your python example uses MD5 hashes, which will be the same in both Java and Python.

Here is an example of MD5 hashing in Java:

http://www.dzone.com/snippets/get-md5-hash-few-lines-java

And in Python:

http://docs.python.org/library/md5.html

Now you can find a faster hash algorithm. MD5 is focused on cryptographic security, which is not needed in this case.

+2
source

Let me get it straight: the same binary input to the same hash function (SHA-1, MD5, ...) in different environments / implementations (Python, Java, ...) give the same binary output. This is because these hash functions are implemented in accordance with standards .

Therefore, you will find the sources of problems that you encounter when answering these questions:

  • Do you provide the same binary input for both hash functions (e.g. MD5 in Python and Java)?

  • Do you interpret the binary output of both hash functions (e.g. MD5 in Python and Java)?

@lvc answer provides much more detailed information on these issues.

+1
source

Here is a simple hash function that gives the same results for both python and java keys:

Python

 def hash(key): h = 0 for c in key: h = ((h*37) + ord(c)) & 0xFFFFFFFF return h; 

Java

 public static int hash(String key) { int h = 0; for (char c : key.toCharArray()) h = (h * 37 + c) & 0xFFFFFFFF; return h; } 

You do not need a cryptographically secure hash for this. This is just superfluous.

+1
source

In the java version, I would recommend using MD5, which generates a 128-bit string result and then can be converted to BigInteger (Integer and Long are not enough to store 128-bit data).

Sample code here:

 private static class HashFunc { static MessageDigest md5; static { try { md5 = MessageDigest.getInstance("MD5"); } catch (NoSuchAlgorithmException e) { // } } public synchronized int hash(String s) { md5.update(StandardCharsets.UTF_8.encode(s)); return new BigInteger(1, md5.digest()).intValue(); } } 

Note that:

java.math.BigInteger.intValue () converts this BigInteger to int. This conversion is similar to narrowing a primitive conversion from long to int. If this BigInteger is too large to fit int , only 32 low order bits are returned. This conversion can lose information about the total value of the BigInteger value, and also return the result with the opposite sign.

0
source

All Articles