Merkle Tree False Positives Data Sync

Merkle trees (aka hash trees) are used to synchronize data in both Kassandra and Dynamo.

Like any hash function, there is a possibility that different data may have the same hash value:

There is x and y, where [y! = X], but [hash (x) = hash (y)]

As "big data" grows in NOSQL, the likelihood of collision with such data becomes higher.

This means that as the data arrays grow larger, it is almost certain that different nodes in the Merkle tree will have the same parent hash.

In this case, when two different machines in the cluster intersect their merkle trees, they will get a false idea that their data is consistent. If no more data is written to this tree branch, the machines will remain unsynchronized forever.

How is this handled?

+7
source share
1 answer

Most systems cannot handle this. What for? Because the probability of having two different inputs that have the same hash value is very, very low. With a good hash function (which I think you are using), this should come close to 1/2 ^ {hash bits}. And since most hashes for these purposes have a length of at least 128 bits, you get a probability of 1/2 ^ 128 of such a collision. Which is about 2.9387359e-39 (0. {38 zeros} 29387359).

Using a hash of 160 bits (which is used by most of these systems, SHA-1 hashes) is good enough if you have as many objects in your database as there are grains of sand in the world. That you are still less than 1/2 likely that there will be such a collision. Thus, I would not worry about a collision event. The likelihood of this happening is really too low.

+6
source

All Articles