I am on a personal quest to find out how the rsync algorithm works. After some reading and thinking, I came up with a situation where, I think, the algorithm fails, and I am trying to understand how a real implementation handles this.
Consider this example, where A is the receiver and B is the sender.
A = abcde1234512345fghij
B = abcde12345fghij
As you can see, the only change is that one copy of 12345 has been deleted.
Now, to make this example interesting, choose a block size of 5 bytes (characters). Hashing the blocks on the sender side with a weak checksum gives the following list of values.
abcde|12345|fghij

abcde -> 495
12345 -> 255
fghij -> 520

values = [495, 255, 520]
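For concreteness, here is a minimal sketch that reproduces these numbers. I am assuming the weak checksum is simply the sum of the byte values of the block (real rsync uses a rolling Adler-32-style checksum, but the plain sum happens to give exactly 495, 255 and 520 here):

```python
def weak_checksum(block: bytes) -> int:
    # Assumption: the weak checksum in this example is just the sum of
    # the byte values (real rsync uses a rolling Adler-32-style checksum).
    return sum(block)

B = b"abcde12345fghij"
BLOCK_SIZE = 5

values = [weak_checksum(B[i:i + BLOCK_SIZE])
          for i in range(0, len(B), BLOCK_SIZE)]
print(values)  # [495, 255, 520]
```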
Next, we walk through A and check whether each block's hash appears in the list of values. If a block matches, we can skip to the end of that block for the next check. If a block does not match, we have found a difference. I will go through this process step by step (a sketch of this loop follows the list).
- Hash the first block. Does this hash exist in the list of values?
  abcde -> 495 (yes, so skip it)
- Hash the second block. Does this hash exist in the list of values?
  12345 -> 255 (yes, so skip it)
- Hash the third block. Does this hash exist in the list of values?
  12345 -> 255 (yes, so skip it)
- Hash the fourth block. Does this hash exist in the list of values?
  fghij -> 520 (yes, so skip it)
- No more data, we are done.
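Here is a minimal sketch of the naive matching loop I have in mind, continuing the toy checksum from above. Note that it only tests membership in the value list, which is exactly the step I suspect is too weak:

```python
def weak_checksum(block: bytes) -> int:
    return sum(block)  # same toy checksum as above

BLOCK_SIZE = 5
values = [495, 255, 520]  # B's block hashes from above
A = b"abcde1234512345fghij"

offset = 0
while offset < len(A):
    block = A[offset:offset + BLOCK_SIZE]
    if weak_checksum(block) in values:
        # Hash found in the list: treat as a match and skip this block.
        print(block, "-> match, skip")
        offset += BLOCK_SIZE
    else:
        # No match: a difference is found; slide forward one byte.
        print(block, "-> difference")
        offset += 1

# Every block of A "matches", even though A != B.
```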
Since every hash was found in the list of values, we conclude that A and B are the same, which, in my humble opinion, is untrue.
It seems to me that this will happen whenever more than one block has the same hash. I know that I skipped the step of computing and checking a strong hash, but that will not change the situation, since the second and third blocks are byte-for-byte identical.
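To back up that last point: a strong hash cannot tell the two copies apart either, since identical input necessarily gives an identical digest. A quick check, using MD5 just as an example of a strong hash:

```python
import hashlib

A = b"abcde1234512345fghij"
block2 = A[5:10]    # the first  12345
block3 = A[10:15]   # the second 12345

# Identical content necessarily gives identical digests, so even a
# strong checksum sees these two blocks as equal.
print(hashlib.md5(block2).hexdigest() == hashlib.md5(block3).hexdigest())  # True
```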
What am I missing?
algorithm hash rsync
Kai