I am on a personal quest to find out how the rsync algorithm works. After some reading and thinking, I came up with a situation where, I think, the algorithm fails, and I am trying to understand how a real implementation handles this.
Consider this example, where A is the receiver and B is the sender.
A = abcde1234512345fghij
B = abcde12345fghij
As you can see, the only change is that one copy of 12345 has been deleted.
Now, to make this example interesting, choose a block size of 5 bytes (characters). Hashing the blocks on the sender side with a weak checksum gives the following list of values.
abcde|12345|fghij

abcde -> 495
12345 -> 255
fghij -> 520

values = [495, 255, 520]
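For concreteness, here is a minimal sketch that reproduces these numbers. I am assuming the weak checksum is simply the sum of the byte values of the block (real rsync uses a rolling Adler-32-style checksum, but the plain sum happens to give exactly 495, 255 and 520 here):

```python
def weak_checksum(block: bytes) -> int:
    # Assumption: the weak checksum in this example is just the sum of
    # the byte values (real rsync uses a rolling Adler-32-style checksum).
    return sum(block)

B = b"abcde12345fghij"
BLOCK_SIZE = 5

values = [weak_checksum(B[i:i + BLOCK_SIZE])
          for i in range(0, len(B), BLOCK_SIZE)]
print(values)  # [495, 255, 520]
```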
Next, we walk through A and check whether each block's hash appears in the list of values. If a block matches, we can skip to the end of that block for the next check. If a block does not match, we have found a difference. I will go through this process step by step (a sketch of this loop follows the list).
- Hash the first block. Does this hash exist in the list of values?
  abcde -> 495 (yes, so skip it)
- Hash the second block. Does this hash exist in the list of values?
  12345 -> 255 (yes, so skip it)
- Hash the third block. Does this hash exist in the list of values?
  12345 -> 255 (yes, so skip it)
- Hash the fourth block. Does this hash exist in the list of values?
  fghij -> 520 (yes, so skip it)
- No more data, we are done.
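Here is a minimal sketch of the naive matching loop I have in mind, continuing the toy checksum from above. Note that it only tests membership in the value list, which is exactly the step I suspect is too weak:

```python
def weak_checksum(block: bytes) -> int:
    return sum(block)  # same toy checksum as above

BLOCK_SIZE = 5
values = [495, 255, 520]  # B's block hashes from above
A = b"abcde1234512345fghij"

offset = 0
while offset < len(A):
    block = A[offset:offset + BLOCK_SIZE]
    if weak_checksum(block) in values:
        # Hash found in the list: treat as a match and skip this block.
        print(block, "-> match, skip")
        offset += BLOCK_SIZE
    else:
        # No match: a difference is found; slide forward one byte.
        print(block, "-> difference")
        offset += 1

# Every block of A "matches", even though A != B.
```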
Since every hash was found in the list of values, we conclude that A and B are the same, which, in my humble opinion, is untrue.
It seems to me that this will happen whenever more than one block has the same hash. I know that I skipped the step of computing and checking a strong hash, but that will not change the situation, since the second and third blocks are byte-for-byte identical.
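To back up that last point: a strong hash cannot tell the two copies apart either, since identical input necessarily gives an identical digest. A quick check, using MD5 just as an example of a strong hash:

```python
import hashlib

A = b"abcde1234512345fghij"
block2 = A[5:10]    # the first  12345
block3 = A[10:15]   # the second 12345

# Identical content necessarily gives identical digests, so even a
# strong checksum sees these two blocks as equal.
print(hashlib.md5(block2).hexdigest() == hashlib.md5(block3).hexdigest())  # True
```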
What am I missing?
algorithm hash rsync
Kai