How likely are MD5 checksum false positives?

I have a client that distributes large binaries internally. They also transmit MD5 checksums of the files and apparently verify each file against its checksum before using it as part of their workflow.

However, they claim that "often" they encounter corrupted files for which the MD5 checksum still says the file is good.

Everything I have read suggests that this should be extremely unlikely.

Is that plausible? Would another hashing algorithm provide better results? Should I really be looking at process problems, such as them claiming to verify the checksum but not actually doing it?

NB: I do not yet know what "often" means in this context. They process hundreds of files per day, so I do not know if this is a daily, monthly, or annual event.

+6
md5 checksum
5 answers

MD5 is a 128-bit cryptographic hash function, so distinct messages should be distributed fairly evenly over that 128-bit space. This means that two files (excluding files deliberately crafted to produce an MD5 collision) should have about a 1 in 2^128 chance of colliding. In other words, if you had compared a pair of files every nanosecond since the beginning of the universe, you would almost certainly not have seen it happen yet.
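
As a rough sanity check (my own Python sketch, not part of the original answer), the birthday approximation p ≈ n²/2¹²⁹ shows how far even a heavy workload is from a random collision:

    # Birthday approximation for the chance that at least two of n
    # random b-bit hashes collide: p ~ n^2 / 2^(b+1).
    def collision_probability(n: int, bits: int = 128) -> float:
        return (n * n) / float(2 ** (bits + 1))

    # A client hashing 1,000 files a day for 100 years:
    n = 1000 * 365 * 100
    print(collision_probability(n))  # ~2e-24, effectively zero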

+9

If the file is corrupted, the probability that the corrupted file has the same MD5 checksum as the original file is 1 in 2^128. In other words, it will happen almost as often as never. It is astronomically more likely that your client is misreporting what actually happened (for example, they are computing the wrong hash).
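
For what it's worth, here is a minimal Python sketch (my illustration, not the answer's) of computing the hash correctly, since hashing in text mode or only partially reading a large file are classic ways to end up with the wrong hash:

    import hashlib

    # Stream the file in chunks so multi-gigabyte binaries never have to
    # fit in memory; open in binary mode so no newline translation occurs.
    def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify(path: str, expected_hex: str) -> bool:
        return md5_of_file(path) == expected_hex.strip().lower()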

+5

It sounds like a bug in how MD5 is being used (perhaps they are hashing the wrong files) or a bug in the library being used. For example, an older MD5 program I once used did not handle files larger than 2 GB.

This question suggests that, on average, you would get one collision every 100 years if you generated 6 billion files per second, which is why this is so unlikely.
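
That figure is consistent with the birthday bound, as a quick back-of-the-envelope check (my own arithmetic, not the linked question's) shows:

    import math

    # 6 billion files per second for 100 years:
    files = 6e9 * 60 * 60 * 24 * 365.25 * 100
    print(f"files hashed:   {files:.2e}")                # ~1.89e+19

    # Birthday bound: ~50% chance of some collision after sqrt(2^128) hashes.
    print(f"birthday bound: {math.sqrt(2 ** 128):.2e}")  # ~1.84e+19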

+4

Is it plausible?

No. The probability of accidental damage producing the same checksum is 1 in 2^128, or 1 in 3.40 × 10^38. That number puts one-in-a-billion (10^9) odds to shame.

Would another hashing algorithm provide better results?

Probably not. While MD5 has been broken with respect to collision resistance against deliberate attack, it holds up fine against accidental corruption, and it is a popular standard.
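
If you want to rule the algorithm out anyway, swapping it is cheap. A sketch using Python's hashlib, where the algorithm is just a parameter (the streaming loop is the same as in the earlier example):

    import hashlib

    def file_digest(path: str, algorithm: str = "sha256") -> str:
        h = hashlib.new(algorithm)  # "md5", "sha256", "sha512", ...
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()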

Should I really be looking at process problems, such as them claiming to verify the checksum but not actually doing it?

Perhaps, but consider all the possible problems:

  • The file is corrupted before the MD5 is generated.
  • The file is corrupted after the MD5 check.
  • There is a bug in the MD5 program or its supporting infrastructure.
  • Operator error (unintentional, e.g. running the MD5 program on the wrong file).
  • Operator misconduct (intentional, e.g. skipping the verification step).

If it is the last one, then one final thought is to distribute the files in a wrapper format that forces the operator to unwrap the file, where the unwrapping performs the integrity check during extraction. I am thinking of something like gzip or 7-Zip, which support large files, possibly with compression disabled (I don't know what they do).
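
As a Python sketch of that wrapper idea (my own illustration; gzip's built-in CRC-32 is a weaker check than MD5, but it fails loudly and automatically on corruption):

    import gzip
    import shutil

    # compresslevel=0 wraps the payload without really compressing it,
    # while gzip still records a CRC-32 of the uncompressed data.
    def wrap(src: str, dst: str) -> None:
        with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=0) as fout:
            shutil.copyfileobj(fin, fout)

    # Reading to EOF verifies the stored CRC-32 and length; a corrupted
    # archive raises gzip.BadGzipFile / OSError instead of silently passing.
    def unwrap(src: str, dst: str) -> None:
        with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)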

+3

There are all kinds of reasons why binary files either won't get distributed at all or, if they do, arrive corrupted (firewalls, size limits, virus insertions, etc.). You should always encrypt files (even weak encryption is better than none) when sending binaries, to help protect data integrity.

0
