It also depends on what problem you are trying to solve. Are you asking "in this directory of N files, which ones are exact duplicates?" or "are these two files the same?"
If you are simply comparing two files, a byte-by-byte check is more efficient: it can stop at the first byte that differs, whereas hashing has to read both files in full.
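A two-file check along these lines could look like the following Python sketch. The function name `files_identical` and the 64 KiB chunk size are my own choices for illustration, and the size check up front is an extra shortcut not mentioned above.

```python
import os

def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Return True if the two files have exactly the same bytes."""
    # Files of different sizes cannot be identical, so avoid reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            # Stop at the first chunk that differs.
            if chunk_a != chunk_b:
                return False
            # Both files ended at the same point with no difference found.
            if not chunk_a:
                return True
```

Reading in chunks keeps memory use flat even for large files, and the early return means two files that differ near the start cost almost nothing to compare.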
But if you are trying to find all duplicates among N files, it is better to use an MD5 hash, because you compute and store each file's hash once and then compare these much smaller values instead of the file contents. Otherwise, you would have to read each file's byte stream again for every other file in the directory.
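As a rough sketch of the hash-once approach in Python: each file is read a single time, and files are grouped by their MD5 digest. The names `md5_of_file` and `find_duplicates` are hypothetical helpers, and the sketch only looks at regular files directly inside the directory, which is an assumption on my part.

```python
import hashlib
import os
from collections import defaultdict

def md5_of_file(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(directory):
    """Group files in `directory` (non-recursive) that share an MD5 digest."""
    by_hash = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            # Each file is hashed exactly once.
            by_hash[md5_of_file(path)].append(path)
    # Keep only digests shared by two or more files.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicates("some_dir")` returns a list of groups, each group being the paths that hashed to the same value. Equal digests make identical content extremely likely rather than certain, so a byte-by-byte pass over each group can confirm the match if that matters.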
Turbo