To check if two image files are the same: checksum or hash?

I am writing image processing code where I load images (as BufferedImage) from URLs and pass them to an image processor.

I want to avoid passing the same image to the image processor more than once (since the image processing operation is expensive). Two URLs pointing to the same image may differ, so I cannot detect duplicates from the URL alone. I therefore planned to compute a checksum or hash to determine whether the same image has been encountered before.

For MD5 I tried Fast MD5, and it generated a 20K+ character hex checksum for a sample image. Obviously, storing this 20K+ character hash will be a problem when it comes to database storage. So I tried CRC32 (from java.util.zip.CRC32), and it generated a much shorter value than the hash.

I understand that checksums and hashes serve different purposes. For the purpose described above, can I just use CRC32? Will that achieve the goal, or should I try something other than these two?
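For reference, this is roughly what I am computing; a minimal sketch using the JDK's built-in MessageDigest and java.util.zip.CRC32 rather than the Fast MD5 library, with the raw image bytes assumed to be already fetched from the URL:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.zip.CRC32;

    public class ImageDigests {

        // An MD5 digest is always 16 bytes, i.e. 32 hex characters.
        static String md5Hex(byte[] imageBytes) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder(32);
            for (byte b : md.digest(imageBytes)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }

        // A CRC32 checksum is a 32-bit value, i.e. at most 8 hex characters.
        static long crc32(byte[] imageBytes) {
            CRC32 crc = new CRC32();
            crc.update(imageBytes);
            return crc.getValue();
        }
    }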

Thanks Abi

+8
java integrity image-processing hash checksum
3 answers

The difference between CRC and, say, MD5 is that it is harder to fake a file to match a "target" MD5 than to fake one to match a "target" checksum. Since that does not seem to be a problem for your program, it does not matter which method you use. MD5 may be a bit more CPU intensive, but I don't know whether that difference will matter for you.

The main consideration should be the number of bytes in the digest.

If you use an integer-sized (32-bit) checksum, that means that for a 2048-bit file you are mapping 2^2048 possible contents onto 2^32 checksum values → for each CRC value there are 2^2016 possible files that map to it. With a 128-bit MD5 you still have 2^1920 possible colliding files per value.

The larger the code you compute, the smaller the chance of a collision (assuming the computed codes are evenly distributed), and therefore the safer the comparison.

In any case, to minimize possible mistakes, I think the first classification should use the file size: first compare the file sizes, and only if they match compare the checksums/hashes, as in the sketch below.
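A minimal sketch of that two-stage check (hypothetical helper names; assumes the files fit in memory and that CRC32 is the chosen checksum):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.CRC32;

    public class TwoStageCompare {

        // Stage 1 compares sizes (a cheap metadata lookup); stage 2 reads both
        // files and compares CRC32 checksums only when the sizes match.
        static boolean probablySame(Path a, Path b) throws IOException {
            if (Files.size(a) != Files.size(b)) {
                return false; // different sizes: definitely different files
            }
            return crc32(a) == crc32(b); // equal CRCs: very likely the same file
        }

        static long crc32(Path p) throws IOException {
            CRC32 crc = new CRC32();
            crc.update(Files.readAllBytes(p));
            return crc.getValue();
        }
    }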

+5

A checksum and a hash are basically the same thing for this purpose. You should be fine computing any hash; a plain MD5 is usually enough. If you want, you can store the file size together with the MD5 hash (which is 16 bytes).

If two files have different sizes, they are different files, and you do not even need to calculate a hash of the data. If duplicate files are unlikely in your case and the files are rather large (for example, JPG images taken with a camera), this optimization can save you a lot of time.

If two or more files are the same size, you can calculate the hashes and compare them.

If the two hashes are the same, you can compare the actual data to make sure the files really are identical. A hash collision is very, very unlikely, but theoretically possible. The larger your hash (MD5 is 16 bytes, CRC32 is only 4), the less likely it is that two different files will have the same value. It only takes 10 minutes of programming to add this extra check, so I would say: better safe than sorry. :)

As a further optimization, if exactly two files have the same size, you can simply compare their data directly. You would have to read both files anyway to calculate their hashes, so why not compare them directly if they are the only two with that particular size. A sketch combining all of this follows below.
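Putting the three stages together, a sketch along these lines (hypothetical names; reads both files fully into memory for simplicity):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    public class DuplicateCheck {

        // Returns true only if the two files are identical byte for byte.
        static boolean sameFile(Path a, Path b)
                throws IOException, NoSuchAlgorithmException {
            // Stage 1: size. Different sizes mean different files,
            // with no need to read the contents at all.
            if (Files.size(a) != Files.size(b)) {
                return false;
            }
            byte[] bytesA = Files.readAllBytes(a);
            byte[] bytesB = Files.readAllBytes(b);

            // Stage 2: hash. Different MD5s mean different files.
            MessageDigest md = MessageDigest.getInstance("MD5");
            if (!Arrays.equals(md.digest(bytesA), md.digest(bytesB))) {
                return false;
            }
            // Stage 3: byte-for-byte check for the (theoretical) collision.
            return Arrays.equals(bytesA, bytesB);
        }
    }

In the case described above, where exactly two files share a size, you would skip stage 2 and go straight from the size check to the byte comparison.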

+1

You can use BufferedImage.equals() to compare two buffered images, and for simplicity you can use BufferedImage.hashCode() to get a hash of the image; it's quick and easy.

-3
