What is the shortest pair of lines causing MD5 collision?

How long can strings be used with MD5 as a hash without worrying about collisions?

This will presumably be calculated by generating an MD5 hash for each possible line in a particular character set with an increase in length until the hash appears a second time (collision). The maximum possible line length without collision will then be one character less than the longest of the oncoming pair.

Whether this has already been tested on MD5, SHA1, etc.

+57
math cryptography hash-collision md5
Jan 04 '10 at 14:27
source share
3 answers

Update

Ironically, a few weeks after I posted the previous answer, two Chinese researchers, Tao Xie and Dengguo Feng, published a new one-block clash for MD5 . Until now, I did not know about this paper. One MD5 block means that the input size is 64 bytes or 512 bits. Note that the inputs are basically the same, only 2 bits different .

Their methodology will not be published until January 2013, but their collision can be checked now using the numbers from the article:

>>> from array import array >>> from hashlib import md5 >>> input1 = array('I', [0x6165300e,0x87a79a55,0xf7c60bd0,0x34febd0b,0x6503cf04, 0x854f709e,0xfb0fc034,0x874c9c65,0x2f94cc40,0x15a12deb,0x5c15f4a3,0x490786bb, 0x6d658673,0xa4341f7d,0x8fd75920,0xefd18d5a]) >>> input2 = array('I', [x^y for x,y in zip(input1, [0, 0, 0, 0, 0, 1<<10, 0, 0, 0, 0, 1<<31, 0, 0, 0, 0, 0])]) >>> input1 == input2 False >>> md5(input1).hexdigest() 'cee9a457e790cf20d4bdaa6d69f01e41' >>> md5(input2).hexdigest() 'cee9a457e790cf20d4bdaa6d69f01e41' 

Update: An article was published in March 2013: Tao Xie and Fanbao Liu and Dengo Feng - Rapid Collision Attack with MD5

However, if you have more opportunities for the game, collisions of several kilobytes are calculated much faster - they can be calculated within an hour on ANY regular computer.

Old answer

In the previous shortest collision, at least two MD5 blocks were used, which are worth the input - these are 128 bytes, 1024 bits. The prefix in the first block can be arbitrarily selected by the attacker, the rest will be calculated and displayed as gibberish.

Here is an example of two different counter inputs, you can try it yourself in Python:

 >>> from binascii import unhexlify >>> from hashlib import md5 >>> input1 = 'Oded Goldreich\nOded Goldreich\nOded Goldreich\nOded Go' + unhexlify( ... 'd8050d0019bb9318924caa96dce35cb835b349e144e98c50c22cf461244a4064bf1afaecc582' ... '0d428ad38d6bec89a5ad51e29063dd79b16cf67c12978647f5af123de3acf844085cd025b956') >>> len(input1) 128 >>> md5(input1).hexdigest() 'd320b6433d8ebc1ac65711705721c2e1' >>> input2 = 'Neal Koblitz\nNeal Koblitz\nNeal Koblitz\nNeal Koblitz\n' + unhexlify( ... '75b80e0035f3d2c909af1baddce35cb835b349e144e88c50c22cf461244a40e4bf1afaecc582' ... '0d428ad38d6bec89a5ad51e29063dd79b16cf6fc11978647f5af123de3acf84408dcd025b956') >>> md5(input2).hexdigest() 'd320b6433d8ebc1ac65711705721c2e1' 

Generating these two specific inputs took 2 days in a Playstation 3 215-node cluster, Mark Stevens :)

+70
Dec 06 '10 at 23:01
source share

The mathematics of the paradox of the day makes the collision point of inflection approximately around sqrt (N), where N is the number of different bins in the hash function, so for a 128-bit hash, since you get about 64 bits, you will most likely have 1 collision. Therefore, I assume that for a full set of 8-byte strings, it may run into a collision probability, and for 9-byte strings this is very likely.

edit:. The MD5 hash algorithm is supposed to cause an I / O mapping to an output hash close to random. (compared to the one that distributes the strings more evenly over the set of possible hashes, in which case it will be closer to 16 bytes.)

Also for a more specific numerical answer, if you look at one of the approximations to calculate the probability of a collision, you will get

p (k) & asympt. 1 - e -k (k-1) / (2 * 2 128 ) where k = the size of the space of possible inputs = 2 m where the length of the input byte is the length of m bits.

a set of 8 byte strings: p (2 64 ) & asympt; 1 - e -0.5 & asympt. 0.3935

a set of 9 byte strings: p (2 72 ) & asympt; 1 - e -2 144 / (2 * 2 128 ) = 1 - e -2 15 = 1 - e -32768 & asympt . one

Also note that they assume a complete set of m / 8 byte strings. If you use only alphanumeric characters, you will need more bytes to get a likely collision.

+10
Jan 04 '10 at
source share

I doubt that there is any useful length where you will not have possible collisions. These algorithms are not actually used for this purpose. This meant striving to be unique to small changes in data (for example, damaged files), and not unique to all possible data sets.

+1
Jan 04 '10 at 14:50
source share



All Articles