CRC32 Collision

I am trying to find a collision between two posts that will result in the same CRC hash. Given that I am using CRC32, is it possible to somehow reduce the list of possible messages that I should try when conducting an attack by brute force attack?

Any links to sites with tips on this will be useful. I already have an enumeration algorithm that will do this, but it just increments integers and checks to see if it matches other hashes.

+11
source share
6 answers

It totally depends on what you mean by "message." If you can add four bytes of delirium to one of the messages. (That is, four bytes that do not make sense in the context of the message.) Then it becomes trivial in the truest sense of the word.

Thinking in terms of bits moving through a CRC32 state machine.

CRC32 is based on the Galois feedback shift register, each bit in its state will be replaced with 32 bits from the payload data. When each bit is entered, the positions indicated by the polynomial will be excluded with the sequence observed from the end of the shift register. This sequence is not affected by the input until the shift register is full.

As an example, imagine that we have a shift register filled with the initial state 10101110, polynomial 10000011 and filling with unknown bits of X.

Polynomial * ** |feedback (End of SR.) State 10101110 0 State X1010111 1 State XX101000 0 State XXX10100 0 State XXXX1010 0 State XXXXX101 1 State XXXXXX01 1 State XXXXXXX1 1 State XXXXXXXX 0 

Feedback is not in terms of X until the SR is filled out! Thus, in order to generate a message with a predetermined checksum, you take a new message, generate its CRC and process its next 32 feedback bits. You can do this in 32 steps of the CRC function. Then you need to calculate the effect of this feedback on the contents of the shift register.

To do this, you need to add four zero bytes to the message, and then look at the checksum. (The checksum is the SR state at the end, which, if padded with four zero bytes, is the effect of feedback and empty bytes.)

An exceptional OR that affects the desired checksum value, replace the four-byte trailer with this calculated value and restore the checksum. You can do this with any program that generates CRC32, a hex editor and a calculator that can handle hexadecimal values.

If you want to generate two messages that make full sense and do not contain trailing garbage, things get a little more complicated. Determine the number of sections that you can write plausible alternatives with the same length.

Using English prose as an example. “I think it might work” and “I believe in such an approach” have the same meaning and the same length in general.

Defining enough examples in your message is a tricky bit (unless you want to spoof spaces!) CRC 32 is linear if the data has the correct offset in the message. So CRC ([messagea] [padding]) ^ CRC ([padding] [messageb]) = CRC ([messagea] [messageb]) There are a few caveats with word alignment that you need to deal with as a general hint. You want to expand excerpts on the "fixed" parts of the message. Typically, you want to have alternatives for n * 1.5 passages, where n is the size of the CRC.

Now you can calculate the CRC that has the skeletal message, the impression that each alternate passage will have, and then compile a table comparing the effect that each alternative will have for each passage. Then you need to choose alternatives that will change the CRC of the skeleton to match the CRC that you want. This problem is actually quite funny to solve. First of all, find any alternatives that unambiguously change a little, if this bit needs to be changed for your CRC, select this alternative and add its effect to the CRC, and then go around again. This should reduce the space of the solution you need to look for.

This is a pretty difficult task to encode, but it will lead to collisions in a very short amount of time.

+19
source

Apart from the errors in my calculus, the probability that no collisions will be found after N trials is approximated in the following table:

  N probability
 ------- -----------
  50,000 74.7%
  77,000 50.1%
  78,000 49.2%
 102,000 29.8%
 110,000 24.5%
 128,000 14.8%
 150,000 7.3%
 200,000 0.95%

In other words, the probability of calculating more than 200,000 CRC32 values ​​before finding a duplicate is less than 1%, or the probability of finding a duplicate up to 102,000 attempts is 70.2%.
By the way, this is wonderful, because the probability of finding one collision, say, in the 200,000th attempt itself is still about 1 / 1000th of 1% ((4M - 200,0000) / 4M), but probably found one collision before the 200,000th attempt is quasi-certainty (well, in any case, above 99%). This shows an interest in maintaining the CRC database designed to date .

Of course, we could spend some time studying the CRC32 algorithm and the underlying mathematics, trying to find messages that are more likely to cause CRC32 collisions , but the relatively small number of really random attempts needed to find at least one collision with quasi-definiteness does this kind of cryptanalysis is hardly worth the effort. For example, if we assume that we can find a way to select messages that are 10 times more likely to collide with each other, we still have to try about 63,000 times before reaching a 99% probability of having at least one collision (better than 200,000 but still requires roughly the same type of application.)
The only thing we can consider in this area is to avoid messages shorter than 4 bytes (I read somewhere that CRC32 was bijective in this message space) and avoid too similar messages (that is, only one or two characters differ) since after CRC32 the initial goal is to detect (and possibly automatically correct) such small differences in messages.

Therefore, it seems that the complexity of the assignment is not so much in finding ways to calculate CRC32 at an incredible speed (although we should not be too slow with this), but rather in managing a fast-searching database of up to 200,000 messages (or "key" of the message, more on this below) and the corresponding value of CRC32.

A few ideas for realizing all this

  • You need a simple ISAM library or a better formal DBMS interface such as MySql or even SqlLite.
  • Using a pseudo-random number generator (PRNG), to create messages, we can save the message keys (i.e., everything that we pass to the PRNG to create this message), and not save the entire message. This would make inserting and searching the database more efficient, at the risk of mistakenly choosing PRNGs (or rather random numbers based on the message generator), that is, one that could generate (initially) messages that are somehow less likely to CRC32- face ...
  • It is probably better to work in batch mode, that is, produce, say, 1000 new CRCs, and then check for collisions and save them, instead of having to do all of these actions for one CRC at a time. This is especially true if we use off-the-shelf DBMS
+14
source

I assume that you mean “message” instead of “key”.

If you are allowed to choose both keys, then brute force will still be pretty fast due to the birthday paradox. Choose random messages, calculate their CRC, memorize them all and the associated CRC, and each new one has more and more chances to encounter the existing one as they accumulate. Honestly, I expect this approach to be faster on a modern computer than looking for known approaches to make CRC32 collide.

0
source

I believe CRCs are linear, so if you change (in place, without changing the length) two different parts of your file, the differences in CRC should be brought together.

- correction: everything is not so simple. However, this is still the way I would use it when trying to build a conflict - you have to follow the math in more detail than I am inclined to do tonight ...

0
source

Just yesterday, there was such a question on SO , a couple of pointers mentioned there can help you.

0
source

By brute force, you need about 6 messages of random length sqrt for a hash of size N to get a 95% chance of a collision. For example, CRC32, N = 2 ^ 32, you need about 160,000 messages.

0
source

All Articles