Hash code calculation for large file in parallel

I would like to improve the hashing performance of large files, for example, tens of gigabytes in size.

Usually you hash the bytes of a file sequentially with a hash function (say SHA-256, although I will most likely use Skein, so the hashing will be slow compared to the time it takes to read the file from a [fast] SSD). Let me call this method 1.
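
To make this concrete, here is a minimal sketch of what I mean by method 1, in Python with hashlib's SHA-256 standing in for Skein (the function name and chunk size are just my illustration):

```python
import hashlib

def hash_file_sequential(path, chunk_size=1 << 20):
    """Method 1: feed the file, chunk by chunk, into a single hash state."""
    h = hashlib.sha256()  # stand-in for Skein, which hashlib does not provide
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
```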

The idea is to hash multiple 1 MB blocks of the file in parallel on 8 CPUs, and then hash the concatenation of those block hashes into one final hash. Let me call this method 2.

The image depicting this method follows:

(Diagram: the file is split into 1 MB blocks, the blocks are hashed in parallel, and the concatenated block hashes are hashed once more into a single final hash.)

I would like to know whether this idea is sound and how much "security" is lost (in terms of collisions becoming more likely) compared to making one hash over the entire file.

For example:

Let us use the SHA-256 variant of SHA-2 and set the file size to 2^35 = 34,359,738,368 bytes. Using a simple single pass (method 1), I would get one 256-bit hash for the whole file.

Compare this to:

Using parallel hashing (i.e., method 2), I would split the file into 32,768 blocks of 1 MB each, hash those blocks with SHA-256 into 32,768 hashes of 256 bits (32 bytes) each, concatenate the hashes, and run a final SHA-256 over the combined 1,048,576-byte dataset to get my final 256-bit hash for the whole file.
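
A rough sketch of method 2, again with SHA-256 and a Python process pool standing in for my real setup (the 8-worker default and the helper names are only illustrative):

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

BLOCK_SIZE = 1 << 20  # 1 MB blocks, as in the example above

def _hash_block(task):
    path, offset = task
    # Each worker re-opens the file and hashes one 1 MB block on its own.
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha256(f.read(BLOCK_SIZE)).digest()

def hash_file_parallel(path, workers=8):
    """Method 2: hash 1 MB blocks in parallel, then hash the concatenated digests."""
    size = os.path.getsize(path)
    tasks = [(path, offset) for offset in range(0, size, BLOCK_SIZE)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        block_hashes = list(pool.map(_hash_block, tasks))  # results keep block order
    # Second pass: one SHA-256 over the concatenation of all 32-byte block digests.
    return hashlib.sha256(b"".join(block_hashes)).hexdigest()
```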

Is method 2 any less secure than method 1 in terms of potential collisions and/or their probability? Perhaps I should rephrase the question as follows: does method 2 make it easier for an attacker to create a file that hashes to the same value as the original file, apart from the trivial fact that a brute-force attack becomes cheaper because the hash can be computed in parallel on N CPUs?

Update: I just found out that my construction in method 2 is very similar to the concept of a hash list. However, the Wikipedia article linked in the previous sentence does not go into detail about whether a hash list is better or worse than method 1 (plain old hashing of the whole file) with respect to the likelihood of collisions, when only the top hash of the hash list is used.

+7
3 answers

Block-based hashing (your method 2) is a well-known approach that is used in practice, for example in hash lists and hash trees.

Like your construction, these schemes take the list of block hashes and hash it again, reducing everything to one short hash. Since this is well-established practice, I would assume it is about as safe as sequential hashing.

+7

Some modern hash designs are built to run in parallel. See, for example, "An Efficient Parallel Algorithm for Skein Hash Functions". If you are willing to use a newer (and therefore less thoroughly tested) hash algorithm, this may give you the speed increase you want on a multiprocessor machine.

Skein made it to the final round of the NIST SHA-3 competition, so it is not completely untested.

+4

I think it would be much easier for an attacker to find a collision, because the time required to compute a hash is a function of the size of the data being hashed. One of the great features of cryptographically secure hashes is that an attacker cannot take your 100 GB file, find the spot they want to change, hash everything before and after that block, and then use those pre-computed values to quickly obtain the hash of the whole file after each small, quick permutation of the bits they care about. This is because the hash state is chained across the whole input, so a change forces re-hashing everything downstream of it.

In short, if you edit the middle of the file, you still need to hash the whole file to get the final checksum. So finding a collision against a 100 GB file takes much longer than against a 100-byte file. The exception is an edit at the very end of the file, which is why in-the-wild collision examples for large files often put the colliding bytes there.

However, if you break the source file into blocks, the speed of the attack becomes a function of the smallest block (or of whichever block you want to change). Since hashing time grows linearly with the amount of data, a 100 GB file takes roughly 2,000 seconds per permutation/MD5, while a 1 MB block lets an attacker try about 50 permutations per second.
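
For concreteness, those figures correspond to a hashing throughput of roughly 50 MB/s (my assumption, back-derived from the 2,000-second and 50-per-second numbers above):

```python
THROUGHPUT = 50 * 10**6   # assumed ~50 MB/s single-threaded MD5 throughput
FILE_SIZE  = 100 * 10**9  # 100 GB file
BLOCK_SIZE = 1 * 10**6    # 1 MB block

print(FILE_SIZE / THROUGHPUT)   # ~2000 seconds per attempt against the whole file
print(THROUGHPUT / BLOCK_SIZE)  # ~50 attempts per second against a single block
```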

The solution is to split the file into overlapping chunks and then MD5 those chunks individually. The resulting hash would be the concatenation of the chunk hashes, covering the file from beginning to end. Now finding a collision requires the whole file to be hashed, albeit in a parallelizable way.
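
Here is a minimal sketch of what I mean, assuming (purely for illustration) 1 MB chunks where each chunk overlaps the next by half:

```python
import hashlib

CHUNK = 1 << 20     # 1 MB chunks (illustrative choice)
STEP  = CHUNK // 2  # start a new chunk every 512 KB, so neighbours overlap by half

def hash_overlapping(path):
    digests = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            f.seek(offset)
            data = f.read(CHUNK)
            if not data:
                break
            digests.append(hashlib.md5(data).digest())  # MD5 per chunk, as suggested above
            offset += STEP
    # Final value: hash of the concatenated chunk digests, in file order.
    return hashlib.md5(b"".join(digests)).hexdigest()
```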

0
