Benchmark: CRC or Hash?

Performance and security deviations and accepting a hash function with a perfect avalanche effect that I should use for data checksums: CRC32 or hash truncated to N bytes? That is, who will be less likely to miss an error? In particular:

  • CRC32 vs 4-byte hash
  • CRC32 versus 8-byte hash
  • CRC64 vs 8-byte hash

Data blocks must be transmitted over the network and stored on disk again. Blocks can have a size from 1 KB to 1 GB.

As far as I understand, CRC32 can detect up to 32 bit flip with 100% reliability, but after that its reliability approaches 1-2^(-32) , and for some models it is much worse. The ideal 4-byte hash is always 1-2^(-32) , so go on to the picture.

An 8-byte hash should have much better overall reliability ( 2^(-64) chance to skip the error), so should this be preferable to CRC32? What about CRC64?

I think the answer depends on the type of error that might be expected in such an operation. Perhaps we will see rare 1-bit flips or massive corruption? In addition, given that most storage devices and network equipment implement some kind of CRC, shouldn't you take care of random bit flip?

+8
hash checksum crc32
source share
2 answers

Only you can tell whether 1-2 -32 is good enough or not for your application. The error detection efficiency between CRC-n and n bits from a good hash function will be very close to the same, so choose the one that is faster. This is probably CRC-n.

Update:

The above β€œThis is probably CRC-n,” it is somewhat likely. This is not the case if very high hash functions are used. In particular, CityHash looks almost as fast as the CRC-32, calculated using Intel crc32 hardware instructions! I tested three CityHash procedures and an Intel crc32 instruction in a 434 MB file. The crc32 instruction version (which calculates the CRC-32C) takes 24 ms of processor time. CityHash64 took 55 ms, CityHash128 60 ms and CityHashCrc128 50 ms. CityHashCrc128 uses the same hardware instruction, although it does not calculate CRC.

To quickly compute CRC-32C calculations, I needed to come up with three crc32 commands for three separate buffers in order to use three arithmetic logic blocks in parallel in one core, and then write the inner loop in assembler. CityHash is pretty damned fast. Unless you have a crc32 instruction, it will be difficult for you to compute a 32-bit CRC as fast as CityHash64 or CityHash128.

Note, however, that CityHash functions will need to be changed for this purpose, or arbitrary selection will be required to determine the consistent value of the CityHash value for large data streams. The reason is that these functions are not configured to receive buffered data, i.e. At the same time, they load the functions and expect to get the same result, as if the entire data set was immediately supplied to the function. CityHash functions must be changed to update an intermediate state.

An alternative and what I did for quick and dirty testing is to use the Seed versions for functions in which I would use CityHash from the previous buffer as a seed for the next buffer. The problem is that the result depends on the size of the buffer. If you load buffers with different sizes of CityHash with this approach, you get different hash values.

Another update four years later:

The xxhash family is even faster. I would recommend this for CRC for a non-critical hash.

+12
source share

Refusal of problems with "efficiency"; you may want to use one of the SHA-2 functions (e.g. SHA-256).

+1
source share

All Articles