Hash code and checksum - what's the difference?

My understanding is that a hash code and a checksum are similar things - a numerical value computed for a data block that is relatively unique.

i.e. The probability of two data blocks giving the same numerical hash / checksum value is low enough that it can be ignored for application purposes.

Do we have two words for the same thing, or are there important differences between hash codes and checksums?

+78
language-agnostic computer-science hash checksum
Jan 20 '09 at 9:28
source share
9 answers

I would say that checksum is necessarily a hash code. However, not all hash codes do good checksums.

The checksum has a special purpose - it checks or checks the integrity of the data (some may go beyond this, allowing the correction of errors ). Good checksums are easy to compute and can detect many types of data failures (for example, one, two, three error bits).

A hash code simply describes a mathematical function that maps data to a certain value. When used as an indexing tool in data structures (for example, a hash table), a low probability of collision is required.

+45
Jan 20 '09 at 9:31
source share

For each of them there is another purpose:

  • Hash code - designed for randomness in your domain (to minimize conflicts in hash tables, etc.). Cryptographic hash codes are also designed to work flawlessly in the reverse order.
  • Checksum - designed to detect the most common errors in the data and is often quickly calculated (to effectively check fast data streams).

In practice, the same functions are often good for both purposes. In particular, a cryptographically strong hash code is a good checksum (it is almost impossible for a random error to violate a strong hash function) if you can afford the computational cost.

+32
Jan 20 '09 at 9:43
source share

There are indeed some differences:

  • The checksums should be different if the input is different (as often as possible), but it is almost as important as it is quickly calculated.
  • Hash codes (for use in hash tables) have the same requirements, and additionally they must be evenly distributed over the code space, especially for inputs similar to.
  • Cryptographic hashes have much more stringent requirements that specify a hash; you cannot create an input that creates this hash. The computation time takes second place, and depending on the application, it may be desirable that the hash be very slow to compute (to combat brute force attacks).
+18
Jan 20 '09 at 9:46
source share

Wikipedia puts it well:

Checksum functions are associated with a hash of a function, fingerprints, randomization of a function, and a cryptographic hash of a function. However, each of these concepts has different applications and, therefore, different design goals. Check digits and parity bits are special cases of checksums, suitable for small data blocks (for example, social security numbers, bank account numbers, computer words, single bytes, etc.). Some error correction codes are based on special checksums that not only detect common errors, but also allow the original data to be restored in some cases.

+8
Jan 20 '09 at 9:32
source share

Hash codes and checksums are used to create a short numeric value from a data item. The difference is that the value of the checksum must change, even if a small modification has been made to the data element. For a hash value, the requirement is only that real-world data elements must have different hash values.

A clear example is strings. The checksum for the string must include every bit and order. On the other hand, a hash code can often be implemented as a checksum of a prefix of limited length. This would mean that "aaaaaaaaaaba" would hash in the same way as "aaaaaaaaaaaab", but hashing algorithms could deal with such collisions.

+4
Jan 20 '09 at 9:46
source share

The checksum protects against accidental changes.

A cryptographic hash protects against a very motivated attacker.

When you send bits on a wire, it may happen that some bits are either upside down, deleted, or inserted. To allow the receiver to detect (or sometimes correct) accidents like this, the sender uses a checksum.

But if you assume that someone is actively and intelligently modifying the message on the wire, and you want to protect it from this type of attacker, then use a cryptographic hash (I ignore the cryptographic hashing of the hash or the use of a secondary channel or such, since the question does not seem to be eludes this).

+3
Nov 20 '14 at 20:14
source share

These are interchangeable these days, but in the daytime the checksum was a very simple technique where you added all the data up (usually in bytes) and anchored the bytes at the end with this value in .. then you hope to find out if any either from the source data. Like a check bit, but with bytes.

+2
Jan 20 '09 at 9:33
source share

I try to use a word checksum when referring to a code (numeric or otherwise) created for a file or a piece of data that can be used to verify that the file or data has not been corrupted. The most common use that I encounter is to check that files sent over the network have not been modified (intentionally or otherwise).

+1
Jan 20 '09 at 9:38
source share

The difference between a hash code and checksum functions is that they are designed for different purposes.

  • The checksum is used to determine if something at the input has been changed.

  • The hash code is used to determine if something in the input has changed and to have as much "distance" between the individual values ​​of the hash code as possible.

    In addition, there may be additional requirements for the hash function as opposed to this rule, for example, the possibility of early generation of trees / clusters / hash code value codes.

    And if you add some general early randomization, you will move on to the concept of modern encryption / key exchange.




About probability:

For example, suppose the input actually always changes (100% of the time). And let's assume that you have a “perfect” hash / checksum function that generates a 1-bit hash / checksum value. This way you get different hashes / checksums, 50% of the time, for random input.

  • If exactly 1 bit in your random input has changed, you can detect this 100% of the time, no matter how large the input is.

  • If 2 bits in your random input have changed, your probability of detecting a “change” is divided by 2 because both changes can cancel each other out and no hash / checksum function will detect that 2 bits actually differ in the input.

    ...

This means that if the number of bits in your input is several times larger than the number of bits in your hash / checksum value, the probability of actually getting different hash / checksum values ​​for different input values ​​is reduced and is not constant .

+1
Feb 27 '14 at 4:35
source share



All Articles