How much can the SHA1 hash be truncated and be sure of a unique identifier?

Question

How much can the SHA1 hash be truncated and be sure of a unique identifier?

I am making an application that stores documents and gives each UID based on the SHA1 digest of several things, including a timestamp. There are many characters in the digest, and I want users to be able to identify documents using the first x characters of the full digest. What good value for x if the number of documents can be around 10K - 100K?

+11

algorithm probability hmac sha1

dan Jan 24 '11 at 16:23

source share

5 answers

Be careful with truncation, as there is no evidence that a smaller hash is safe. See Kelsey http://csrc.nist.gov/groups/ST/hash/documents/Kelsey_Truncation.pdf . Kelsey gives heuristic arguments pointing to the same thing (“Related hash exits” and “Close collisions”). Biham / Chen offers examples of close encounters; and Knudsen demonstrates Truncated Differentials.

In the end, you probably want to submit your data to a truncated size HMAC (the size is also digested by the HMAC) and then the truncated HMAC is used.

+2

jww Dec 18 '12 at 5:20

source share

This is not really a value; part of what makes SHA a good universal hashing algorithm is that such data does not necessarily lead to similar hashed values. It’s best (without knowing anything about your system) to simply look for a list of documents whose hashes begin with a value provided by the user, or present them with a list of documents for selection or go directly to the document if there is only one.

+1

Adam Robinson Jan 24 '11 at 16:27

source share

This is a generalization of the birthday problem. In your case, n is the number of documents, and instead of the constant 365 you will have the number of possibilities that circumcision gives (so for k bits this is 2 ^k ).

Of course, an exact calculation is out of the question, but you can use approximation .

+1

vartec Jan 24 '11 at 16:30

source share

Well, the answer here may be too simplistic.

If with full sha1 you get about 1 in 2 ^ 160 chance of a collision, then truncating one character, you increase the chance of a collision by 16 (all possible values of the truncated character) ... which is 2 ^ 4. So, if you truncate x characters , you get a 1 out of 2 ^ (160 - 4 * x) chance of a collision. Right?

0

vmpstr Jan 24 '11 at 16:27

source share

bdonlan · Accepted Answer · 2011-01-24 16:33

Adapting the formulas on Wikipedia for the birthday problem , you can approximate the probability of a collision like e^(-n^2/(2^(b+1))) , where n number of documents and b is the number of bits. A graph of this formula with n = 100,000 , it looks like you will need b> 45 at least. I would be more inclined to go with 64 to make it pleasant and round. However, do you have a plan to deal with collisions if they happen (maybe slightly change the timestamp or add nonce?)

In this case, if sha1 is based not only on the content of the document, why not just make it a random ID? In this case, collisions are not a problem, since you can always generate a new random number and try again (the probability of a collision with one attempt is the same).

How much can the SHA1 hash be truncated and be sure of a unique identifier?

More articles: