OK to use only 64-bit sha1 hash as identifier?

Question

OK to use only 64-bit sha1 hash as identifier?

1) For really low hash collisions, can I get away just by using half 128 bits of sha1 and not having to deal with sha1 itself? I understand that this is not suitable for cryptographic hashes, but I just need hashes for the keys of the hash table.

2) Calculation time is not a priority, and in addition, I collect very small pieces of data. In particular, I'm basically going to take 2 or 3 64-bit hashes and hash them to get another 64-bit hash. Is there a better option than sha1 for this purpose? Again, collisions should be very unlikely.

3) I am sql newb. Is it a good idea to use 64 bit hashes as id in sql? Will a 64-bit id cause performance issues in sqlite or postgres? I would need to coordinate data across several databases (including the Lucene index), so I decided that I should process the hashes directly in the tables, and not worry about auto-incrementing identifiers (which would only make sense in one db, and not in all data stores ) I believe the 64-bit bit is a good compromise: big enough for unlikely collisions, but it saves space (and search time?).

4) What about the CRC-64? Does this get a random distribution?

+6

hash sha1

Jegschemesch Apr 16 '09 at 3:31

source share

5 answers

For a good comparison of hash lengths, check out http://en.wikipedia.org/wiki/List_of_hash_functions

Also, just a note: SHA-1 is 160 bits, not 128.

+4

Smashery Apr 16 '09 at 3:41

source share

Your keys will require absolute uniqueness, not a high probability of uniqueness. I would suggest using a GUID instead of hashes for your keys for compatibility between databases. Create a hash as a quick search mechanism - it may have a unique code, but in case of a collision you will have to compare the actual data to make sure that they are the same. When synchronizing your databases, you can check the hash (quickly using the index), and if you encounter a collision, then allow whether the data is the same, and therefore the GUID should be allowed. If there is no collision, then simply update any required database for the missing record and insert it using the GUID from another database.

I also see little point in creating my own hash of hashes to save space. If you already have other hashes, just use them (add, don't rephrase). If not, just use a standard hash function like MD5 or SHA1 and save the data.

+3

tvanfosson Apr 16 '09 at 3:49

source share

With 64-bit hashes, you have a 1% chance of collision with 6.1 × 10 ⁸ records. (For other combinations, see the Wikipedia article on the birthday issue .) You can throw the first 64-bit or the last of every second bits, this has nothing to do with the properties of the hash.

+2

mjs May 25 '11 at 10:39

source share

If the calculation time is not important, why not go all 128 bits? Is there any real reason for choosing 64 bits next to possible storage issues? (and then the extra 8 bytes are not going to kill you with such cheap memory)

64 bits versus 128 bits will not cause speed problems in SQLite, I'm not sure about mySQL.

0

Gregor brandt Apr 16 '09 at 3:38

source share

dwc · Accepted Answer · 2009-04-16T04:04:18+0000

If you have few records, it is almost certain that you will never have a 64 bit hash collision. You will likely fall into this category.

There should be no problem trimming a cryptographic hash such as sha1, because if the hash has an internal structure, it will not be good enough to be a crypto hash, and if there is no structure, then any subset of the bits should be pretty random. Please note that I am only talking about using this for identifiers, and not for cryptographic purposes!

But really, doesn't your SQL have its own GUID? And if so, why not use it?

OK to use only 64-bit sha1 hash as identifier?

More articles: