Pros and cons of using an MD5 hash vs. an int identity as the primary key in SQL Server

I have an application that processes a file, splits it into several segments, and saves the result in a SQL Server database. There are many duplicate files (possibly with different file paths), so I first go through all the files, compute the MD5 hash of each one, and mark duplicates using the [Duplicated] column.

I then run this application every day and save the results in the [Result] table. The database schema is as follows:
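The hashing and duplicate-marking step described above could look roughly like the following Python sketch. The function names, and the rule that the first path seen with a given hash is treated as the original while later ones are marked duplicated, are assumptions made for illustration, not the asker's actual code.

```python
import hashlib
from typing import Dict, Set, Tuple


def md5_digest(data: bytes) -> bytes:
    """Return the raw 16-byte MD5 digest, the size of a binary(16) column."""
    return hashlib.md5(data).digest()


def mark_duplicates(files: Dict[str, bytes]) -> Dict[str, Tuple[bytes, bool]]:
    """Map each path to (digest, duplicated).

    Assumption: the first path seen with a given digest counts as the
    original; every later path with the same digest is marked duplicated.
    """
    seen: Set[bytes] = set()
    rows: Dict[str, Tuple[bytes, bool]] = {}
    for path, content in files.items():
        digest = md5_digest(content)
        rows[path] = (digest, digest in seen)
        seen.add(digest)
    return rows
```

Each resulting pair corresponds to the [FileMd5Hash] and [Duplicated] values for one [FilePath] row.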

    CREATE TABLE [dbo].[FilePath]
    (
        [FilePath]     NVARCHAR(256) NOT NULL PRIMARY KEY,
        [FileMd5Hash]  binary(16)    NOT NULL,
        [Duplicated]   BIT           NOT NULL DEFAULT 0,
        [LastRunBuild] NVARCHAR(30)  NOT NULL DEFAULT 0
    )

    CREATE TABLE [dbo].[Result]
    (
        [Build]          NVARCHAR(30) NOT NULL,
        [FileMd5Hash]    binary(16)   NOT NULL,
        [SegmentId]      INT          NOT NULL,
        [SegmentContent] text         NOT NULL,
        PRIMARY KEY ([FileMd5Hash], [Build], [SegmentId])
    )

I need to join these two tables on FileMd5Hash.

Since the number of rows in [Result] is very large, I would like to add an int identity column and join the tables on that instead, as shown below:

    CREATE TABLE [dbo].[FilePath]
    (
        [FilePath]     NVARCHAR(256) NOT NULL PRIMARY KEY,
        [FileMd5Hash]  binary(16)    NOT NULL,
        [Id]           INT           NOT NULL IDENTITY,  -- added
        [Duplicated]   BIT           NOT NULL DEFAULT 0,
        [LastRunBuild] NVARCHAR(30)  NOT NULL DEFAULT 0
    )

    CREATE TABLE [dbo].[Result]
    (
        [Build]          NVARCHAR(30) NOT NULL,
        [Id]             INT          NOT NULL,  -- added, replaces [FileMd5Hash]
        [SegmentId]      INT          NOT NULL,
        [SegmentContent] text         NOT NULL,
        PRIMARY KEY ([Id], [Build], [SegmentId])
    )

So what are the pros and cons of these two approaches?

+8
sql database sql-server hash
3 answers

An int key is simpler to implement and easier to use and understand. It is also smaller (4 bytes versus 16 bytes), so indexes fit about twice as many entries per I/O page, which means better performance. The table rows are smaller too (admittedly not by much), so again more rows fit on a page, which means less I/O.

A hash can always produce collisions. Although they are exceedingly rare, the birthday problem shows that collisions become more and more likely as the number of records increases. The number of items needed for a 50% chance of a collision at various hash bit lengths is as follows:
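To put rough numbers on the page-density argument: a SQL Server page is 8 KB, of which 8,096 bytes remain after the 96-byte page header. The Python sketch below is back-of-the-envelope arithmetic only. The `overhead` parameter is a hypothetical stand-in for per-entry bookkeeping (row header, slot array entry, row locator), whose real size depends on the index, and the value 9 is arbitrary.

```python
PAGE_BYTES = 8192          # SQL Server page size
PAGE_HEADER_BYTES = 96     # fixed page header
USABLE_BYTES = PAGE_BYTES - PAGE_HEADER_BYTES  # 8096


def entries_per_page(key_bytes: int, overhead: int = 0) -> int:
    """Upper bound on index entries per page for a given key width.

    `overhead` is an assumed per-entry cost in bytes, not a measured value.
    """
    return USABLE_BYTES // (key_bytes + overhead)


for overhead in (0, 9):  # 9 is an arbitrary illustrative value
    int_keys = entries_per_page(4, overhead)
    md5_keys = entries_per_page(16, overhead)
    print(overhead, int_keys, md5_keys, round(int_keys / md5_keys, 2))
```

With no overhead the int key fits four times as many entries as the 16-byte key; once a fixed per-entry overhead is included, the ratio drops well below four, which is in line with the "about twice" estimate.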

    Hash length (bits)    Item count for 50% chance of collision
    32                    77,000
    64                    5.1 billion
    128                   22 billion billion
    256                   400 billion billion billion billion
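These figures follow from the usual birthday-problem approximation: for a hash with b bits, a 50% collision chance is reached after about sqrt(2 * ln 2 * 2^b) items. A small Python sketch that reproduces them (the function name is made up for this example, and uniformly distributed hash values are assumed):

```python
import math


def items_for_half_collision(bits: int) -> float:
    """Approximate item count giving a 50% chance of at least one collision.

    Uses the birthday-problem approximation n ~ sqrt(2 * ln(2) * 2**bits),
    which assumes hash values are uniformly distributed.
    """
    return math.sqrt(2 * math.log(2) * 2 ** bits)


for bits in (32, 64, 128, 256):
    print(f"{bits:>3} bits: {items_for_half_collision(bits):.2e} items")
```

MD5 is a 128-bit hash, so the 128-bit row is the one that applies to this question.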

There is also the issue of having to pass around non-ASCII bytes, which are harder to debug, send over the wire, and so on.

Use sequential int primary keys for your tables. Everybody else does.
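One common way to soften the debugging problem is to convert the raw digest to hexadecimal text whenever it has to be logged or sent over a text protocol, and back to bytes before it is stored. A minimal Python sketch (the sample input is arbitrary):

```python
import hashlib

raw = hashlib.md5(b"example file content").digest()  # 16 raw bytes, not printable
text = raw.hex()                                     # 32 hex characters, plain ASCII
restored = bytes.fromhex(text)                       # back to the original 16 bytes

print(len(raw), len(text), restored == raw)  # prints: 16 32 True
```

This does not remove the extra step; it only makes the value readable, at the cost of doubling its length in transit.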

+8

Here is a very good article explaining the pros and cons of using both:

http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html

Using an MD5 hash is similar to using a GUID as your primary key. Hash collisions are rare, but you may still want to handle them.

Personally, I would go with INT IDENTITY, but that may vary depending on your implementation.

+1

Use ints for primary keys, not hashes. Everyone warns about hash collisions, but in practice they are not a big problem; it is easy to check for collisions and re-hash. Sequential IDs can collide as well if you are merging databases.

The big problem with hashes as keys is that you cannot change your data. If you try, the hash changes and all foreign keys become invalid. You would have to add a "no, this is the real hash" column to your database, and your old hash just becomes a big non-sequential integer.

I bet your business analyst will say, "We implement WORM (write once, read many), so our records never change." They will be proven wrong.
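The failure mode is easy to reproduce. In this Python sketch two in-memory structures stand in for the parent and child tables (the names and sample data are invented for illustration): once the file content is edited and the parent row is re-keyed, the child rows that still carry the old hash no longer match anything.

```python
import hashlib


def md5_key(content: bytes) -> bytes:
    """Content-derived key: the raw 16-byte MD5 digest."""
    return hashlib.md5(content).digest()


old_key = md5_key(b"version 1 of the file")

files = {old_key: "report.txt"}            # parent rows keyed by hash
results = [(old_key, 1, "segment text")]   # child rows referencing the hash

# Edit the file and re-key the parent row by the new content hash.
new_key = md5_key(b"version 2 of the file")
files[new_key] = files.pop(old_key)

orphans = [row for row in results if row[0] not in files]
print(len(orphans))  # 1: the child row still points at the old hash
```

With a surrogate int key the parent row keeps its identifier when the content changes, so only the stored hash value needs updating.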

0