How to quickly hash a URL

Question

I have a unique situation where I need to produce a hash on the fly. This question is related to the one here. I need to store many URLs in a database, and they need to be indexed. A URL can contain more than 2000 characters, and the database complains that a string longer than 900 bytes cannot be indexed. My solution is to hash the URL using MD5 or SHA256, but I'm not sure which hash algorithm to use. Here are my requirements:

Shortest character length with minimal collisions
It needs to be very fast. I will hash the referrer URL on every page request
Collisions need to be minimized, since I may have millions of URLs in the database

I am not worried about security. I am worried about character length, speed, and collisions. Does anyone know of a good algorithm for this?

+7

c# algorithm c#-4.0 hash

Luke101 Oct 18 '11 at 15:31

7 answers

I would suggest using the System.Security.Cryptography.SHA1Cng class. The hash is 160 bits (20 bytes) long, so it should definitely be small enough. If you need a string, the hexadecimal form is only 40 characters, so it should suit your needs well. It should also be fast enough, and as far as I know, no collisions have been found yet.
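A minimal sketch of this suggestion, assuming the URL is encoded as UTF-8 before hashing. It uses SHA1.Create() rather than SHA1Cng directly so that it does not depend on one platform-specific class; the class and method names (UrlSha1, HashHex) are made up for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class UrlSha1
{
    // Returns the SHA-1 digest of the URL as a 40-character lowercase hex string.
    // Assumption: UTF-8 is an acceptable byte encoding for the URLs being stored.
    public static string HashHex(string url)
    {
        using (SHA1 sha1 = SHA1.Create())
        {
            byte[] digest = sha1.ComputeHash(Encoding.UTF8.GetBytes(url)); // 20 bytes
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

If the column can hold binary data, storing the raw 20 bytes instead of the hex string halves the key size.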

0

Ryan Oct 18 '11 at 15:37

~~I personally used String.GetHashCode(). This is the basic hash function. I honestly don't know how it performs compared to other implementations, but it should be fine.~~

Either of the two hash functions you name should be fast enough that you won't notice much difference between them. Unless this site requires ultra-high performance, I would not worry too much about it. I would personally probably go for MD5. Its 128-bit digest can be formatted as a 32-character hexadecimal string or as a 24-character Base64 string.

The reason for choosing MD5 is that you are unlikely to run into collisions, and even if you do, you can structure your queries as "where urlhash = @hash and url = @url". The database engine should work out that one column is indexed and the other is not, and use that information to perform a sensible search.

If there are collisions, the index scan on urlhash will return a few rows, which are then easily compared against the full URL text to find the correct one. This is unlikely to matter very often, since the chance of a collision is pretty low.
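The lookup pattern described above can be sketched without a database. Here a Dictionary stands in for the indexed urlhash column, and the full URL comparison plays the role of the "and url = @url" clause. The names UrlHashLookup, Md5Base64, Add, and Contains are hypothetical, and the in-memory store is only an illustration of the idea, not how the database does it.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class UrlHashLookup
{
    // Hypothetical in-memory stand-in for a table with an indexed urlhash column.
    private static readonly Dictionary<string, List<string>> Index =
        new Dictionary<string, List<string>>();

    // MD5 digest of the URL (UTF-8 bytes, an assumption) as a Base64 string.
    public static string Md5Base64(string url)
    {
        using (MD5 md5 = MD5.Create())
        {
            return Convert.ToBase64String(md5.ComputeHash(Encoding.UTF8.GetBytes(url)));
        }
    }

    public static void Add(string url)
    {
        string key = Md5Base64(url);
        List<string> bucket;
        if (!Index.TryGetValue(key, out bucket))
        {
            bucket = new List<string>();
            Index[key] = bucket;
        }
        if (!bucket.Contains(url))
        {
            bucket.Add(url);
        }
    }

    // Equivalent in spirit to: where urlhash = @hash and url = @url
    public static bool Contains(string url)
    {
        List<string> bucket;
        return Index.TryGetValue(Md5Base64(url), out bucket) && bucket.Contains(url);
    }
}
```

Because the final comparison is on the full URL, a hash collision only costs an extra string comparison and never returns the wrong row.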

0

Chris Oct 18 '11 at 15:37

quick:

 URLString.GetHashCode().ToString("x")

0

Birey Oct 18 '11 at 15:40

Although both MD5 and SHA1 have been shown to be weak where collision resistance is essential, I suspect that either will be sufficient for your application. I do not know for sure, but I suspect that MD5 is the simpler and faster of the two algorithms.

0

samjudson Oct 18 '11 at 15:41

The reflected source code of String.GetHashCode in .NET 4.0:

    public override unsafe int GetHashCode()
    {
        fixed (char* str = ((char*) this))
        {
            char* chPtr = str;
            int num = 0x15051505;
            int num2 = num;
            int* numPtr = (int*) chPtr;
            for (int i = this.Length; i > 0; i -= 4)
            {
                num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                if (i <= 2)
                {
                    break;
                }
                num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
                numPtr += 2;
            }
            return (num + (num2 * 0x5d588b65));
        }
    }

It performs O(n) simple operations (+, <<, ^) and one multiplication, so it is very fast.

I tested this function on a database of 3 million strings, each up to 256 characters long, and about 97% of the strings have no collisions (at most 5 strings share the same hash).

0

Ivan Bianko Oct 18 '11 at 15:50

You could take a look at the following project:

CMPH - C Minimal Perfect Hashing Library

And check out the list of hot answers on perfect hashing:

Hot Perfect Hash Answers - Stack Overflow

You could also use a full-text index in SQL instead of hashing:

CREATE FULLTEXT INDEX (Transact-SQL)

0

Jamiesee Oct 27 '11 at 16:02

MartinStettner · Accepted Answer · 2011-10-18T15:47:11+0000

In your case, I would not use a cryptographic hash function (e.g. MD5, SHA), since those were designed with security in mind: their main goal is to make it as hard as possible to deliberately find two different strings with the same hash. I don't think that is a concern in your case. (The possibility of random collisions is inherent to hashing, of course.)

I would not suggest using String.GetHashCode() either, as its implementation is not specified, and MSDN says that it may vary between different versions of the framework. Even the results of the x86 and x64 versions may differ. You would therefore run into problems when accessing the same database with a newer (or different) version of the .NET Framework.

I found the algorithm behind Java's hashCode implementation on Wikipedia (here), and it seems pretty simple to implement. Even a naive implementation should be faster than MD5 or SHA, IMO. You could also use long values, which reduces the probability of collisions.
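A minimal sketch of that idea, assuming the usual description of Java's String.hashCode as the polynomial h = 31 * h + c evaluated over the characters with wrap-around arithmetic. The class and method names (JavaStyleHash, Hash32, Hash64) are made up for illustration; Hash64 is simply the same recurrence carried out in a long.

```csharp
using System;

public static class JavaStyleHash
{
    // 32-bit variant: same recurrence as Java's String.hashCode.
    public static int Hash32(string s)
    {
        unchecked // let the arithmetic wrap on overflow instead of throwing
        {
            int h = 0;
            foreach (char c in s)
            {
                h = 31 * h + c;
            }
            return h;
        }
    }

    // 64-bit variant: same recurrence in a long, which gives a larger
    // hash space and therefore fewer random collisions.
    public static long Hash64(string s)
    {
        unchecked
        {
            long h = 0;
            foreach (char c in s)
            {
                h = 31 * h + c;
            }
            return h;
        }
    }
}
```

Unlike String.GetHashCode(), the result depends only on the characters of the string, so it stays the same across framework versions and between x86 and x64. The 64-bit value fits in a BIGINT column.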

There is also a brief analysis of the .NET implementation of GetHashCode here (not the algorithm itself, but some implementation details). You could use that one as well, I think, or try to implement the Java version in a similar way.
