C: Generating hash keys for large datasets?

Question

C: Generating hash keys for large datasets?

I am currently playing with hashing and key generation, trying to create my own hash key generator.

At the moment I have a list of 90,000 lines (every 1 word and another word). I was wondering what is the best way to generate keys (numeric keys, not string keys)?

Currently, depending on the word last ascii, I am doing a calculation based on the meaning of the letter.

As a result, about 50% of the words generate a key that collides with another.

I used quadratic probing to then find a place in the table for the rest of the words.

My question, as above, is usually the best way to generate a key for 90,000 different words? I know that the larger the data set, the more likely collisions will be, but how would you suggest / or minimize collisions?

Edit: Also - I don't need cryptography, it just needs to be fast.

Thanks.

+4

c key hash

user2013417 Feb 14 '13 at 19:44

source share

5 answers

To prevent collisions, you need a good hash key generator.

Several algorithms are available. One recent and very fast one is called xxHash . It is written in C.

+4

Matt Feb 14 '13 at 20:20

source share

There can be no good choice, having chosen a hash table size of 90,000, there is a much better concept of perfect hashing, according to this, double hashing is used to search the table, and the other to maintain the list, you should try the multiplication method for both, I think what a good idea.

0

Atiq Feb 14 '13 at 19:53

source share

I have seen using Knut:

  register int h,k; register char *p; for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime;

Where hash_prime is just more than 4x, the expected number of active entries in the hash table.

See: Knuth literateprogramming.com , an example of Adventure.

Here's the hash code in context:

 #define hash_prime 1009/* the size of the hash table */ typedef struct { char text[6]; /* string of length at most 5 */ char word_type; /* a |wordtype| */ char meaning; } hash_entry; hash_entry hash_table[hash_prime]; /* the table of words we know */ void new_word(w,m) char *w; /* a string of length 5 or less */ int m; /* its meaning */ { register int h,k; register char *p; for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime; while (hash_table[h].word_type) { h++;if (h==hash_prime) h=0; } int lookup(w) char *w; /* a string that you typed */ { register int h; register char *p; register char t; t=w[5]; w[5]='\0'; /* truncate the word */ for (h=0,p=w;*p;p++) h=(*p+h+h)%hash_prime; /* compute starting address */ w[5]=t; /* restore original word */ if (h<0) return -1; /* a negative character might screw us up */ while (hash_table[h].word_type) { if (streq(w,hash_table[h].text)) return h; h++;if (h==hash_prime) h=0; } return -1; }

Please note that this code:

  register char t; // . . . t=w[5]; w[5]='\0'; /* truncate the word */ // . . . w[5]=t; /* restore original word */

For a specific requirement, you only need to look at the first 5 characters and it must be deleted so that you hash the entire word.

0

jimhark Feb 14 '13 at 20:15

source share

The term you want is an avalanche - a hash function that provides the optimal spread.

If you want your keys to be unique, and if your dataset has zero duplicates, then you can convert your word as base36 to base10. If you use stroull (), you can return really large integers.

 char *p=myword; for(; *p; p++) *p=toupper(*p); unsigned long long key=strtoull(myword, NULL, 36);

This can overflow and return a positive number. Some hashes with a given long string can overflow a 32-bit integer. Kerneghan hash and Bernstein hash do this.

In reality and as indicated by several other people:

Note that collisions are a function of size hash_table and an avalanche of hash functions modulo hash_table. Instead of truly unique keys, what you want may be a better algorithm and hash_table size.

0

jim mcnamara Feb 15 '13 at 22:16

source share

dasblinkenlight · Accepted Answer · 2013-02-14T19:50:24+0000

You can "borrow" the implementation of the Java String hashCode ^* :

 int hashCode(const char* s) { int h = 0; while (*s) { h = 31*h + (*s++); } return h; }

This function provides reasonable separation and is among the most widely used hash functions.

^* which, as it turns out , Java, in turn, is "borrowed" from the book of Kernigan and Richie on programming in C.

C: Generating hash keys for large datasets?

More articles: