A dictionary with two hash functions in C#?

I have a huge (>10M) list of records. Each entry offers two hash functions:

  • Cheap: computes quickly, but its distribution is terrible (it can put 99% of the items into 1% of the hash space).
  • Expensive: takes a long time to compute, but its distribution is much better.

A regular dictionary lets me use only one of these hash functions. I would like a dictionary that uses the cheap hash function first and falls back on the expensive one only to resolve collisions.

Using a dictionary inside a dictionary seems like a good fit for this. At the moment I am using this monster:

Dictionary<int, Dictionary<int, List<Foo>>>  // outer key: cheap hash, inner key: expensive hash

I have since improved this design so that the expensive hash is only computed when two elements actually share the same cheap hash.

It works and does the job for me, but it looks like something that should have gone extinct 65 million years ago.

As far as I know, this functionality is not included in the base class library. I am about to write my own DoubleHashedDictionary class, but first I wanted to hear your opinions.

As for my specific case:

  • First hash function = number of files in the directory (fast)
  • Second hash function = sum of the file sizes in the directory (slow)
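
To make the layout concrete, here is a minimal sketch of what that nested-dictionary approach could look like with the two hashes described above (the Foo type, its fields, and the hash bodies are illustrative assumptions, not the asker's actual code):

using System.Collections.Generic;
using System.Linq;

// Illustrative entry type: each record describes one directory.
class Foo
{
    public string DirectoryPath = "";
    public int FileCount;        // basis of the cheap hash
    public long TotalFileSize;   // basis of the expensive hash

    public int CheapHash() => FileCount;                       // fast, badly distributed
    public int ExpensiveHash() => TotalFileSize.GetHashCode(); // slow to obtain, well distributed
}

class DoubleHashedLookup
{
    // Outer key: cheap hash; inner key: expensive hash.
    private readonly Dictionary<int, Dictionary<int, List<Foo>>> buckets =
        new Dictionary<int, Dictionary<int, List<Foo>>>();

    public void Add(Foo item)
    {
        if (!buckets.TryGetValue(item.CheapHash(), out var inner))
            buckets[item.CheapHash()] = inner = new Dictionary<int, List<Foo>>();

        if (!inner.TryGetValue(item.ExpensiveHash(), out var list))
            inner[item.ExpensiveHash()] = list = new List<Foo>();

        list.Add(item);
    }

    // Returns the entries whose cheap and expensive hashes both match the probe's.
    public IEnumerable<Foo> FindCandidates(Foo probe)
    {
        if (buckets.TryGetValue(probe.CheapHash(), out var inner) &&
            inner.TryGetValue(probe.ExpensiveHash(), out var list))
            return list;
        return Enumerable.Empty<Foo>();
    }
}

The refinement mentioned above (only calling the expensive hash once two entries actually share a cheap hash) would keep a single un-hashed entry per outer bucket and build the inner dictionary only when a collision occurs.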

Edit:

  • Changed the title and added more information.
  • Added some important details that were missing.
+4
4 answers

First, I think you are on the right track in implementing your own hash table, if what you describe is really what you need. But as a critic, I would like to ask a few questions:

Have you considered using something more unique to each record as the key?

I assume each entry holds file system directory information; have you considered using the full path as the key, perhaps prefixed with the machine name or IP address?

On the other hand, if you use the number of files as a hash key, will those directories never change? Because if the hash key/result changes, you will never be able to find the entry again.

While we are on the subject: if the contents/size of a directory never change, could you store that value somewhere so you do not have to recompute it every time?
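
A minimal sketch of that caching idea, assuming a hypothetical DirectoryEntry type and purely illustrative directory-scanning code:

using System.IO;
using System.Linq;

// Hypothetical entry type; field and method names are assumptions.
class DirectoryEntry
{
    public string DirectoryPath;
    private long? cachedSizeSum;   // cached result of the slow computation

    public DirectoryEntry(string directoryPath) => DirectoryPath = directoryPath;

    // Cheap hash: number of files in the directory.
    public int CheapHash() => Directory.GetFiles(DirectoryPath).Length;

    // Expensive hash: sum of the file sizes, computed once and then cached.
    // This only makes sense if the directory contents never change.
    public int ExpensiveHash()
    {
        cachedSizeSum ??= Directory.GetFiles(DirectoryPath)
                                   .Sum(f => new FileInfo(f).Length);
        return cachedSizeSum.Value.GetHashCode();
    }
}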

Just my two cents.

+1

In your case you are technically using a modified (A | B) function, not a double-hashed one. Still, depending on how huge your "huge" list of records is and on the characteristics of your data, consider the following:

  • A hash table that is 20% full but has a poor distribution can have a collision probability of more than 80%. That means the expected cost of your hash function would be roughly (0.8 × expensive + 0.2 × cheap) + (lookup cost). So if your table is more than about 20% full, you may not want to use the (A | B) scheme at all (a quick back-of-the-envelope version of this arithmetic is sketched after this list).

  • It is possible to construct a perfect hash function, but building one is roughly O(n³), which makes it impractical.

  • If performance is paramount, you can build a custom hash table tailored to your specific data by benchmarking various hash functions against your key set.
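
As a rough illustration of the first point, here is a back-of-the-envelope calculation with made-up relative costs (cheap hash = 1 unit, expensive hash = 100 units, lookup = 1 unit, all assumed purely for the arithmetic):

using System;

class ExpectedCostDemo
{
    static void Main()
    {
        // Assumed relative costs, purely for the sake of the arithmetic.
        const double cheap = 1, expensive = 100, lookup = 1;

        // Mirrors the formula above: at an 80% collision rate the expected
        // cost is (0.8 * expensive + 0.2 * cheap) + lookup.
        double twoStage = 0.8 * expensive + 0.2 * cheap + lookup;   // ~81.2 units
        double alwaysExpensive = expensive + lookup;                // 101 units

        Console.WriteLine($"two-stage scheme: ~{twoStage} units per operation");
        Console.WriteLine($"always expensive:  {alwaysExpensive} units per operation");
    }
}

At that collision rate the two-stage scheme is already paying most of the expensive hash's cost, which is why the extra complexity may not be worth it.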
+2

Have you looked at Power Collections or C5 Collections? The Power Collections library has not seen much activity lately, but the C5 stuff seems fairly up to date.

I am not sure either library has exactly what you need, but they are very useful and open source, so they may provide a decent base implementation for you to extend with the functionality you want.

+1

Basically you are talking about a hash table of hash tables, each level using a different implementation of GetHashCode... while that may work, I think you should consider seriously whether you would really get a performance gain over one approach or the other...

Will a significant number of objects really be located through the cheap hash alone, without having to fall back on the more expensive one to narrow the search further? Because if you cannot find a significant share of entries from the first calculation alone, you gain nothing by doing it in two stages (without knowing the data it is hard to predict whether this is the case).

If a significant share can be located in one step, you will probably have to do a little tuning to decide how many records to keep at each outer hash position before falling back on the inner "expensive" hash list rather than just walking the cheaply hashed data, but under certain circumstances I can see you getting a performance gain from this (such circumstances will be few and far between, but not unthinkable).

Edit

I just saw your response to the question: you plan to do both lookups independently... I doubt you will get any performance benefit from that which you could not get by simply tuning the main hash table a little better. Have you tried using a single dictionary with an appropriate capacity passed to the constructor, and perhaps the XOR of the two hash codes as the hash code?
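
A minimal sketch of that last suggestion, assuming a hypothetical Foo type with CheapHash()/ExpensiveHash() methods (the comparer, field names, and capacity are illustrative, not a definitive implementation):

using System;
using System.Collections.Generic;

// Hypothetical entry type with the two hash functions from the question.
class Foo
{
    public string DirectoryPath = "";
    public int FileCount;
    public long TotalFileSize;

    public int CheapHash() => FileCount;
    public int ExpensiveHash() => TotalFileSize.GetHashCode();
}

// Combines both hash codes into one, as suggested above.
class XorHashComparer : IEqualityComparer<Foo>
{
    public bool Equals(Foo x, Foo y) => x.DirectoryPath == y.DirectoryPath;

    public int GetHashCode(Foo obj) => obj.CheapHash() ^ obj.ExpensiveHash();
}

class Program
{
    static void Main()
    {
        // In the asker's case the capacity would be sized for roughly 10 million
        // entries; a smaller value is used here just to keep the demo light.
        var table = new Dictionary<Foo, string>(1_000, new XorHashComparer());

        var entry = new Foo { DirectoryPath = @"C:\temp", FileCount = 3, TotalFileSize = 4096 };
        table[entry] = "some payload";
        Console.WriteLine(table[entry]);
    }
}

The trade-off is that the XOR approach always computes the expensive hash up front; the answer's point is that a single well-sized table may still beat the two-stage scheme.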

+1
