When designing a hash table, which aspects should I consider?

Here are the candidate aspects I have so far:

  • The hash function is important: hash codes should be distributed as uniformly (collision-free) as possible.
  • The backing data structure is important: search, insert, and delete operations should have O(1) time complexity.
  • Memory management is important: the overhead of each hash table entry should be as low as possible. When the hash table grows, memory must be expanded efficiently, and when it shrinks, memory must be reclaimed efficiently, and aspect 2 must still hold through these operations.
  • If the hash table will be used from multiple threads, it must be thread safe and efficient.

My questions:

  • Are there any other aspects worth paying attention to?
  • How can I build a hash table that satisfies all of these aspects?
  • Are there any resources I can read?

Thank you very much!



After reading some material, here is an update to my question. :)


In a book explaining the SGI STL source code, I found some useful information:
  • The backing data structure is an array of buckets, each holding a linked list. When searching for, inserting, or deleting an item in the hash table:
    • Use the hash function to compute the corresponding bucket position; the element lives in the linked list attached to that bucket.
    • If the number of elements exceeds the number of buckets, the bucket array is resized: it grows to roughly twice the old size, rounded up to a prime. The old buckets and items are then rehashed into the new array.
    • I did not find any shrinking/garbage-collection logic for when the number of elements is much less than the number of buckets, but I think this case should be considered when many items are inserted first and then many are deleted later.
  • Other data structures, such as arrays with linear probing, are not as good as linked lists here.
  • A good hash function can avoid clustering, and double hashing can help resolve clusters.
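To make the scheme above concrete, here is a minimal separate-chaining set in that spirit: a vector of linked-list buckets, with the bucket count roughly doubled and rounded up to a prime once the element count exceeds it. Everything here (the names, the starting bucket count, the resize trigger) is my own sketch, not SGI's actual code:

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

template <typename Key>
class ChainedSet {
public:
    ChainedSet() : buckets_(5), size_(0) {}

    bool contains(const Key& k) const {
        const auto& b = buckets_[index(k, buckets_.size())];
        for (const Key& e : b)
            if (e == k) return true;
        return false;
    }

    void insert(const Key& k) {
        if (contains(k)) return;
        // Resize before the element count would exceed the bucket count.
        if (size_ + 1 > buckets_.size()) rehash(next_prime(2 * buckets_.size()));
        buckets_[index(k, buckets_.size())].push_front(k);
        ++size_;
    }

    void erase(const Key& k) {
        auto& b = buckets_[index(k, buckets_.size())];
        auto before = b.size();
        b.remove(k);
        size_ -= before - b.size();
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_.size(); }

private:
    static std::size_t index(const Key& k, std::size_t n) {
        return std::hash<Key>{}(k) % n;
    }

    // Smallest prime >= n; a linear scan is fine for a sketch.
    static std::size_t next_prime(std::size_t n) {
        auto is_prime = [](std::size_t x) {
            if (x < 2) return false;
            for (std::size_t d = 2; d * d <= x; ++d)
                if (x % d == 0) return false;
            return true;
        };
        while (!is_prime(n)) ++n;
        return n;
    }

    // Full rehash: every element is recomputed into the new bucket array.
    void rehash(std::size_t new_count) {
        std::vector<std::list<Key>> fresh(new_count);
        for (auto& b : buckets_)
            for (const Key& e : b)
                fresh[index(e, new_count)].push_front(e);
        buckets_.swap(fresh);
    }

    std::vector<std::list<Key>> buckets_;
    std::size_t size_;
};
```

Shrinking (the missing garbage-collection case) could be added symmetrically: rehash down to a smaller prime once `size_` falls well below `buckets_.size()`.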

The question about multithreading remains open. :D


2 answers

There are two (mostly) orthogonal questions here.

Although the hash function is obviously important, in general, you separate the backend design from the hash function design:

  • The hash function depends on the data to be stored.
  • The backend depends on the storage requirements.

For hash functions, I would suggest reading about CityHash or MurmurHash (with a description on SO).
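Just to show the shape of such a general-purpose (non-cryptographic) byte hash, here is a minimal 64-bit FNV-1a; for real use, the vetted functions above are better choices:

```cpp
#include <cstdint>
#include <string>

// 64-bit FNV-1a: start from a fixed offset basis, then for each byte
// XOR it in and multiply by the FNV prime. Simple, decent distribution.
inline std::uint64_t fnv1a(const std::string& data) {
    std::uint64_t h = 1469598103934665603ULL;   // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;                                  // mix in one byte
        h *= 1099511628211ULL;                   // FNV prime
    }
    return h;
}
```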

For the internals, there are various concerns, as you noted. Some notes:

  • Are we talking about average or worst-case complexity? Without perfect hashing, worst-case O(1) is almost impossible to achieve as far as I know, although the frequency and severity of the worst case can be reduced significantly.
  • Are we talking about amortized complexity? Amortized designs generally give the best overall throughput, at the cost of occasional bursts. Incremental (linear) rehashing trades slightly lower throughput for a smoother latency curve.
  • As for multithreading, note that the read/write pattern may affect the solution; consider the extreme cases: 1 writer and 99 readers is very different from 99 writers and 1 reader. In general, writes are harder to parallelize, since they may require structural changes. In the worst case, they may require serialization.
  • Garbage collection is rather trivial in the amortized case; with incremental rehashing it is a bit more complicated, but it is probably the least difficult part.
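To illustrate the amortized-vs-smooth trade-off above, here is a rough sketch of incremental rehashing: on a resize the old bucket array is kept around, and each subsequent operation migrates a couple of old buckets instead of rehashing everything in one burst. The names, the fixed growth factor, and the migration rate of two buckets per operation are all illustrative assumptions:

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

class IncrementalSet {
public:
    explicit IncrementalSet(std::size_t n = 8) : live_(n) {}

    void insert(int k) {
        step();                                   // pay a little migration work
        if (contains(k)) return;
        live_[std::hash<int>{}(k) % live_.size()].push_front(k);
        ++size_;
        // Start a new resize only once the previous migration has finished.
        if (old_.empty() && size_ > live_.size()) start_resize();
    }

    bool contains(int k) {
        step();
        if (!old_.empty()) {                      // element may still be in the old table
            auto& ob = old_[std::hash<int>{}(k) % old_.size()];
            for (int e : ob) if (e == k) return true;
        }
        auto& b = live_[std::hash<int>{}(k) % live_.size()];
        for (int e : b) if (e == k) return true;
        return false;
    }

    std::size_t size() const { return size_; }

private:
    void start_resize() {
        old_.swap(live_);
        live_.assign(old_.size() * 2, std::list<int>{});
        migrated_ = 0;
    }

    // Move at most two old buckets into the live table per call, so no
    // single operation ever pays for a full rehash.
    void step() {
        for (int moved = 0; moved < 2 && migrated_ < old_.size(); ++migrated_, ++moved)
            for (int e : old_[migrated_])
                live_[std::hash<int>{}(e) % live_.size()].push_front(e);
        if (!old_.empty() && migrated_ == old_.size()) old_.clear();
    }

    std::vector<std::list<int>> live_, old_;
    std::size_t migrated_ = 0, size_ = 0;
};
```

During a migration an element may briefly exist in both arrays; lookups check both, so correctness is preserved, and the duplicates vanish when the old array is dropped.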

You never said how much data you intend to store. Writers can update different buckets without interfering with each other, so if you have a lot of data, you can try to spread it out to avoid contention.
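One common way to exploit that observation is lock striping: shard the key space across several independently locked tables, so writers hitting different shards never block each other. A minimal sketch, where the shard count and the per-shard structure are assumptions rather than recommendations:

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_set>
#include <vector>

class StripedSet {
public:
    explicit StripedSet(std::size_t shards = 16)
        : shards_(shards), locks_(shards) {}

    void insert(int k) {
        std::size_t i = shard_of(k);
        std::lock_guard<std::mutex> g(locks_[i]);  // only this shard is locked
        shards_[i].insert(k);
    }

    bool contains(int k) {
        std::size_t i = shard_of(k);
        std::lock_guard<std::mutex> g(locks_[i]);
        return shards_[i].count(k) != 0;
    }

private:
    std::size_t shard_of(int k) const {
        return std::hash<int>{}(k) % shards_.size();
    }

    std::vector<std::unordered_set<int>> shards_;
    std::vector<std::mutex> locks_;               // one mutex per shard
};
```

With 16 shards, up to 16 writers can proceed in parallel as long as their keys hash to different shards; a single global lock would serialize all of them.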

Literature:

  • The Wikipedia article on hash tables covers many different implementations; it is always good to look at the variety
  • This Google Talk by Dr. Cliff Click (Azul Systems) shows a hash table designed for highly multi-threaded Java systems

I suggest you read http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable

The link points to a blog post by Cliff Click about hash tables. Some of his findings:

  • To go from hash to index, use a binary AND instead of modulo by a prime. It is many times faster. The size of your table should be a power of two.
  • For hash collisions, do not use a linked list; store the values in the table itself (open addressing) to improve cache performance.
  • Using a state machine, you can get a very fast multi-threaded implementation. In his blog entry he lists the states of the state machine, but due to license issues he does not provide the source code.
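A minimal sketch of the first two points, assuming integer keys used directly as their own hash: a power-of-two capacity so indexing is a mask (`hash & (size - 1)`), and open addressing with linear probing so values live flat in one array. Resizing and deletion (tombstones) are deliberately omitted, and the sentinel value for empty slots is my own choice:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class ProbedSet {
public:
    explicit ProbedSet(std::size_t pow2_capacity = 64)
        : slots_(pow2_capacity, kEmpty) {}

    // k must not equal the kEmpty sentinel. Returns false if already
    // present or if the (fixed-size) table is full.
    bool insert(std::uint64_t k) {
        std::size_t mask = slots_.size() - 1;     // works because size is 2^n
        for (std::size_t i = k & mask, n = 0; n < slots_.size();
             i = (i + 1) & mask, ++n) {           // linear probe, wrap via mask
            if (slots_[i] == k) return false;
            if (slots_[i] == kEmpty) { slots_[i] = k; return true; }
        }
        return false;
    }

    bool contains(std::uint64_t k) const {
        std::size_t mask = slots_.size() - 1;
        for (std::size_t i = k & mask, n = 0; n < slots_.size();
             i = (i + 1) & mask, ++n) {
            if (slots_[i] == k) return true;
            if (slots_[i] == kEmpty) return false; // probe chain ends here
        }
        return false;
    }

private:
    static constexpr std::uint64_t kEmpty = ~0ULL; // sentinel for a free slot
    std::vector<std::uint64_t> slots_;
};
```

Probing walks adjacent slots of one contiguous array, which is exactly the cache-friendliness argument: a collision costs a neighboring cache line, not a pointer chase into a linked list.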
