There are two (slightly) orthogonal questions.
Although the hash function is obviously important, in general you separate the hash function design from the backend design:
- the hash function depends on the data to be stored;
- the backend depends on the storage requirements.
For hash functions, I would suggest reading about CityHash or MurmurHash (there is a description of the latter on SO).
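To make the separation concrete, here is a minimal C++ sketch of a non-cryptographic hash function. It is FNV-1a rather than CityHash or MurmurHash (chosen only because it fits in a dozen lines), but the backend consumes any of them the same way: bytes in, integer out.

```cpp
#include <cstdint>
#include <cstddef>

// FNV-1a, 64-bit: XOR each input byte into the state, then multiply
// by the FNV prime. Fast and simple, though weaker than CityHash/MurmurHash.
uint64_t fnv1a(const void* data, size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    uint64_t h = 14695981039346656037ULL;   // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;              // FNV prime
    }
    return h;
}
```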
For the table implementation itself, there are various issues, as you noted. Some notes:
- Are we talking about average-case or worst-case complexity? Without perfect hashing, achieving worst-case O(1) is nearly impossible as far as I know, although the frequency and cost of the worst case can be dampened considerably.
- Are we talking about amortized complexity? Amortized schemes generally give the best throughput, at the cost of occasional latency spikes (e.g., a full rehash when the table grows). Incremental (linear) rehashing trades slightly lower throughput for a smoother latency curve; see the sketch after this list.
- As for multithreading, note that the read/write pattern may affect the solution; consider the extreme cases: 1 producer and 99 readers is very different from 99 producers and 1 reader. In general, writes are harder to parallelize, since they may require modifying the structure itself. At worst, they may require serializing all access.
- Garbage collection is fairly trivial in the amortized case; with incremental rehashing it is a bit more involved, but it is probably the least challenging part.
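To illustrate the amortized-versus-incremental trade-off mentioned in the list, here is a hedged C++ sketch of incremental (linear) rehashing: two bucket arrays coexist, and every operation migrates a couple of buckets from the old array, so no single insert pays the full O(n) rehash cost. The names (`IncrementalMap`, `migrate_step`) and constants are mine for illustration; it handles neither duplicate keys nor erasure.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <optional>
#include <string>
#include <utility>
#include <vector>

class IncrementalMap {
    using Bucket = std::list<std::pair<std::string, int>>;

    std::vector<Bucket> old_, new_;
    size_t migrated_ = 0;   // how many buckets of old_ have been moved
    size_t size_ = 0;

    static size_t slot(const std::string& k, size_t n) {
        return std::hash<std::string>{}(k) % n;
    }

    // Move up to two old buckets per call: the O(n) rehash is spread
    // across many operations instead of hitting one unlucky insert.
    void migrate_step() {
        for (int i = 0; i < 2 && migrated_ < old_.size(); ++i, ++migrated_) {
            for (auto& kv : old_[migrated_])
                new_[slot(kv.first, new_.size())].push_back(std::move(kv));
            old_[migrated_].clear();
        }
        if (migrated_ == old_.size()) {   // migration finished
            old_.clear();
            migrated_ = 0;
        }
    }

public:
    IncrementalMap() : new_(8) {}

    void insert(const std::string& k, int v) {
        migrate_step();
        // Grow only once the previous migration is done (load factor 1).
        if (size_ + 1 > new_.size() && old_.empty()) {
            old_.swap(new_);
            new_.assign(old_.size() * 2, Bucket{});
            migrated_ = 0;
        }
        new_[slot(k, new_.size())].emplace_back(k, v);
        ++size_;
    }

    std::optional<int> find(const std::string& k) {
        migrate_step();
        if (!old_.empty()) {   // key may still live in the old table
            for (auto& kv : old_[slot(k, old_.size())])
                if (kv.first == k) return kv.second;
        }
        for (auto& kv : new_[slot(k, new_.size())])
            if (kv.first == k) return kv.second;
        return std::nullopt;
    }
};
```

The design choice is visible in `migrate_step`: growth never rehashes everything at once, so no single insert pays an O(n) spike; instead, every operation pays a small constant migration tax.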
You never mentioned the amount of data you intend to handle. Writers can update different buckets without interfering with each other, so if you have a lot of data, you can try to spread writes across buckets to avoid contention; the sketch below shows one common way to do that.
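One standard realization of that idea is lock striping: partition the buckets into stripes, each guarded by its own mutex, so writers that hash to different stripes never contend. This is a hedged sketch with made-up names (`StripedMap`, `kStripes`), not the design from the talk referenced below.

```cpp
#include <array>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Lock striping: N independent maps, each behind its own mutex.
// Writers hashing to different stripes proceed in parallel; only
// same-stripe writers contend. kStripes is an arbitrary constant.
class StripedMap {
    static constexpr size_t kStripes = 16;

    struct Stripe {
        std::mutex mu;
        std::unordered_map<std::string, int> map;
    };
    std::array<Stripe, kStripes> stripes_;

    Stripe& stripe_for(const std::string& k) {
        return stripes_[std::hash<std::string>{}(k) % kStripes];
    }

public:
    void put(const std::string& k, int v) {
        Stripe& s = stripe_for(k);
        std::lock_guard<std::mutex> lock(s.mu);
        s.map[k] = v;
    }

    bool get(const std::string& k, int& out) {
        Stripe& s = stripe_for(k);
        std::lock_guard<std::mutex> lock(s.mu);
        auto it = s.map.find(k);
        if (it == s.map.end()) return false;
        out = it->second;
        return true;
    }
};
```

For the 1-producer/99-readers end of the spectrum, swapping each `std::mutex` for a `std::shared_mutex` so readers take shared locks is the usual refinement.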
Literature:
- the Wikipedia article on hash tables surveys many different implementations; it is always good to look at the variety
- this Google Talk by Dr. Cliff Click (Azul Systems) presents a hash table designed for heavily multithreaded Java systems.