Which is more suitable in terms of serialization and concurrency: a hash table or a BST?

I think a hash table would be more appropriate if we resolve collisions by chaining, because for any read or write we only need a lock on the one bucket (index and chain) being touched, whereas with a BST we would need to lock the entire tree to make any update.

It seems to me that we need to lock the entire BST structure: imagine we need to insert a new node into the tree. We must first traverse it to reach the correct parent position (say, node A), and if we do not hold a lock during that traversal, the tree structure can change underneath us and we would have to start over. (Fine-grained schemes such as hand-over-hand lock coupling do exist, but they are considerably more complex.)
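The coarse-grained approach described above can be sketched as follows, assuming one global pthread mutex guarding the whole tree (function names are mine, for illustration):

```c
/* Sketch: coarse-grained locking for a BST. Every operation takes the
   same whole-tree mutex, so the tree cannot change under a thread
   mid-traversal -- at the cost of serializing all access. */
#include <pthread.h>
#include <stdlib.h>

typedef struct node {
    int key;
    struct node *left, *right;
} node;

static node *root = NULL;
static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;

/* Insert under the whole-tree lock: the walk down to the empty child
   slot and the linking of the new node happen as one step with
   respect to other threads. */
void bst_insert(int key) {
    pthread_mutex_lock(&tree_lock);
    node **slot = &root;
    while (*slot)
        slot = key < (*slot)->key ? &(*slot)->left : &(*slot)->right;
    node *n = malloc(sizeof *n);
    n->key = key;
    n->left = n->right = NULL;
    *slot = n;
    pthread_mutex_unlock(&tree_lock);
}

/* Even a read conservatively takes the same lock, since a concurrent
   insert could otherwise restructure the path being walked. */
int bst_contains(int key) {
    pthread_mutex_lock(&tree_lock);
    node *cur = root;
    while (cur && cur->key != key)
        cur = key < cur->key ? cur->left : cur->right;
    pthread_mutex_unlock(&tree_lock);
    return cur != NULL;
}
```

This is exactly the bottleneck the question is pointing at: two threads inserting into disjoint subtrees still contend on the single `tree_lock`.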

In the case of a hash table, a given key always hashes to the same position, so we know up front which index to lock; with a BST, which nodes an operation will touch is not known in advance.
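A minimal sketch of that per-bucket idea, assuming chaining, a fixed bucket count, and one pthread mutex per bucket (names like `table_put` are mine, not from any particular library):

```c
/* Sketch: a chained hash table with one mutex per bucket. A key always
   hashes to the same bucket, so a thread knows exactly which lock to
   take; operations on other buckets proceed in parallel. */
#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 1024

typedef struct entry {
    int key, value;
    struct entry *next;
} entry;

static entry *buckets[NBUCKETS];
static pthread_mutex_t bucket_lock[NBUCKETS];

void table_init(void) {
    for (int i = 0; i < NBUCKETS; i++)
        pthread_mutex_init(&bucket_lock[i], NULL);
}

static unsigned hash(int key) { return (unsigned)key % NBUCKETS; }

void table_put(int key, int value) {
    unsigned b = hash(key);
    pthread_mutex_lock(&bucket_lock[b]);   /* lock only this chain */
    entry *e;
    for (e = buckets[b]; e; e = e->next)
        if (e->key == key) { e->value = value; break; }
    if (!e) {                              /* not found: prepend */
        e = malloc(sizeof *e);
        e->key = key; e->value = value; e->next = buckets[b];
        buckets[b] = e;
    }
    pthread_mutex_unlock(&bucket_lock[b]);
}

int table_get(int key, int *out) {
    unsigned b = hash(key);
    pthread_mutex_lock(&bucket_lock[b]);
    int found = 0;
    for (entry *e = buckets[b]; e; e = e->next)
        if (e->key == key) { *out = e->value; found = 1; break; }
    pthread_mutex_unlock(&bucket_lock[b]);
    return found;
}
```

Two threads working on keys that hash to different buckets never contend, which is the concurrency advantage the question describes.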

Please correct me wherever I am wrong, and help me find the right answer.

PS: This is an interview question with Amazon.

2 answers

I think that in terms of concurrency what you said is correct and a hash table would be the better choice, but in terms of serialization I think a BST would be better. In a BST we serialize only the data nodes, whereas a hash table stores key-value pairs for the data as well as a number of slots that hold no value at all. Serializing a hash table therefore requires more space than serializing a BST.
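As a back-of-the-envelope illustration of the space argument (the 8-byte record size and the 1024-slot capacity below are my assumptions, not from the answer): a BST holding 10 keys serializes as exactly 10 records, while a naive dump of a hash table with open addressing writes every slot, occupied or not.

```c
/* Rough serialized-size model: one fixed-size record per item written.
   A BST writes one record per stored node; a naive hash-table dump
   writes one record per slot, including the empty ones. */
long serialized_bytes(long records_written, long bytes_per_record) {
    return records_written * bytes_per_record;
}
```

For example, `serialized_bytes(10, 8)` gives 80 bytes for the BST versus `serialized_bytes(1024, 8)` = 8192 bytes for a full-capacity dump. A chained table can skip empty slots, but each entry still carries a key alongside its value, while a BST node's key is the data itself.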


You can make a hybrid solution: a hash table where each slot is a binary tree. Say 1024 slots, each with its own tree. Collisions on the hash go into that slot's tree, and there is no need to lock everything on an insert, only the one tree that needs to be updated.
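A sketch of that hybrid, assuming 1024 slots and one mutex per slot (a simplified mutex version; the project linked below uses atomics instead):

```c
/* Sketch: hash-of-trees hybrid. Each hash slot holds its own small BST
   plus its own mutex, so an insert locks only the one tree its key
   hashes to, never the whole structure. */
#include <pthread.h>
#include <stdlib.h>

#define SLOTS 1024

typedef struct tnode {
    int key;
    struct tnode *left, *right;
} tnode;

static tnode *slot_root[SLOTS];
static pthread_mutex_t slot_lock[SLOTS];

void hybrid_init(void) {
    for (int i = 0; i < SLOTS; i++)
        pthread_mutex_init(&slot_lock[i], NULL);
}

void hybrid_insert(int key) {
    unsigned s = (unsigned)key % SLOTS;   /* pick the slot */
    pthread_mutex_lock(&slot_lock[s]);    /* lock just that slot's tree */
    tnode **p = &slot_root[s];
    while (*p)
        p = key < (*p)->key ? &(*p)->left : &(*p)->right;
    tnode *n = malloc(sizeof *n);
    n->key = key; n->left = n->right = NULL;
    *p = n;
    pthread_mutex_unlock(&slot_lock[s]);
}

int hybrid_contains(int key) {
    unsigned s = (unsigned)key % SLOTS;
    pthread_mutex_lock(&slot_lock[s]);
    tnode *cur = slot_root[s];
    while (cur && cur->key != key)
        cur = key < cur->key ? cur->left : cur->right;
    pthread_mutex_unlock(&slot_lock[s]);
    return cur != NULL;
}
```

The hashing gives the "known index to lock" property of a hash table, while the per-slot trees keep collision chains logarithmic instead of linear.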

In addition, if you use atomic operations and a few tricks, you can make it even more parallel and avoid locking entirely: insert the new node with an atomic compare-and-swap, and if the swap fails because another thread added a node first, just loop and retry until you succeed.
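That retry loop can be sketched with C11 `<stdatomic.h>` (my choice for a self-contained example; the project below uses the GCC builtins instead). A new node is published with one compare-and-swap on an empty child pointer; a failed CAS means another thread won the race, so we re-walk and try again:

```c
/* Sketch: lock-free BST insert via compare-and-swap and retry.
   Insert-only; deletion and rebalancing need much more machinery. */
#include <stdatomic.h>
#include <stdlib.h>

typedef struct anode {
    int key;
    _Atomic(struct anode *) left, right;
} anode;

static _Atomic(anode *) aroot = NULL;

void lockfree_insert(int key) {
    anode *n = malloc(sizeof *n);
    n->key = key;
    atomic_init(&n->left, NULL);
    atomic_init(&n->right, NULL);

    for (;;) {                               /* retry loop */
        _Atomic(anode *) *slot = &aroot;
        anode *cur = atomic_load(slot);
        while (cur) {                        /* walk to an empty child */
            slot = key < cur->key ? &cur->left : &cur->right;
            cur = atomic_load(slot);
        }
        anode *expected = NULL;
        /* Publish the node; this fails if another thread filled the
           child slot between our load and the CAS. */
        if (atomic_compare_exchange_strong(slot, &expected, n))
            return;
        /* Lost the race: re-walk from the root and try again. */
    }
}

/* Readers never lock; they just follow atomically-loaded pointers. */
int lockfree_contains(int key) {
    anode *cur = atomic_load(&aroot);
    while (cur && cur->key != key)
        cur = atomic_load(key < cur->key ? &cur->left : &cur->right);
    return cur != NULL;
}
```

Note that readers are never blocked at all, which is the property the answer claims for its hybrid below.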

I have already built this: https://github.com/exodist/GSD/tree/master/Structures

It is a highly concurrent hash/tree hybrid. It uses atomic operations, not mutexes. It can spin on an insert, but it never blocks reads or updates of existing keys. It can be resized/rebalanced/etc. in place, migrating all entries. It manages its own memory and has callbacks for ref-counting keys/values. You can also "bind" keys in one dictionary to another, so updating the key in the first changes the value in the second.

This project actually gains performance as you throw more threads at the same problem:

(T is how many threads, MI can be ignored, Slots is how many hash slots are used, and Ops is the number of items inserted, divided among the threads to see how many each thread performed.)

./performance.run
 T, MI,   Slots,     Ops:      Insert   Rebalance      Update      Resize      Lookup      Immute        Total
 4, 16,    1024, 5000000: 2.765500363 0.915232065 2.540659476 2.172654813 2.545409560 2.089993698 13.029449975
 4, 16,   32768, 5000000: 2.122459866 1.403548309 2.413483374 1.885083901 2.092272466 2.643681518 12.560529434
 4, 16, 1048576, 5000000: 1.700994809 1.063704010 2.030809367 2.457275707 1.453413924 3.671012425 12.377210242
16, 16,    1024, 5000000: 0.785675781 2.311281904 1.805610753 0.621521146 0.549546473 0.744009412  6.817645469
16, 16,   32768, 5000000: 0.497359638 0.316017553 1.257663142 0.610761414 0.390849355 0.825944608  3.898595710
16, 16, 1048576, 5000000: 0.328308989 0.647632801 1.267230625 1.139402693 0.342399827 1.189220470  4.914195405
64, 16,    1024, 5000000: 0.129407132 0.767262021 2.631929019 0.157977313 0.103848004 0.177964574  3.968388063
64, 16,   32768, 5000000: 0.087656606 0.068330231 1.365794852 0.166261966 0.079112728 0.203542885  1.970699268
64, 16, 1048576, 5000000: 0.074605680 0.284322979 1.372998607 0.650503349 0.084956938 0.828653807  3.296041360

Notes: single run on an i7 with 8 GB of RAM. It uses the GCC atomic builtins and is mid-transition from the old __sync builtins to __atomic, which may help performance thanks to the finer-grained memory models.

