Parallel trie building algorithm?

Since the trie data structure has such a large branching factor and each subtree is completely independent of the others, it seems like there should be a way to significantly speed up the construction of a trie for a given dictionary by adding all the words in parallel.

My initial idea on how to do this is as follows: associate a mutex with each pointer in the trie (including the pointer to the root), and have each thread follow the normal algorithm for inserting a word into the trie. However, before following any pointer, a thread must first acquire the lock on that pointer, so that if it needs to add a new child node to the trie, it can do so without introducing any data races.
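
Concretely, the scheme I have in mind would look something like this (a minimal C++ sketch, assuming a lowercase a-z alphabet; the Node layout and insert routine are just illustrative):

#include <array>
#include <memory>
#include <mutex>
#include <string>

struct Node {
    // One child pointer per letter, plus one mutex guarding each child pointer.
    std::array<std::unique_ptr<Node>, 26> child;
    std::array<std::mutex, 26> child_lock;
};

// Called concurrently by many threads, one word at a time.
void insert(Node* root, const std::string& word) {
    Node* cur = root;
    for (char ch : word) {
        int i = ch - 'a';
        // Lock the pointer before following it, creating the child if needed.
        std::lock_guard<std::mutex> guard(cur->child_lock[i]);
        if (!cur->child[i])
            cur->child[i] = std::make_unique<Node>();
        cur = cur->child[i].get();
    }
    // (marking end-of-word flags omitted for brevity)
}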

The catch with this approach is that it uses a huge number of locks - one for each pointer in the trie - and performs a huge number of acquisitions and releases - one for each character of each input string.

Is there a way to build the trie in parallel without using nearly as many locks?

+8
string algorithm parallel-processing data-structures trie
4 answers

The obvious lock-free algorithm would be:

  • Bucket-sort the input strings by their length-k prefix (usually k = 1, but with a small alphabet it may be worth increasing k).
  • For each prefix, build a trie containing the k-suffixes of all strings that start with that prefix.
  • Merge the tries from the previous step (when k = 1, just add a root node and attach them as its children).

Assuming a uniform distribution of prefixes, this can give a speedup linear in the size of the alphabet raised to the power k.
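
A rough sketch of this scheme for k = 1 (C++ with std::async; the Node layout and the lowercase a-z assumption are mine, not part of the algorithm):

#include <array>
#include <future>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::array<std::unique_ptr<Node>, 26> child;
    bool end = false;
};

// Plain sequential insert, used inside each bucket (no locks needed there).
void insert(Node* root, const std::string& word) {
    Node* cur = root;
    for (char ch : word) {
        int i = ch - 'a';
        if (!cur->child[i]) cur->child[i] = std::make_unique<Node>();
        cur = cur->child[i].get();
    }
    cur->end = true;
}

// Build the subtrie for one bucket: all words share the same first letter,
// so we insert their 1-suffixes into a private trie.
std::unique_ptr<Node> build_bucket(const std::vector<std::string>& words) {
    auto sub = std::make_unique<Node>();
    for (const auto& w : words) insert(sub.get(), w.substr(1));
    return sub;
}

std::unique_ptr<Node> build_parallel(const std::vector<std::string>& dict) {
    // Step 1: bucket the input strings by their length-1 prefix.
    std::array<std::vector<std::string>, 26> buckets;
    for (const auto& w : dict) buckets[w[0] - 'a'].push_back(w);

    // Step 2: build each non-empty bucket's trie in its own task.
    std::array<std::future<std::unique_ptr<Node>>, 26> futures;
    for (int i = 0; i < 26; ++i)
        if (!buckets[i].empty())
            futures[i] = std::async(std::launch::async, build_bucket, std::cref(buckets[i]));

    // Step 3: combine -- with k = 1 the merge is just attaching each subtrie to a new root.
    auto root = std::make_unique<Node>();
    for (int i = 0; i < 26; ++i)
        if (futures[i].valid())
            root->child[i] = futures[i].get();
    return root;
}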

+8

It just occurred to me that this can be done lock-free, using atomic compare-and-set operations on the pointers instead of locks. In particular, when a thread wants to follow a pointer, it does the following:

  • Atomically read the value of the pointer.
  • If the pointer is not null, follow it. Done.
  • Otherwise, allocate a new node.
  • Atomically compare-and-set: if the pointer is still null, set it to the new node.
  • (Note: the pointer is now definitely not null. Either we just set it, or another thread did.)
  • Follow the pointer.

Depending on the hardware, this can be much faster, since it avoids the constant locking and unlocking and guarantees that a thread never waits indefinitely.

One drawback is that it does more allocation work, since several threads may each allocate a node destined for the same slot in the trie, but only one of them gets to install it there. Fortunately, this can be mitigated with the following optimization: when a thread allocates a node that turns out to be unnecessary, instead of freeing it immediately, it simply keeps the node in thread-local storage. The next time it needs a new node, it can reuse the cached one; if it never does, it frees it at the very end.
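
Roughly, the insert could look like this (a C++ sketch with std::atomic; the thread_local spare node implements the caching optimization above, the a-z alphabet is just an assumption, and reclamation of the trie's memory is omitted):

#include <array>
#include <atomic>
#include <string>

struct Node {
    std::array<std::atomic<Node*>, 26> child{};  // child pointers start out null
    std::atomic<bool> end{false};
};

// Node this thread allocated but lost the race to install; reused next time.
// (A real implementation would also delete it at thread exit if still set.)
thread_local Node* spare = nullptr;

void insert(Node* root, const std::string& word) {
    Node* cur = root;
    for (char ch : word) {
        auto& slot = cur->child[ch - 'a'];
        Node* next = slot.load(std::memory_order_acquire);   // atomically read the pointer
        if (next == nullptr) {
            Node* fresh = spare ? spare : new Node();         // allocate (or reuse the cached node)
            spare = nullptr;
            Node* expected = nullptr;
            // Install `fresh` only if the slot is still null.
            if (slot.compare_exchange_strong(expected, fresh,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire)) {
                next = fresh;        // we won the race
            } else {
                spare = fresh;       // lost the race: cache the node for later
                next = expected;     // `expected` now holds the winner's node
            }
        }
        cur = next;                  // the pointer is definitely non-null here
    }
    cur->end.store(true, std::memory_order_release);
}

compare_exchange_strong is used rather than the weak variant so a spurious failure cannot send us down the "lost the race" path with a null pointer.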

Hope this helps!

+4

Well, there is an obvious trade-off between fine and coarse granularity here: have a lock guard a set of nodes rather than a single one.

An easy way to do this is via hashing: keep m different locks, and for each node you want to access, acquire lock number hash(node) % m. Note that this approach is basically a generalization of both the proposed approach (with perfect hashing and n == m) and the sequential approach (with m == 1).
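
For instance (a C++ sketch; the value of m, the Node layout and the insert routine are illustrative only):

#include <array>
#include <functional>
#include <memory>
#include <mutex>
#include <string>

struct Node {
    std::array<std::unique_ptr<Node>, 26> child;
    bool end = false;
};

constexpr std::size_t m = 64;        // number of lock "stripes"
std::array<std::mutex, m> stripes;

// hash(node) % m: pick the stripe that guards a given node.
std::mutex& lock_for(const void* node) {
    return stripes[std::hash<const void*>{}(node) % m];
}

void insert(Node* root, const std::string& word) {
    Node* cur = root;
    for (char ch : word) {
        int i = ch - 'a';
        std::lock_guard<std::mutex> guard(lock_for(cur));  // one lock per set of nodes
        if (!cur->child[i]) cur->child[i] = std::make_unique<Node>();
        cur = cur->child[i].get();
    }
    std::lock_guard<std::mutex> guard(lock_for(cur));
    cur->end = true;
}

Tuning m trades memory and contention against each other, which is exactly the fine- vs. coarse-grained knob described above.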

Another thing that can be used: an optimistic design. Whether this actually improves performance depends on the distribution of the dictionary and the size of the trie, and it can help a lot if collisions are usually very rare (which may be the case for a dictionary of very long words). The idea is to simply add words to the trie without any synchronization, and if you run into a collision, roll back to the last known stable state (this, of course, requires keeping that state around, and may not be feasible if we are talking about data streams that cannot be stored).

+1

Depending on what your dictionary looks like, you might not need locks at all if you can get each thread to build independent subtries. If this is not an online algorithm, pre-sort the words by a prefix (the first letter if you have < 26 threads, the first and second letters if you have more or you know the data is unbalanced, e.g. 90% of the words start with 'A'). Basically this is an O(n) operation: one pass to count how many words begin with each letter, then one pass to sort (a bucket sort along those lines works fine). Then split the prefixes among the threads and have each thread build its independent subtries. Finally, a single thread attaches each of those subtries to the root. I will walk through an example below.

Your dictionary:
Bark
Apple
Cookie
And
Baby
Corn
Blue
Cake
Bacon

After sorting:
Apple
And
Bark
Baby
Blue
Bacon
Corn
Cookie
Cake

Then we divide the prefixes among the threads. For this example, we have 3 threads that receive the prefixes [A], [B] and [C] and build the following tries:

 A -+- P - P - L - E
    '- N - D

 B -+- A -+- R - K
    |     +- B - Y
    |     '- C - O - N
    '- L - U - E

 C -+- A - K - E
    '- O -+- R - N
          '- O - K - I - E

And then you have a single thread attach them to the root in sequence:

 Root -+- A -+- P - P - L - E
       |     '- N - D
       +- B -+- A -+- R - K
       |     |     +- B - Y
       |     |     '- C - O - N
       |     '- L - U - E
       '- C -+- A - K - E
             '- O -+- R - N
                   '- O - K - I - E

I hope this made sense.

Advantages of this method: the threads work almost completely independently of each other, and you do not have the overhead of acquiring and releasing locks.

Disadvantages of this method: if you know nothing about the dictionary, a serious workload imbalance can occur, and in the worst case (say, all words begin with "A") it degenerates into a single thread building the whole trie. There are ways to do better, for example you can add a check during the sort so that if the workload is very unbalanced with one-letter prefixes you switch to the first two letters, but you still cannot guarantee that it will be balanced.

You can also end up with idle threads: if, say, you have 20 threads and sort by the first letter, then 6 threads have to build two subtries each while the other 14 sit idle for half the time. You can subdivide the subtries further to handle this, but that is extra time spent in the preprocessing step.

In any case, there is no guarantee that this is faster than your method, but it is something to consider.
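
If it helps, here is a rough sketch of this per-letter partitioning (C++ with std::thread; Node, insert_suffix and the round-robin letter assignment are my own illustrative choices, and lowercase a-z input is assumed):

#include <array>
#include <memory>
#include <string>
#include <thread>
#include <vector>

struct Node {
    std::array<std::unique_ptr<Node>, 26> child;
    bool end = false;
};

// Sequential insert of word[1..] under the node representing its first letter.
void insert_suffix(Node* letterNode, const std::string& word) {
    Node* cur = letterNode;
    for (std::size_t p = 1; p < word.size(); ++p) {
        int i = word[p] - 'a';
        if (!cur->child[i]) cur->child[i] = std::make_unique<Node>();
        cur = cur->child[i].get();
    }
    cur->end = true;
}

std::unique_ptr<Node> build(const std::vector<std::string>& dict, int nThreads) {
    // The counting/sorting pass: one bucket per first letter.
    std::array<std::vector<std::string>, 26> buckets;
    for (const auto& w : dict) buckets[w[0] - 'a'].push_back(w);

    // Each thread owns a disjoint set of first letters and builds those subtries.
    std::array<std::unique_ptr<Node>, 26> sub;
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t) {
        workers.emplace_back([&, t] {
            for (int letter = t; letter < 26; letter += nThreads) {
                if (buckets[letter].empty()) continue;
                sub[letter] = std::make_unique<Node>();
                for (const auto& w : buckets[letter])
                    insert_suffix(sub[letter].get(), w);
            }
        });
    }
    for (auto& w : workers) w.join();

    // Final step: a single thread attaches the finished subtries to the root.
    auto root = std::make_unique<Node>();
    for (int letter = 0; letter < 26; ++letter)
        root->child[letter] = std::move(sub[letter]);
    return root;
}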

+1
