Search for a prefix in the radix / patricia trie tree

I am currently implementing a radix / patricia trie tree (whatever you want to name). I want to use it to search for prefixes in a dictionary on very limited hardware. It should work more or less as autocomplete, i.e. e. showing a list of words that matches the typed prefix.

My implementation is based on in this article , but this code does not contain a prefix search, although the author says:

[...] Suppose you want to list all nodes with keys with a common prefix "AB". You can do your first depth search, starting from this root, stopping whenever you encounter back edges.

But I do not see how this should work. For example, if I build a base tree from these words:

disease
imaginary
imagination
imagine
imitation
immediate
immediately
huge
in

I get the same "best match" for the "i" and "in" prefixes, so it's hard for me to collect all the matching words just by going through the tree with the best match.

In addition, there is an implementation of the base tree in Java , which has an implemented prefix search in RadixTreeImpl.java . This code explicitly checks all nodes (starting with a specific node) to match the prefix - it actually compares bytes.

Can someone please give me a detailed description of the implementation of prefix search on the founding trees? Is the algorithm used in a Java implementation the only way to do this?

+5
c ++ algorithm prefix patricia-trie
source share
4 answers

Think about what your trie encodes. On each node, you have a path that leads you to that node, so in your example, you start with & Lambda; (which is the capital of Lambda, this Greek font sucks) root node corresponding to an empty string. & Lambda; has children for each letter used, so in your dataset you have one branch for "i".

  • & Lambda;
  • & Lambda; β†’ "I"

There are two children in the "i" node: one for "m" and one for "n". The next letter is "n", so you accept this,

  • & Lambda; β†’ "I" β†’ "p"

and since the only word that begins with "i", "n" in your dataset is "located", there are no children from "n". That's a coincidence.

Now, say, the data set, instead of having an in, had an infindibulum. (What I find in SF remains as an exercise.) You still get to the β€œn” node in the same way, but then if the next letter you get is β€œq”, you know that the word doesn ' t appears in your dataset at all because there is no "q" branch. At this point, you say "okay, there is no coincidence." (Perhaps you will then start adding a word, perhaps not, depending on the application.)

But if the next letter is "f," you can keep going. You can short circuit this with a small number of ships: if you reach a node that represents a unique path, you can hang the entire line with node. When you get to this node, you know that the rest of the line should be "findibulum", so you used the prefix to match the whole line and returned it.

How do you use it? in many interpreters without UNIX commands, such as the old VAX DCL, you can use any unique command prefix. So, the equivalent of ls (1) was DIRECTORY , but no other command started with DIR, so you could type DIR , and that was as good as the whole word. If you could not remember the correct command, you can enter only "D" and press (I think) ESC; CLI DCL will return to you all the commands started with D that it can search very quickly.

+8
source share

It turns out that the GNU extensions for standard C ++ lib include an implementation of Patricia trie. It is being discovered as part of an expansion of policy-based data structures. See http://gcc.gnu.org/onlinedocs/libstdc++/ext/pb_ds/trie_based_containers.html

+3
source share

Alternative Algorithm: Keep It Simple Stupid!

Just make a sorted list of your keywords. When you have a prefix, binary search to find where that prefix will be in the list. All your possible completions will be found starting from this index, ready for access on the spot.

This algorithm will require only 5% of the Patricia trie code and will be easy to maintain, understand and update. It is almost certain that this simple list search will be more efficient.

The only drawback is that if you have a huge number of long keywords with similar prefixes, trie can save some storage, since for each record you do not need to save the full prefix. In practice, if you have fewer than a few million words, this is not a saving because the pointer over the head of the tree will dominate. This savings is greater for applications such as searching for DNA string databases with millions of characters, rather than text keywords.

0
source share

Another alternative alg is a triple search tree (more efficient with memory) https://github.com/varunpant/TernaryTree/tree/master/TernaryTree

0
source share

All Articles