Using a binary search tree as a spell checker

I'm studying the most efficient way to turn a binary search tree into a spell checker by reading in, for example, a dictionary of 1000 words, and then checking a document of a couple of paragraphs against it.

+4
source
8 answers

A ternary search trie would be more efficient.
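For what it's worth, here is a minimal ternary search trie sketch in C# (the class and member names are mine, not from this thread): each node stores one character, left/right links for characters that sort before/after it, and a middle link that continues the word.

// Minimal ternary search trie sketch; assumes non-empty words.
class TstNode
{
    public char Ch;
    public bool IsWord;
    public TstNode Left, Middle, Right;
}

class TernarySearchTrie
{
    private TstNode root;

    public void Add(string word)
    {
        root = Add(root, word, 0);
    }

    private TstNode Add(TstNode node, string word, int i)
    {
        char c = word[i];
        if (node == null) node = new TstNode { Ch = c };
        if (c < node.Ch) node.Left = Add(node.Left, word, i);
        else if (c > node.Ch) node.Right = Add(node.Right, word, i);
        else if (i < word.Length - 1) node.Middle = Add(node.Middle, word, i + 1);
        else node.IsWord = true;
        return node;
    }

    public bool Contains(string word)
    {
        TstNode node = root;
        int i = 0;
        while (node != null)
        {
            char c = word[i];
            if (c < node.Ch) node = node.Left;
            else if (c > node.Ch) node = node.Right;
            else if (i == word.Length - 1) return node.IsWord;
            else { node = node.Middle; i++; }
        }
        return false;
    }
}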

+8
source

Are you set on using a binary search tree? A Bloom filter is likely to be a more efficient data structure.
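To make that concrete, here is a rough Bloom filter sketch in C# (the sizing and the double-hashing trick are my own assumptions, not anything specified here). It can answer "definitely not in the dictionary" or "probably in the dictionary", trading a small false-positive rate for very little memory:

using System;
using System.Collections;
using System.Collections.Generic;

// Rough Bloom filter sketch; hash mixing and sizes are illustrative choices.
class BloomFilter
{
    private readonly BitArray bits;
    private readonly int numHashes;

    public BloomFilter(int numBits, int numHashes)
    {
        bits = new BitArray(numBits);
        this.numHashes = numHashes;
    }

    public void Add(string word)
    {
        foreach (int index in Indexes(word))
            bits[index] = true;
    }

    public bool MightContain(string word)
    {
        foreach (int index in Indexes(word))
            if (!bits[index]) return false;   // definitely not present
        return true;                          // probably present (false positives possible)
    }

    // Double hashing: derive k bit positions from two per-process hash values.
    private IEnumerable<int> Indexes(string word)
    {
        int h1 = word.GetHashCode();
        int h2 = StringComparer.OrdinalIgnoreCase.GetHashCode(word);
        for (int i = 0; i < numHashes; i++)
        {
            int combined = h1 + i * h2;
            yield return (combined & int.MaxValue) % bits.Length;
        }
    }
}

For a 1000-word dictionary, something like new BloomFilter(16000, 7) keeps the false-positive rate low; the catch is that a false positive means a misspelled word slips through as correct.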

+3
source

If you are just trying to find out whether a word exists in your dictionary (that is, whether it is spelled correctly), I do not think a binary search tree is what you need. A better way to store this information would be a trie, where each node in the tree is a single character and reading the path from the root down to an end marker gives you the spelling of a word. You also need to add a marker to indicate the end of a word.

For example: let's say your dictionary contains the following words: car, cart, cat, cup, cut

C
    A
        R (end: "car")
            T (end: "cart")
        T (end: "cat")
    U
        P (end: "cup")
        T (end: "cut")

Checking whether a word exists means looking at each letter in turn and verifying that it exists among the children of the current node.

Check for "cat":
Does "C" exist at the root level? Yes, move to the next letter.
Does "A" exist underneath "C"? Yes, move on.
Does "T" exist underneath "A"? Yes, move on.
Is there an end-of-word marker after the "T"? Yes. The word exists.

Check for "cu":
Does "C" exist at the root level? Yes, move to the next letter.
Does "U" exist underneath "C"? Yes, move to the next letter.
Is there an end-of-word marker after the "U"? No. The word does not exist.

How you store this information is up to you. As Stephen pointed out, a ternary search trie might be one way to do it; alternatively, each node could have 27 possible child nodes (26 letters plus the end-of-word marker).
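As a minimal C# sketch of that idea (class and member names are illustrative, not from this thread), each node keeps its children in a dictionary keyed by character, plus the end-of-word flag:

using System.Collections.Generic;

// Minimal character-trie sketch; names are illustrative.
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsEndOfWord;   // the "end" marker from the diagram above
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out TrieNode child))
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
        }
        node.IsEndOfWord = true;
    }

    // Walks one letter at a time, exactly like the "cat" / "cu" checks above.
    public bool Contains(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return node.IsEndOfWord;
    }
}

Building the trie is one Add call per dictionary word; checking the document is one Contains call per text word.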

+1
source

This site should help you. It has an implementation in Java.

+1
source

If you also need to do autocompletion / prefix search, then you should look at a Patricia trie or a radix tree.

0
source

In the example you give, performance is likely to be irrelevant: on a PC the whole operation will take about 1% of the time it takes the user to read the first result you display, provided you are not using a completely stupid algorithm. Nevertheless, I will assume the problem is big enough that performance is a concern.

If the dictionary file is pre-sorted (as most are), and if the text is small relative to the dictionary, as you described, then I would be very tempted to sort the text, possibly removing duplicates, and then iterate over both lists side by side, using the same procedure as the merge step of a merge sort, except that you report whether each text word is in the dictionary instead of writing out a merged list.

This takes M log M comparisons for the sort, plus at most N + M comparisons for the iteration (possibly fewer, but not asymptotically fewer), where M is the number of words in the text and N the number in the dictionary. That is close to optimal complexity for a one-off operation: to get rid of the linear term in N you would have to find a way not to read the whole dictionary from disk at all. I am pretty sure that is possible with a bsearch over the file, especially since the words are quite short, but for small N it is anyone's guess whether seeking around like that will actually be faster than sequential access.

It has the following characteristics:

  • You do not need to hold the dictionary in memory, only the text.
  • Even so, you make only one pass over the dictionary file.
  • You do not do any expensive processing of the dictionary.
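Here is a C# sketch of that merge-style pass, assuming the dictionary file is already sorted in the same ordinal order as the text words, and that the text words have been de-duplicated and sorted first (the file name and method names are placeholders):

using System.Collections.Generic;
using System.IO;

static class MergeCheck
{
    // Walks the sorted text words and the sorted dictionary file side by side,
    // yielding every text word that is missing from the dictionary.
    public static IEnumerable<string> MisspelledWords(IEnumerable<string> sortedTextWords,
                                                      string sortedDictionaryPath)
    {
        using (IEnumerator<string> dict = File.ReadLines(sortedDictionaryPath).GetEnumerator())
        {
            bool dictHasMore = dict.MoveNext();
            foreach (string word in sortedTextWords)
            {
                // Advance the dictionary until it catches up with the current text word.
                while (dictHasMore && string.CompareOrdinal(dict.Current, word) < 0)
                    dictHasMore = dict.MoveNext();

                if (!dictHasMore || string.CompareOrdinal(dict.Current, word) != 0)
                    yield return word;   // not found in the dictionary
            }
        }
    }
}

Only the text is held in memory, and the dictionary file is streamed once, which is exactly the set of properties listed above.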

Of course, if the dictionary file is not pre-sorted, this will not work. And if you can keep the dictionary in memory for the next spell-check operation, you can amortize the cost of the I/O and of processing it into a tree across several different texts, which will eventually be a win.

If the dictionary is really huge, then it might pay to store it on disk in a pre-processed form equivalent to an unbalanced tree, weighted according to the relative frequencies of the different words in your language. Then you can do fewer than O(N) disk accesses for small texts, and on most operating systems you need not load it into memory at all: just mmap the file and let the OS worry about it. For a large dictionary, the clusters containing words beginning with "dimethyl" need never be touched.

Another thing to consider is a splay tree for the dictionary. A splay tree unbalances itself as you look things up, so that frequently used values are found quickly. Most text uses a fairly small vocabulary over and over, so if the text is long enough to justify the overhead, this will win in the end.

Both of the above are subject to Stephen Lowe's point that, for strings, a trie is superior to an ordinary tree. I don't know whether you can find a ready-made splay trie, though.

0
source

Seeing that this is a homework question, I assume you have to use a plain old binary search tree (no red-black trees, AVL trees, radix trees, etc.). In that case the answer is to try to keep the tree balanced while you build it from the list of words. One approach is to shuffle the list before inserting it, which gives reasonable results. But you can do better by sorting the input (using the same comparison the tree uses) and then recursively splitting it, inserting the midpoint of each range first, until no elements are left. The result is a balanced tree.

I knocked together three different ways to do this in C#:

private static IEnumerable<T> BinaryTreeOrder<T>(IList<T> range, int first, int last)
{
    // Yields the elements of the sorted range in the order they should be inserted.
    if (first > last) { yield break; }
    int mid = (first + last) / 2;
    yield return range[mid];
    foreach (var item in BinaryTreeOrder(range, first, mid - 1)) { yield return item; }
    foreach (var item in BinaryTreeOrder(range, mid + 1, last)) { yield return item; }
}

private static void BinaryTreeOrder<T>(IList<T> range, int first, int last, ref IList<T> outList)
{
    // Appends the insertion order to an output list instead of yielding it.
    if (first > last) { return; }
    int mid = (first + last) / 2;
    outList.Add(range[mid]);
    BinaryTreeOrder(range, first, mid - 1, ref outList);
    BinaryTreeOrder(range, mid + 1, last, ref outList);
}

private static void BinaryTreeOrder<T>(IList<T> range, int first, int last, ref BinaryTree<T> tree) where T : IComparable<T>
{
    // Inserts the midpoints directly into the tree as it recurses.
    if (first > last) { return; }
    int mid = (first + last) / 2;
    tree.Add(range[mid]);
    BinaryTreeOrder(range, first, mid - 1, ref tree);
    BinaryTreeOrder(range, mid + 1, last, ref tree);
}
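For example, the third overload could be driven like this (BinaryTree<T> is assumed to be your own tree class with an Add method, and "dictionary.txt" is just a placeholder):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical usage: sort the words once, then insert midpoints recursively
// so the resulting plain binary search tree comes out balanced.
List<string> words = File.ReadAllLines("dictionary.txt")
    .Distinct()
    .OrderBy(w => w, StringComparer.Ordinal)
    .ToList();

var tree = new BinaryTree<string>();
BinaryTreeOrder(words, 0, words.Count - 1, ref tree);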
0
source

As others have said, a trie will be more efficient than a binary tree, but you could also use a hash map and hash each word. You have a small dictionary (1000 entries). As you go through your document, check whether each word is in the hash map. If it is not, the word is considered misspelled.

This will not give you possible corrections for a misspelled word; it just tells you yes or no (spelled correctly or not).
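In C#, the built-in HashSet<string> already gives you this; here is a minimal sketch along those lines (the file names and separator list are placeholders):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class HashSetSpellCheck
{
    static void Main()
    {
        // Load the ~1000-word dictionary into a hash set.
        var dictionary = new HashSet<string>(
            File.ReadLines("dictionary.txt").Select(w => w.ToLowerInvariant()));

        // Walk the document word by word and flag anything not in the set.
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        string document = File.ReadAllText("document.txt");
        foreach (string word in document.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (!dictionary.Contains(word.ToLowerInvariant()))
                Console.WriteLine("Possibly misspelled: " + word);
        }
    }
}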

If you want spelling suggestions for incorrect words, start from the word in the file, then generate all words at edit distance 1 and add them as children of the original word, so you are building a graph. Go 2 levels deep for a good balance of speed and accuracy. If a generated word node is in the dictionary, add it to the list of possible suggestions. At the end, return the list of possible suggestions.
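A sketch of the edit-distance-1 generation step in C# (the alphabet and the names here are my own choices; the author's actual library is linked below):

using System.Collections.Generic;

static class EditCandidates
{
    private const string Alphabet = "abcdefghijklmnopqrstuvwxyz";

    // Every string one deletion, substitution, or insertion away from the input.
    // Duplicates (and the word itself) can appear; filter them out if needed.
    public static IEnumerable<string> OneEditAway(string word)
    {
        for (int i = 0; i < word.Length; i++)
        {
            yield return word.Remove(i, 1);                               // deletion
            foreach (char c in Alphabet)
                yield return word.Remove(i, 1).Insert(i, c.ToString());   // substitution
        }
        for (int i = 0; i <= word.Length; i++)
            foreach (char c in Alphabet)
                yield return word.Insert(i, c.ToString());                // insertion
    }
}

The second level is just OneEditAway applied to each first-level result, keeping only candidates that actually appear in the dictionary.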

For a better spell check, also try adding phonetic matching.

sea yuh → see yah

This method (building graphs of strings at edit distance 1) is "slow", but it is a good academic exercise. Runtime is O(n^branches).

If anyone here is interested, I built one myself (for fun): https://github.com/eamocanu/spellcheck.graph

Some examples of graphs: https://github.com/eamocanu/spellcheck.graph/tree/master/graph%20photos

I also added a user interface component to it that draws the graphs; that component is an external library.

0
source
