In the example you give, performance is most likely irrelevant: on a PC the whole operation will take about 1% of the time it takes the user just to read the first result you display, provided you aren't using a completely brain-dead algorithm. Nevertheless, I'll assume the problem is big enough that performance actually matters.
If the dictionary file is pre-sorted (as most are), and the text is small relative to the dictionary as you describe, then I'd be very tempted to sort the text, optionally remove duplicates, and then walk both lists side by side, using the same procedure as the merge step of merge sort, except that instead of emitting a merged list you report whether each text word is in the dictionary.
This performs M log M comparisons for the sort, plus at most N + M comparisons for the walk (possibly fewer, but not asymptotically fewer). That's fairly close to optimal complexity for a one-shot operation: to get rid of the linear term in N, you'd have to find a way not to read the whole dictionary from disk at all. I'm pretty sure it's possible to bsearch into the file, especially since the words are quite short, but for small N it's anyone's guess whether seeking around will actually be faster than accessing the data sequentially.
It has the following characteristics:
- You don't need to hold the dictionary in memory, only the text.
- Despite that, you make only one pass over the dictionary file.
- You do no expensive preprocessing of the dictionary.
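A minimal sketch of that merge-style walk, assuming one lowercase word per line in the dictionary file (the function name and file handling here are just for illustration, not anything from the question):

```cpp
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

// Report which text words are absent from a pre-sorted dictionary file,
// using one merge-style pass over the file.
std::vector<std::string> misspelled(std::vector<std::string> text_words,
                                    const std::string& dict_path) {
    // Sort the text words and drop duplicates: O(M log M).
    std::sort(text_words.begin(), text_words.end());
    text_words.erase(std::unique(text_words.begin(), text_words.end()),
                     text_words.end());

    std::vector<std::string> missing;
    std::ifstream dict(dict_path);
    std::string dict_word;
    std::size_t i = 0;

    // Walk both sorted sequences side by side, as in the merge step of
    // merge sort: at most N + M comparisons, one pass over the dictionary.
    while (i < text_words.size() && std::getline(dict, dict_word)) {
        while (i < text_words.size() && text_words[i] < dict_word)
            missing.push_back(text_words[i++]);   // can no longer appear later
        if (i < text_words.size() && text_words[i] == dict_word)
            ++i;                                  // found in the dictionary
    }
    // Any text words past the end of the dictionary are also missing.
    missing.insert(missing.end(), text_words.begin() + i, text_words.end());
    return missing;
}
```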
Of course this doesn't work if the dictionary file isn't pre-sorted; and if you can keep the dictionary in memory for the next spell-check operation, then you can amortise the cost of the I/O and of building it into a tree over several different texts, which will be a win in the end.
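If you can keep it resident, that version is straightforward; a sketch using std::set as the tree (load the file once, reuse the structure for every text):

```cpp
#include <fstream>
#include <set>
#include <string>
#include <vector>

// Pay the O(N log N) load and the disk I/O once, then reuse the tree
// for every text you check afterwards.
std::set<std::string> load_dictionary(const std::string& dict_path) {
    std::set<std::string> dict;
    std::ifstream in(dict_path);
    std::string word;
    while (in >> word) dict.insert(word);
    return dict;
}

std::vector<std::string> misspelled_in(const std::set<std::string>& dict,
                                       const std::vector<std::string>& text_words) {
    std::vector<std::string> missing;
    for (const auto& w : text_words)
        if (dict.count(w) == 0) missing.push_back(w);   // O(log N) per word
    return missing;
}
```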
If the dictionary really is huge, it may pay to store it on disk in a pre-processed form equivalent to an unbalanced tree, weighted according to the relative frequencies of the different words in your language. Then you can do less than O(N) disk access for small texts, and on most operating systems never even load it all into memory: just mmap the file and let the OS worry about it. For a large dictionary, the clusters containing words beginning with "dimethyl" need never be touched at all.
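I won't try to sketch the frequency-weighted tree layout here, but the mmap idea is easy to illustrate with a simpler on-disk format: a sorted file of fixed-width, NUL-padded records, binary searched in place. The 32-byte record width and the format itself are assumptions made purely for the example:

```cpp
#include <cstring>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Binary search a dictionary that was pre-processed into sorted, fixed-width,
// NUL-padded records. With mmap the OS only pages in the records the search
// actually touches; the rest of the file never leaves disk.
constexpr std::size_t kRecordWidth = 32;   // assumed maximum word length + padding

bool in_dictionary(const std::string& word, const char* dict_path) {
    int fd = open(dict_path, O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return false;
    const char* data = static_cast<const char*>(p);

    std::size_t lo = 0, hi = st.st_size / kRecordWidth;   // half-open [lo, hi)
    bool found = false;
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        int cmp = std::strncmp(word.c_str(), data + mid * kRecordWidth, kRecordWidth);
        if (cmp == 0) { found = true; break; }
        if (cmp < 0) hi = mid; else lo = mid + 1;
    }
    munmap(p, st.st_size);
    return found;
}
```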
Another consideration is a splay tree for the dictionary. A splay tree unbalances itself as you look things up in it, so that frequently looked-up values are found quickly. Most texts repeat a fairly small set of words many times, so if the text is long enough to justify the overhead, a splay tree will win in the end.
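For what it's worth, a classic top-down splay (Sleator and Tarjan's scheme) adapted to string keys looks roughly like this; treat it as a sketch rather than production code. Insertion, omitted for brevity, splays first and then makes the new node the root.

```cpp
#include <string>

struct Node {
    std::string key;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Classic top-down splay: after the call, the node with `key` (or the last
// node touched while searching for it) is the root, so repeated lookups of
// common words become cheap.
Node* splay(Node* t, const std::string& key) {
    if (!t) return nullptr;
    Node header;                              // temporary holder for the split trees
    Node* l = &header;
    Node* r = &header;
    for (;;) {
        if (key < t->key) {
            if (!t->left) break;
            if (key < t->left->key) {         // zig-zig: rotate right
                Node* y = t->left;
                t->left = y->right;
                y->right = t;
                t = y;
                if (!t->left) break;
            }
            r->left = t;                      // link right
            r = t;
            t = t->left;
        } else if (t->key < key) {
            if (!t->right) break;
            if (t->right->key < key) {        // zig-zig: rotate left
                Node* y = t->right;
                t->right = y->left;
                y->left = t;
                t = y;
                if (!t->right) break;
            }
            l->right = t;                     // link left
            l = t;
            t = t->right;
        } else {
            break;
        }
    }
    l->right = t->left;                       // reassemble the three pieces
    r->left = t->right;
    t->left = header.right;
    t->right = header.left;
    return t;
}

bool contains(Node*& root, const std::string& key) {
    root = splay(root, key);                  // lookup restructures the tree
    return root && root->key == key;
}
```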
Both of the above are subject to Stephen Lowe's point that, for strings, a trie is superior to a regular tree. I don't know whether you'll find a ready-made splay trie, though.
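A plain trie for membership testing is simple enough to roll yourself; this sketch assumes lowercase ASCII words only:

```cpp
#include <array>
#include <memory>
#include <string>

// Bare-bones trie for word membership; lookup cost depends on the word's
// length, not on the number of words in the dictionary.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;   // 'a'..'z'
    bool is_word = false;
};

void trie_insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& next = node->child[c - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
    node->is_word = true;
}

bool trie_contains(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (char c : word) {
        node = node->child[c - 'a'].get();
        if (!node) return false;
    }
    return node->is_word;
}
```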