I agree: in most cases a text search engine is the way to go. They are easy to set up and reliable. One more detail here: most engines do an OR search by default, so you need to specify explicitly that all words must match (an AND search).
If you need to build your own solution then, yes, you obviously need to build the mappings (an index from words to documents). I would use a hash lookup rather than a tree index, but your tree does not seem too big, so the performance gain would be slight. Still, I see no reason to use a tree: you do not need traversal features, because you will never look for the previous or next word.
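A minimal sketch of such a hash-based mapping, assuming everything fits in memory and that splitting on whitespace is good enough as tokenization. The name build_index and the sample documents are made up for illustration:

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# Hypothetical sample data: document id -> text.
docs = {
    1: "the pony runs",
    2: "the cat sleeps",
    3: "he rides the pony",
}
index = build_index(docs)

# A plain hash lookup; no ordering or traversal is needed.
# .get() avoids creating an empty entry for unknown words.
pony_docs = index.get("pony", set())
```

A dict gives average constant-time lookups, which is all this use case asks for. A tree would only pay off if you needed ordered or prefix queries.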
More interesting details come up once you look at how you will actually use the data structure. Take an example search: "The pony he comes". Intuitively, you would not start the search with "the", since probably every document contains it (assuming English text). "pony" is a good choice, and it lets you narrow the search quickly. Most text search engines keep a metric for this: how many documents contain a given word. Based on that, you start with the least frequent word and check the rest in order of increasing frequency.
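To make the ordering concrete, here is a rough sketch that uses the size of each word's document set as the frequency metric and intersects from the rarest word upward. The function name search_all and the tiny index are assumptions for the example, not taken from any particular engine:

```python
def search_all(index, query):
    """Return ids of documents containing every word of the query (AND search)."""
    words = set(query.lower().split())
    if not words:
        return set()
    # One set of document ids per query word; unknown words give an empty set.
    postings = [index.get(word, set()) for word in words]
    # Document frequency is the size of the set: rarest word first.
    postings.sort(key=len)
    result = set(postings[0])
    for doc_ids in postings[1:]:
        if not result:
            break  # nothing left to narrow down
        result &= doc_ids
    return result


# Hypothetical index: word -> ids of documents containing it.
index = {
    "the": {1, 2, 3, 4},
    "pony": {1, 3},
    "he": {3, 4},
    "comes": {3},
}
hits = search_all(index, "The pony he comes")
```

Starting with the smallest set keeps the intermediate result small, so the very common words are only ever compared against a handful of candidates.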
Once you have narrowed the search, you start to notice that the index no longer helps much: you still have the word "the" to check, and in your index it maps to a million documents. At this point it is better to use a reverse mapping, from document to words (again, hash lookup or trie), and check the few remaining documents to see whether they contain the remaining words.
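One possible way to combine the two mappings, as a sketch only: narrow down with the rarest words through the word-to-documents index, and once few enough candidates remain, verify the other words through the document-to-words mapping. The names search_two_maps and max_candidates, and the threshold value, are made up for illustration:

```python
def search_two_maps(index, doc_words, query, max_candidates=2):
    """AND search: narrow with rare words, then verify the rest per document."""
    words = set(query.lower().split())
    if not words:
        return set()
    # Rarest word first, judged by how many documents contain it.
    ordered = sorted(words, key=lambda w: len(index.get(w, ())))
    candidates = set(index.get(ordered[0], ()))
    remaining = ordered[1:]
    # Keep intersecting only while the candidate set is still large.
    while remaining and len(candidates) > max_candidates:
        candidates &= index.get(remaining.pop(0), set())
    # Few candidates left: look the remaining words up per document instead
    # of touching the huge document sets of very common words.
    return {d for d in candidates if all(w in doc_words[d] for w in remaining)}


# Hypothetical document -> words mapping, and the word -> documents index
# derived from it.
doc_words = {
    1: {"the", "pony", "runs"},
    2: {"the", "cat", "sleeps"},
    3: {"the", "pony", "he", "comes"},
    4: {"he", "comes", "home"},
}
index = {}
for doc_id, word_set in doc_words.items():
    for word in word_set:
        index.setdefault(word, set()).add(doc_id)

hits = search_two_maps(index, doc_words, "The pony he comes")
```

Where to switch from intersecting to per-document checks is a tuning decision; the fixed threshold here is just the simplest choice.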
Note: there are many solutions (how to store the mapping, single or double mapping, btree / hash / trie / ...), and which one fits depends on the scale of the project. Obviously you build something simple if you need to search a few files, and something else if you need to index all files on GitHub or search gene sequences, where even the index may not fit into memory.