As js441 noted, Apache Lucene is a good option, but only if you intend to do a term-based search, similar to how Google works. If you need to search for arbitrary strings that cover the terms Lucene, this will not help you.
In a later case, you are right, you need to build some kind of suffix tree. The clean trick you can do after you create the suffix tree is to write it to a file and paste it into memory space. This way you will not waste memory storing the entire tree in RAM, but you will often access parts of the tree that are automatically cached. The disadvantage of mmap is that the initial search may be somewhat slow. Also it will not be if your files change frequently.
To cope with the search for files that you just edited, you can save two indexes, one for most of your files, and the other only for recently edited files. Therefore, when you perform a search, you will search in both indicators. Periodically, you must rebuild the constant index with the contents of the new files and replace the old one.
Here are some examples of when Lucene is good and when the suffix tree is good:
Suppose you have a document that contains the following:
A quick brown dog jumped over a lazy fox.
Lucene is suitable for the following searches:
- fast
- quick brown
- d *
q * b
With some tricks, you can perform the following searches:
'* ick * own'
This type of search will work very slowly.
'q * ick brown d * g'
And this type of search will never find anything
"ick brown d"
Lucene is also good when you process your documents as word bags. This way you can easily search, for example,
Fast fox
Which will find you all documents in which words are fast and fox, regardless of what is in the middle.
Suffix trees, on the other hand, work well for finding exact matches of substrings within a document, even when your search spans terms and starts and ends in the middle of a term.
A very good algorithm for constructing suffix trees of large arrays is described here (Warnign paywalled).
Vlad
source share