A fast substring search algorithm for an IDE-like program with tens of thousands of very large files

I am developing something similar to an IDE that will process tens of thousands of very large (text) files, and I am researching the current state of the art on this topic.

As an example, IntelliJ's search for plain (non-regex) strings is pretty close to what I need. How do they achieve this? Do they keep some kind of suffix tree of all files so that searches can run in memory? Do they hold a significant portion of the file contents in memory and simply run standard KMP almost entirely in memory to avoid any disk I/O?

thanks

+7
java algorithm intellij-idea
3 answers

Currently, IntelliJ IDEA indexes the files in the project and remembers which trigrams (sequences of 3 letters or digits) occur in each file. When searching, it breaks the query into trigrams, retrieves from the index the files that contain all of these trigrams, intersects those sets, and then runs a relatively simple text search over each of those files to check whether they really contain the entire search string.
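
Something like this minimal sketch captures the idea (this is not IntelliJ's actual code; the class and method names are made up for illustration, and the real implementation persists the index on disk and updates it incrementally):

    import java.util.*;

    // Hypothetical trigram index: maps each trigram to the set of file IDs
    // whose contents contain it.
    class TrigramIndex {
        private final Map<String, Set<Integer>> trigramToFiles = new HashMap<>();
        private final Map<Integer, String> fileContents = new HashMap<>();

        // Index one file: record every 3-character window of its text.
        void addFile(int fileId, String text) {
            fileContents.put(fileId, text);
            for (int i = 0; i + 3 <= text.length(); i++) {
                trigramToFiles.computeIfAbsent(text.substring(i, i + 3), k -> new HashSet<>()).add(fileId);
            }
        }

        // Search: intersect the candidate sets for every trigram of the query,
        // then verify each surviving candidate with a plain substring scan.
        List<Integer> search(String query) {
            List<Integer> result = new ArrayList<>();
            if (query.length() < 3) {
                // Too short for trigram filtering; fall back to scanning everything.
                for (Map.Entry<Integer, String> e : fileContents.entrySet()) {
                    if (e.getValue().contains(query)) result.add(e.getKey());
                }
                return result;
            }
            Set<Integer> candidates = null;
            for (int i = 0; i + 3 <= query.length(); i++) {
                Set<Integer> files = trigramToFiles.getOrDefault(query.substring(i, i + 3), Collections.emptySet());
                if (candidates == null) candidates = new HashSet<>(files);
                else candidates.retainAll(files);
                if (candidates.isEmpty()) return result;
            }
            for (int fileId : candidates) {
                if (fileContents.get(fileId).contains(query)) result.add(fileId);
            }
            return result;
        }
    }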

+7

You could look at Apache Lucene. It is a text search library written entirely in Java. It may be a little too heavyweight for your use case, but since it is open source, you can see how it works.

It has a demo that allows you to create an index and search through the source code of the library, which sounds about the same as what you want to do.
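
Indexing and searching with Lucene looks roughly like this sketch (exact API details vary between Lucene versions, and the field names and paths here are placeholders):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.*;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Directory dir = FSDirectory.open(Paths.get("index-dir"));

            // Index one "file" as a document with a stored path and a full-text field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new StringField("path", "example.txt", Field.Store.YES));
                doc.add(new TextField("contents", "A quick brown dog jumped over a lazy fox.", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Term-based search over the indexed contents.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("contents", analyzer).parse("quick fox");
                TopDocs hits = searcher.search(query, 10);
                for (ScoreDoc sd : hits.scoreDocs) {
                    System.out.println(searcher.doc(sd.doc).get("path"));
                }
            }
        }
    }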

Also, take a look at the Boyer-Moore string search algorithm. It is apparently commonly used in applications that offer Ctrl+F style document search. It preprocesses the search term so that the search can skip ahead and perform as few character comparisons as possible.
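
A compact sketch of the idea, using the simplified Boyer-Moore-Horspool variant (only the bad-character shift; full Boyer-Moore adds a good-suffix rule on top):

    import java.util.Arrays;

    public class HorspoolSearch {
        // Returns the index of the first occurrence of pattern in text, or -1.
        static int indexOf(String text, String pattern) {
            int n = text.length(), m = pattern.length();
            if (m == 0) return 0;
            if (m > n) return -1;

            // Shift table: by default skip the whole pattern length; characters that
            // occur in the pattern (except at its last position) allow a smaller shift.
            int[] shift = new int[Character.MAX_VALUE + 1];
            Arrays.fill(shift, m);
            for (int i = 0; i < m - 1; i++) {
                shift[pattern.charAt(i)] = m - 1 - i;
            }

            int pos = 0;
            while (pos <= n - m) {
                // Compare the pattern against the current window from right to left.
                int j = m - 1;
                while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) j--;
                if (j < 0) return pos;                      // full match
                pos += shift[text.charAt(pos + m - 1)];     // skip ahead using the table
            }
            return -1;
        }

        public static void main(String[] args) {
            System.out.println(indexOf("A quick brown dog jumped over a lazy fox.", "brown d")); // prints 8
        }
    }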

+1

As js441 noted, Apache Lucene is a good option, but only if you intend to do term-based search, similar to how Google works. If you need to search for arbitrary substrings that span term boundaries, Lucene will not help you.

In the latter case you are right, you need to build some kind of suffix tree. A neat trick you can use after you have built the suffix tree is to write it to a file and mmap it into your address space. That way you do not waste memory keeping the entire tree in RAM, while the parts of the tree you access frequently are cached automatically. The disadvantage of mmap is that the initial searches may be somewhat slow. Also, it will not help if your files change frequently.
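
In Java the mmap trick looks roughly like this (the file name and record layout are placeholders, and a file larger than 2 GB would have to be mapped in several chunks):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MmapExample {
        public static void main(String[] args) throws IOException {
            Path indexFile = Path.of("suffix-tree.bin");   // hypothetical serialized tree
            try (FileChannel channel = FileChannel.open(indexFile, StandardOpenOption.READ)) {
                // Map the whole file into the address space; the OS pages parts of it
                // in and out on demand, so frequently used regions stay cached in RAM.
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                // Example: read a (hypothetical) node record at some offset.
                int rootChildCount = buf.getInt(0);
                System.out.println("children of root: " + rootChildCount);
            }
        }
    }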

To cope with searching the files you have just edited, you can keep two indexes: one for the bulk of your files and another only for the recently edited files. When you perform a search, you query both indexes. Periodically you rebuild the permanent index with the contents of the new files and replace the old one.
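
A sketch of that two-index arrangement (the interface and class names are invented for illustration):

    import java.util.*;

    // Minimal interface for any substring index (suffix tree, trigram index, ...).
    interface SubstringIndex {
        Set<String> search(String query);      // returns matching file paths
    }

    // Combines a large, rarely rebuilt index with a small index of recently edited files.
    class TwoLevelIndex {
        private SubstringIndex mainIndex;      // covers the bulk of the files
        private SubstringIndex recentIndex;    // covers only files edited since the last rebuild
        private final Set<String> recentlyEdited = new HashSet<>();

        TwoLevelIndex(SubstringIndex mainIndex, SubstringIndex recentIndex) {
            this.mainIndex = mainIndex;
            this.recentIndex = recentIndex;
        }

        void markEdited(String path) { recentlyEdited.add(path); }

        // Query both indexes; hits from the main index are stale for files edited
        // since it was built, so the recent index takes precedence for those.
        Set<String> search(String query) {
            Set<String> results = new HashSet<>(recentIndex.search(query));
            for (String path : mainIndex.search(query)) {
                if (!recentlyEdited.contains(path)) results.add(path);
            }
            return results;
        }

        // Periodically: swap in a freshly rebuilt main index and start over with an
        // empty index for recent edits.
        void swapMainIndex(SubstringIndex rebuiltIndex, SubstringIndex emptyRecentIndex) {
            this.mainIndex = rebuiltIndex;
            this.recentIndex = emptyRecentIndex;
            this.recentlyEdited.clear();
        }
    }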

Here are some examples of when Lucene is good and when a suffix tree is good:

Suppose you have a document that contains the following:

A quick brown dog jumped over a lazy fox.

Lucene is suitable for the following searches:

  • quick
  • quick brown
  • d*
  • q* b

With some tricks, you can also perform the following searches, although they will run very slowly:

  • '*ick *own'
  • 'q*ick brown d*g'

And this type of search will never find anything:

  • "ick brown d"

Lucene is also good when you treat your documents as bags of words. For example, you can easily search for:

  • quick fox

which will find all documents containing the words quick and fox, regardless of what is between them.

Suffix trees, on the other hand, work well for finding exact substring matches within a document, even when your search spans multiple terms and starts and ends in the middle of a term.

A very good algorithm for constructing suffix trees for large inputs is described here (warning: paywalled).
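
As a toy illustration of exact substring search that crosses term boundaries, here is a naive suffix array, a simpler relative of the suffix tree (the quadratic construction below is only for demonstration and would not scale to very large files):

    import java.util.*;

    public class NaiveSuffixArraySearch {
        private final String text;
        private final Integer[] suffixes;   // suffix start positions, sorted lexicographically

        NaiveSuffixArraySearch(String text) {
            this.text = text;
            this.suffixes = new Integer[text.length()];
            for (int i = 0; i < text.length(); i++) suffixes[i] = i;
            // Naive construction: sort the start positions by comparing the suffixes as strings.
            Arrays.sort(suffixes, Comparator.comparing(text::substring));
        }

        // Binary search for any suffix that starts with the query.
        boolean contains(String query) {
            int lo = 0, hi = suffixes.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) / 2;
                int start = suffixes[mid];
                int end = Math.min(start + query.length(), text.length());
                int cmp = text.substring(start, end).compareTo(query);
                if (cmp == 0) return true;
                if (cmp < 0) lo = mid + 1; else hi = mid - 1;
            }
            return false;
        }

        public static void main(String[] args) {
            NaiveSuffixArraySearch idx = new NaiveSuffixArraySearch("A quick brown dog jumped over a lazy fox.");
            System.out.println(idx.contains("ick brown d"));  // true: the match spans term boundaries
            System.out.println(idx.contains("brown fox"));    // false
        }
    }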

0
