I am trying to find long approximate substrings in a large database. For example, a query may be a substring of 1000 characters that differs from the stored text by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams can do this, but I don't know the implementation details. I have also heard that Lucene could do this, but is Lucene's Levenshtein algorithm fast enough for hundreds of edits? Perhaps something from the world of plagiarism detection? Any advice is appreciated.
Q-grams may be one approach, but there are others, such as BLAST and BLASTP, which are used for protein and nucleotide matching, etc.
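To give a rough idea of how a q-gram filter could look, here is a minimal sketch in Python. The q-gram length `Q`, the document-level inverted index, and the counting threshold are my own assumptions; the threshold is a heuristic derived from the q-gram lemma, not a tight bound.

```python
from collections import defaultdict

Q = 4  # q-gram length -- an assumption, tune for your alphabet and data

def qgrams(text, q=Q):
    return [text[i:i + q] for i in range(len(text) - q + 1)]

def build_index(documents):
    """Inverted index mapping each q-gram to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in qgrams(text):
            index[gram].add(doc_id)
    return index

def candidate_docs(query, index, max_edits):
    """Coarse filter: count how many of the query's q-grams each document shares.
    By the q-gram lemma, a match within max_edits edits can destroy at most
    max_edits * Q of the query's q-grams, so anything below the threshold is dropped."""
    counts = defaultdict(int)
    for gram in set(qgrams(query)):
        for doc_id in index.get(gram, ()):
            counts[doc_id] += 1
    threshold = max(1, len(qgrams(query)) - max_edits * Q)
    return [doc_id for doc_id, n in counts.items() if n >= threshold]
```

One caveat for your numbers: with a 1000-character query and several hundred edits, the threshold `len(qgrams(query)) - max_edits * Q` can drop to zero, at which point the filter stops discarding anything, so Q and the edit budget have to be balanced against each other.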
The Simmetrics library is an extensive collection of string-based similarity metrics.
Lucene doesn't seem to be the right tool here. In addition to Mikos's fine suggestions, I have heard of AGREP, FASTA, and locality-sensitive hashing (LSH). I believe an effective method should first drastically reduce the search space, and only then apply a more expensive evaluation to the remaining candidates.
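For that second, more expensive phase, a banded edit-distance check is one option: since you only care whether a candidate is within some edit budget k, cells of the dynamic-programming matrix more than k away from the diagonal can be skipped, and a candidate can be abandoned as soon as an entire row exceeds k. A minimal sketch, assuming a fixed edit budget (the function name and early-abandon strategy are my own, not a specific library's API):

```python
def within_edit_distance(a, b, k):
    """Return True if Levenshtein(a, b) <= k, computing only a band of width k
    around the diagonal, so the cost is O(k * min(len(a), len(b)))."""
    if abs(len(a) - len(b)) > k:
        return False  # length difference alone already exceeds the budget
    INF = k + 1  # sentinel for cells outside the band
    prev = list(range(len(b) + 1))  # row 0 of the DP matrix
    for i in range(1, len(a) + 1):
        lo, hi = max(1, i - k), min(len(b), i + k)
        curr = [INF] * (len(b) + 1)
        curr[0] = i if i <= k else INF
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        if min(curr) > k:
            return False  # every cell in this row exceeds k: abandon early
        prev = curr
    return prev[len(b)] <= k
```

Each candidate that survives the coarse filter would then be checked with something like `within_edit_distance(query, candidate_region, max_edits)`, so the quadratic-ish work is only paid on a small fraction of the database.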