I am trying to find long approximate substrings in a large database. For example, a query may be a substring of 1000 characters that differs from the stored text by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams can do this, but I don't know the implementation details. I have also heard that Lucene could do this, but is Lucene's Levenshtein algorithm fast enough for hundreds of edits? Perhaps something from the world of plagiarism detection? Any advice is appreciated.
Q-grams may be one approach, but there are others, such as BLAST and BLASTP, which are used for protein and nucleotide matching, etc.
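To give a rough idea of how a q-gram filter could look, here is a minimal sketch in Python. The q-gram length `Q`, the document-level inverted index, and the counting threshold are my own assumptions; the threshold is a heuristic derived from the q-gram lemma, not a tight bound.

```python
from collections import defaultdict

Q = 4  # q-gram length -- an assumption, tune for your alphabet and data

def qgrams(text, q=Q):
    return [text[i:i + q] for i in range(len(text) - q + 1)]

def build_index(documents):
    """Inverted index mapping each q-gram to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in qgrams(text):
            index[gram].add(doc_id)
    return index

def candidate_docs(query, index, max_edits):
    """Coarse filter: count how many of the query's q-grams each document shares.
    By the q-gram lemma, a match within max_edits edits can destroy at most
    max_edits * Q of the query's q-grams, so anything below the threshold is dropped."""
    counts = defaultdict(int)
    for gram in set(qgrams(query)):
        for doc_id in index.get(gram, ()):
            counts[doc_id] += 1
    threshold = max(1, len(qgrams(query)) - max_edits * Q)
    return [doc_id for doc_id, n in counts.items() if n >= threshold]
```

One caveat for your numbers: with a 1000-character query and several hundred edits, the threshold `len(qgrams(query)) - max_edits * Q` can drop to zero, at which point the filter stops discarding anything, so Q and the edit budget have to be balanced against each other.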
The Simmetrics library is an extensive collection of string-based similarity metrics.
Lucene doesn't seem to be the right tool here. In addition to Mikos's fine suggestions, I have heard of AGREP, FASTA, and locality-sensitive hashing (LSH). I believe an effective method should first drastically reduce the search space, and only then apply a more expensive evaluation to the remaining candidates.
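For that second, more expensive phase, a banded edit-distance check is one option: since you only care whether a candidate is within some edit budget k, cells of the dynamic-programming matrix more than k away from the diagonal can be skipped, and a candidate can be abandoned as soon as an entire row exceeds k. A minimal sketch, assuming a fixed edit budget (the function name and early-abandon strategy are my own, not a specific library's API):

```python
def within_edit_distance(a, b, k):
    """Return True if Levenshtein(a, b) <= k, computing only a band of width k
    around the diagonal, so the cost is O(k * min(len(a), len(b)))."""
    if abs(len(a) - len(b)) > k:
        return False  # length difference alone already exceeds the budget
    INF = k + 1  # sentinel for cells outside the band
    prev = list(range(len(b) + 1))  # row 0 of the DP matrix
    for i in range(1, len(a) + 1):
        lo, hi = max(1, i - k), min(len(b), i + k)
        curr = [INF] * (len(b) + 1)
        curr[0] = i if i <= k else INF
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        if min(curr) > k:
            return False  # every cell in this row exceeds k: abandon early
        prev = curr
    return prev[len(b)] <= k
```

Each candidate that survives the coarse filter would then be checked with something like `within_edit_distance(query, candidate_region, max_edits)`, so the quadratic-ish work is only paid on a small fraction of the database.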