Very fast retrieval of fuzzy rows from the database

I have a database of ~150,000 words and a pattern (any single word), and I want to retrieve from the database all words whose Damerau-Levenshtein distance to the pattern is less than a given number. I need to do this very quickly. What algorithm would you suggest? If there is no good algorithm for the Damerau-Levenshtein distance, plain Levenshtein distance would also be welcome.

Thank you for your help.

PS I will not use SOUNDEX.

+4
5 answers

I would start with an SQL function to calculate the Levenshtein distance (in T-SQL or .NET) (yes, I am an MS person...), with a maximum-distance parameter that allows early termination.

You could then use this function to compare your input with each row, checking the distance and moving on to the next row if it exceeds the threshold.
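A minimal sketch of the early-termination idea (in Python rather than T-SQL, purely for illustration): every entry in a row of the DP table is a lower bound on the final distance along some alignment path, so once all entries of the current row exceed the maximum distance, the final distance must too and we can stop.

```python
def levenshtein_bounded(a: str, b: str, max_dist: int) -> int:
    """Levenshtein distance between a and b, returning max_dist + 1
    as soon as the distance is known to exceed max_dist."""
    # The length difference alone is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_dist:
        return max_dist + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        # Every path to the final cell passes through this row, and costs
        # only grow along a path, so if the whole row exceeds the bound
        # the final distance does too: terminate early.
        if min(curr) > max_dist:
            return max_dist + 1
        prev = curr
    return prev[-1]
```

The same structure carries over to a T-SQL or .NET implementation; the row-minimum check is what makes the maximum-distance parameter pay off.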

I was also thinking that you could, for example, set the maximum distance to 2 and then pre-filter out all words whose length differs by more than 1 when the first letter is different. With an index, this could be a little faster.

You could also shortcut by returning all rows that are exact matches first (an index will speed this up), since actually computing a Levenshtein distance of 0 takes longer than an indexed equality lookup.
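A minimal sketch combining both shortcuts (Python, with a plain DP function standing in for the SQL one): an exact match is returned without running the DP at all, and words whose length differs from the pattern by more than the threshold are skipped, since the length difference is a lower bound on the distance.

```python
from typing import List

def levenshtein(a: str, b: str) -> int:
    # Plain dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def fuzzy_lookup(pattern: str, words: List[str], max_dist: int) -> List[str]:
    results = []
    for w in words:
        if w == pattern:                            # exact match: skip the DP
            results.append(w)
            continue
        if abs(len(w) - len(pattern)) > max_dist:   # length pre-filter
            continue
        if levenshtein(w, pattern) <= max_dist:
            results.append(w)
    return results
```

In a database the exact-match and length checks would be indexed `WHERE` clauses, so the expensive distance function only runs on the survivors.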

Just thoughts...

+2

I don't think you can compute this function without actually scanning all the rows.
So the options are:

  • Make the scan very fast (but that doesn't really scale)
  • Filter the candidates somehow (an index by letter, at least for the common letters)
  • Use an alternative (indexable) algorithm such as n-grams (though I have no data on the quality of n-gram results compared to the DL distance).
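A rough sketch of the n-gram idea (Python; the `$` padding characters and the trigram size are arbitrary illustrative choices): build an inverted index from trigrams to words, then collect candidates that share at least one trigram with the pattern. Candidates still have to be verified with a real distance computation afterwards.

```python
from collections import defaultdict

def build_trigram_index(words):
    """Map each trigram (with $-padding at the ends) to the words containing it."""
    index = defaultdict(set)
    for w in words:
        padded = f"$${w}$$"
        for i in range(len(padded) - 2):
            index[padded[i:i + 3]].add(w)
    return index

def candidates(pattern, index, min_shared=1):
    """Words sharing at least `min_shared` trigrams with the pattern."""
    padded = f"$${pattern}$$"
    counts = defaultdict(int)
    for i in range(len(padded) - 2):
        for w in index.get(padded[i:i + 3], ()):
            counts[w] += 1
    return {w for w, c in counts.items() if c >= min_shared}
```

The index lookup is what makes this approach database-friendly: the trigram table can be indexed, so the candidate set is retrieved without touching most rows.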
+1

The solution off the top of my head might be to store the database in a sorted set (e.g. std::set in C++), since it seems to me that lexicographically sorted strings will compare well. To approximate the position of a given string in the set, use std::upper_bound, then iterate outward from the found position in both directions, calculating the distance along the way and stopping once it rises above a certain threshold. I have a feeling this solution will probably only match strings with the same starting characters, but if you are using it for a spell-checking algorithm, that restriction is common, or at least not surprising.
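A sketch of that idea in Python (using `bisect` in place of `std::upper_bound`, with an inline Levenshtein function; the stopping rule is the one described above, with the same prefix-only limitation):

```python
import bisect

def levenshtein(a, b):
    # Plain dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def neighbors_in_sorted(pattern, sorted_words, max_dist):
    """Walk outward from the pattern's position in a sorted list,
    stopping in each direction once the distance exceeds the threshold."""
    pos = bisect.bisect_left(sorted_words, pattern)
    results = []
    for i in range(pos, len(sorted_words)):       # scan forward
        if levenshtein(sorted_words[i], pattern) > max_dist:
            break
        results.append(sorted_words[i])
    for i in range(pos - 1, -1, -1):              # scan backward
        if levenshtein(sorted_words[i], pattern) > max_dist:
            break
        results.append(sorted_words[i])
    return results
```

Note the distance is not monotone in sort order, so the early break can miss matches further away; that is the trade-off this answer accepts for speed.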

Edit: However, if you are looking to optimize the algorithm itself, this answer is not relevant.

0

I used KNIME for fuzzy string matching and got very fast results. It is also very easy to create visual workflows in it. Just install the free version of KNIME from https://www.knime.org/ , then use the "String Distance" and "Similarity Search" nodes to get your results. I have attached a small working workflow with a fuzzy match (the input comes in from the top and the search patterns from the bottom in this case): [screenshot of the workflow]

0

I would recommend taking a look at Ankiro.

I am not sure if it meets your accuracy requirements, but it is fast.

-1
