Fast dynamic Fuzzy search over 100k + lines in C #

Question

Fast dynamic Fuzzy search over 100k + lines in C #

Suppose they are preloaded with stock characters entered in the text box. I am looking for code that I can copy, not a library to install.

This was inspired by this question:

Are there libraries of fuzzy search functions or string functions written for C #?

The Levenshtein distance algorithm works well, but it takes time to calculate. Are there any optimizations around the fact that the request should be rerun when the user enters an extra letter? I am interested in showing a maximum of 10 best matches for each entry.

+7

c # algorithm .net-4.0 fuzzy-search

Hamish grubijan May 12, '11 at 1:06

source share

2 answers

Kirk Broadhurst · Answer 1 · 2011-05-12T01:44:57+0000

You need to define the appropriate rules around your lines. What defines a "similar string"

number of matching characters
number of inappropriate characters
similar length
typos or phonetic errors
business abbreviations
must start with the same substring
must end with the same substring

I did quite a bit of work with string matching algorithms, and I have not yet found any existing library or code that matches my specific requirements. Browse through them, borrow ideas from them, but you will always have to tweak and write your own code.

Levenshtein's algorithm is good, but a bit slow. I had some success with Smith-Waterman and Jaro-Winkler algorithms, but the best I found for myself was Monge (from memory). However, he pays to read the original study and determine why they wrote their algorithms and their target data set.

If you incorrectly determined what you want to combine and measure, you will find high scores for unexpected matches and low scores in expected matches. String matching is domain specific. If you do not correctly define your domain, then you, as a fisherman without a hint, throw hooks and hope for the best.

Mike sokolov · Answer 2 · 2011-10-04T01:35:42+0000

This blog post describes the work that took place in Lucene in this area. They were able to implement remote coordination of Levenshtein distance using the Finite State Transducer (automaton) quite efficiently up to an editing distance of 2. Encoding is all in Java, although a bit complicated, although it is open source.

But the basic idea is quite simple: think of your dictionary as a giant tree of letter states. In state 0 you have no letters. In state 1, you allow any letter, which may be the first letter of the word. State 2 is due to state 1; if the first letter is "x", the next state allows only letters that can follow x (at position 2). and etc.

Now, to map Levenshtein, you traverse the letter tree, making some mistakes: deletion, insertion (single-character) and possibly permutation (a good extension of Levenshtein is permutation in the form of one edit, not 2), you must maintain the state to keep track of how many rights were allowed. This can be done very efficiently - of course, fast enough for an interactive “while you're typing” talking about spelling.

Fast dynamic Fuzzy search over 100k + lines in C #

More articles: