Levenshtein DFA in .NET

Good day,

Does anyone know about the "over-the-counter" Levanshtein implementation of DFA (deterministic finite state machines) in .NET (or is it easy to translate to it)? I have a very large dictionary with more than 160,000 different words, and I want, given the official word w, to find all known words at a Levenshtein distance of no more than 2 of w in an effective way.

Of course, having a function that calculates all possible changes at an editing distance, one of the given word and again applying it to each of these changes solves the problem (and quite simply). The problem is efficiency - given a 7-letter word, it may take 1 second already, and I need something much more efficient, if possible, as is the case with Levenshtein DFA, a decision that O (| w | )

Editing: I know that I can work a little on the problem, but at the moment I can not afford to read articles with a 60-page edition of Schultz and Mikhov.

Many thanks.

+6
performance finite-automata levenshtein distance automata
source share
6 answers

We implemented this for apache lucene java, maybe you can convert it to C # and save some time.

main class here: its just a builder to get Levenshtein DFAs from a string using the Schultz and Mihov algorithm.

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/LevenshteinAutomata.java

parametric descriptions (pre-computed tables) for Lev1 and Lev2 are here: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/ Lev1ParametricDescription.java

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java

you may have noticed that they were generated using a computer, we generated them using this script, using the implementation of wonderful mana by Jean-Phillipe Barrett (python) http://svn.apache.org/repos/asf/lucene/dev/trunk/ lucene / src / java / org / apache / lucene / util / automaton / createLevAutomata.py

we generate parametric descriptions as packed long [] arrays so that it does not make our jar file too large.

just change toAutomaton (int n) to suit your needs / DFA package. in our case, we use a modified form of the automata package for brics, where transitions are presented in the form of unified code codes.

effective unit tests are difficult for this kind of thing, but here's what we came up with ... it seems that they were thorough and even found a bug (which was immediately fixed by the author!) in the implementation of moman.

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/TestLevenshteinAutomata.java

+2
source share

Here you go.

/// <summary> /// Levenshtein Distance Calculator /// </summary> public static int DistanceFrom(this string s, string t) { int n = s.Length; int m = t.Length; int[,] d = new int[n + 1, m + 1]; // Step 1 if (n == 0) return m; if (m == 0) return n; // Step 2 for(int i = 0; i <= n; d[i, 0] = i++) ; for(int j = 0; j <= m; d[0, j] = j++) ; // Step 3 for (int i = 1; i <= n; i++) { //Step 4 for (int j = 1; j <= m; j++) { // Step 5 int cost = (t[j - 1] == s[i - 1]) ? 0 : 1; // Step 6 d[i, j] = Math.Min( Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost); } } // Step 7 return d[n, m]; } 
+1
source share

I put the appropriate Lucene Java code suggested by Robert Muir in C #. As for the question and “out of the box”: this is work in progress, but the code appears¹ to work and can probably be optimized2, although it really works very well.

You can find it here: https://github.com/mjvh80/LevenshteinDFA/ .

UPDATE It seems that Lucene.NET is actually not dead (yet?), And I noticed that they now have a ported version of this code. Therefore, I recommend looking there ( https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Core/Util/Automaton/LevenshteinAutomata.cs ) to implement this.


¹ code needs more tests
², because java is ported to C #, possibly also because I wrote naive replacements for some classes (for example, bitet).

+1
source share

I understand that you want to find close matches in a large dictionary. This is how I do it. .

From what I can understand about DFA, I don’t see how it is better or even completely different, under the skin. NFAs may be faster, but that is because they do not exist. Maybe I'm wrong.

0
source share

Nick Johnson has a very detailed post about building a Levenshtein automaton in Python, and the code is here . This is a good read, and I used a slightly modified version of the code that I found effective.

Mike Danlawi's answer is also good. I wonder what is the most effective in this case, searching for trie or Levenshtein DFA?

-one
source share

I would like to note that at the moment, Levenshtein Automaton implementations in Lucene and Lucene.Net use files containing parametric state tables (abstract state tables that describe specific states in the automaton) created using Moman .

If you want a solution capable of creating such tables from scratch in memory, you might need to take a look at LevenshteinAutomaton . It is in Java, but it is well structured, easy to use and widely commented on, and as such should be easier to port to C # than the current Lucene implementation. It is also supported by moi.

* Fun fact: I filed LevenshteinAutomaton as a replacement or as a link to replace the current Levenshthein Automaton implementation in Lucene ... 3 years ago .

-one
source share

All Articles