Faster than C# (or other .NET): Levenshtein distance

Good evening,

I have been working on fuzzy string matching for some time and, using C with pointers, I was able to write a very fast (for my needs) implementation of the Levenshtein distance between two strings. I tried porting the code to C# using unsafe code and the fixed keyword, but it came out slower. So I decided to build a C++ DLL and call it from C# with [DllImport], letting each string be marshaled automatically. The problem is that, according to the profiler, this remains the most time-consuming part of my program, accounting for 50-57% of the total running time. Since I will need to do this heavy work over many substrings of a text field, drawn from a database of some 3 million entries, the time taken by the Levenshtein distance is almost unacceptable. I would like to know whether you have any suggestions, algorithmic or programming-related, for the code below, or whether you know a better algorithm for computing this distance.

```c
#define Inicio1   (*(BufferVar))
#define Inicio2   (*(BufferVar+1))
#define Fim1      (*(BufferVar+2))
#define Fim2      (*(BufferVar+3))
#define IndLinha  (*(BufferVar+4))
#define IndCol    (*(BufferVar+5))
#define CompLinha (*(BufferVar+6))
#define TamTmp    (*(BufferVar+7))

int __DistanciaEdicao(char *Termo1, char *Termo2,
                      int TamTermo1, int TamTermo2,
                      int *BufferTab, int *BufferVar)
{
    *(BufferVar)     = *(BufferVar + 1) = 0;
    *(BufferVar + 2) = TamTermo1 - 1;
    *(BufferVar + 3) = TamTermo2 - 1;

    /* Skip the common prefix. */
    while ((Inicio1 <= *(BufferVar + 2)) && (Inicio2 <= *(BufferVar + 3))
           && *(Termo1 + Inicio1) == *(Termo2 + Inicio2))
        Inicio1 = ++Inicio2;

    if (Inicio2 > Fim2)
        return (Fim1 - Inicio1 + 1);

    /* Skip the common suffix. */
    while ((Fim1 >= 0) && (Fim2 >= 0)
           && *(Termo1 + Fim1) == *(Termo2 + Fim2)) {
        Fim1--;
        Fim2--;
    }

    if (Inicio2 > Fim2)
        return (Fim1 - Inicio1 + 1);

    TamTermo1 = Fim1 - Inicio1 + 1;
    TamTermo2 = Fim2 - Inicio2 + 1;
    CompLinha = ((TamTermo1 > TamTermo2) ? TamTermo1 : TamTermo2) + 1;

    /* First column and first row of the DP matrix. */
    for (IndLinha = 0; IndLinha <= TamTermo2; *(BufferTab + CompLinha * IndLinha) = IndLinha++);
    for (IndCol = 0; IndCol <= TamTermo1; *(BufferTab + IndCol) = IndCol++);

    for (IndCol = 1; IndCol <= TamTermo1; IndCol++)
        for (IndLinha = 1; IndLinha <= TamTermo2; IndLinha++)
            /* Cell = diagonal neighbour on a match,
               otherwise min(upper, left, diagonal) + 1. */
            *(BufferTab + CompLinha * IndLinha + IndCol) =
                (*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1)))
                    ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))
                    : ((*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))
                        ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol))
                            ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1))
                            : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol))
                        : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))
                            ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1))
                            : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1;

    return *(BufferTab + CompLinha * TamTermo2 + TamTermo1);
}
```

Please note that BufferVar and BufferTab are two int* buffers allocated externally (int[] arrays passed in from C#), which I deliberately do not allocate on every function call, to speed up the whole process. Even so, this code is too slow for my needs. Can anyone give me some suggestions or, if possible, point me to better code?

Edit: The distance cannot be bounded; I need the exact distance.

Thank you very much,

3 answers

1. Brute force

Here is a Levenshtein distance implementation in Python.

```python
def levenshtein_matrix(lhs, rhs):
    def move(index):
        return (index + 1) % 2

    m = len(lhs)
    n = len(rhs)
    # Keep only two rows of the DP matrix at any time.
    states = [list(range(n + 1)), [0] * (n + 1)]
    previous = 0
    current = 1
    for i in range(1, m + 1):
        states[current][0] = i
        for j in range(1, n + 1):
            add = states[current][j - 1] + 1
            sub = states[previous][j] + 1
            # Python 3: (lhs[i-1] != rhs[j-1]) replaces abs(cmp(...)).
            repl = states[previous][j - 1] + (lhs[i - 1] != rhs[j - 1])
            states[current][j] = min(repl, add, sub)
        previous = move(previous)
        current = move(current)
    return states[previous][n]
```

This is the typical dynamic programming algorithm; it just takes advantage of the fact that, since only the previous row is needed, keeping two rows at a time is enough.

For a C++ implementation, you can look at the LLVM one (lines 70-130); note the use of a fixed-size stack-allocated array, replaced by a dynamically allocated one only when necessary.

I honestly can't follow your code well enough to diagnose it... so let's change the angle of attack. Instead of micro-optimizing the distance computation, let's change the algorithm entirely.

2. Improvement: using a dictionary

Given the problem you are facing, you can do much better.

The first observation is that the distance is symmetric; while that does not change the overall complexity, it can halve the required time.

Secondly, since you actually have a dictionary of known words, you can exploit it: "actor" and "actual" share a common prefix ("act"), so you do not need to recompute the first few rows for both.

This can be exploited with a trie (or any other sorted structure) used to store your words. You then take one word and compute its distance against all the words stored in the dictionary, sharing the work done on common prefixes.

Take the example dic = ["actor", "actual", "addict", "atchoum"], and suppose we want to compute the distance for word = "atchoum" (we remove it from the dictionary at this point):

  • Initialize the matrix for the word "atchoum": matrix = [[0, 1, 2, 3, 4, 5, 6, 7]]
  • Choose the next word "actor"
  • Prefix = "a", matrix = [[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4, 5, 6]]
  • Prefix = "ac", matrix = [[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4, 5, 6], [2, 1, 1, 1, 2, 3, 4, 5]]
  • Prefix = "act", matrix = [[..], [..], [..], [..]]
  • Continue until the end of "actor"; the last cell holds the distance
  • Pick the next word, "actual", and rewind the matrix until its computed prefix is a prefix of the new word, here "act"
  • Prefix = "act", matrix = [[..], [..], [..], [..]]
  • Continue until "actual"
  • Proceed with the remaining words

The rewinding step is the key here: by reusing the computations already done for the previous word, with which the new word may share a fairly long prefix, you save a lot of work.

Note that this is trivially implemented with a simple stack of matrix rows and requires no recursion.
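The rewind-and-extend procedure above can be sketched in Python (the same language as the snippet earlier in this answer). This is a hypothetical helper, not code from the question; it uses a plain sorted list rather than a trie, since sorting already groups words by common prefix:

```python
def distances_to_dictionary(word, dictionary):
    """Levenshtein distance from `word` to every word in `dictionary`,
    reusing DP rows shared by common prefixes."""
    n = len(word)
    results = {}
    # rows[k] is the DP row for the first k characters of the
    # current dictionary word; rows[0] is the base row.
    rows = [list(range(n + 1))]
    prev = ""
    for cand in sorted(dictionary):
        # Rewind: keep only the rows for the prefix shared with `prev`.
        common = 0
        while common < min(len(cand), len(prev)) and cand[common] == prev[common]:
            common += 1
        del rows[common + 1:]
        # Extend: one new row per remaining character of `cand`.
        for k in range(common, len(cand)):
            last = rows[-1]
            row = [k + 1]
            for j in range(1, n + 1):
                cost = 0 if cand[k] == word[j - 1] else 1
                row.append(min(row[j - 1] + 1, last[j] + 1, last[j - 1] + cost))
            rows.append(row)
        results[cand] = rows[-1][n]
        prev = cand
    return results
```

With the example dictionary above, moving from "actor" to "actual" rewinds only to the rows for "act", so those three rows are never recomputed.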


First try the simple approach: no pointers, no unsafe code, just plain ordinary C# code... but use the correct algorithm.

There is a simple and efficient algorithm on Wikipedia that uses dynamic programming and runs in O(n*m) time, where n and m are the lengths of the inputs. I suggest you implement that algorithm first, as it is described there, and start optimizing only after you have implemented it, measured its performance, and found it insufficient.

See also the "Possible improvements" section of that article:

By examining diagonals instead of rows, and by using lazy evaluation, we can find the Levenshtein distance in O(m(1 + d)) time (where d is the Levenshtein distance), which is much faster than the regular dynamic programming algorithm if the distance is small.
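The lazy-diagonal method quoted above is fiddly to implement. As an illustration of the same idea (work proportional to the distance rather than to the full matrix), here is a hedged Python sketch of a different but related technique: a banded DP with threshold k, doubling k until the answer is certain. It always returns the exact distance, so it respects the "no limit" requirement from the question; function names are made up for the example:

```python
INF = float("inf")

def levenshtein_banded(a, b, k):
    """Exact Levenshtein distance if it is <= k; otherwise some value > k.
    Only cells with |i - j| <= k are computed."""
    m, n = len(a), len(b)
    if abs(m - n) > k:
        return k + 1
    prev = [j if j <= k else INF for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [INF] * (n + 1)
        cur[0] = i if i <= k else INF
        for j in range(max(1, i - k), min(n, i + k) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

def levenshtein(a, b):
    # Double the band until the banded result fits inside it,
    # at which point it is guaranteed to be exact.
    k = 1
    while True:
        d = levenshtein_banded(a, b, k)
        if d <= k:
            return d
        k *= 2
```

The total work is roughly O(min(m, n) * d), which degrades gracefully to the full O(n*m) when the strings are very different.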

If I had to guess where the problem is, I would probably start by looking at this line, which runs inside two loops:

```c
*(BufferTab + CompLinha * IndLinha + IndCol) =
    (*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1)))
        ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))
        : ((*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))
            ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol))
                ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1))
                : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol))
            : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))
                ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1))
                : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1;
```

There seems to be a lot of duplication in there, though I find it hard to pin down exactly what is happening. Could you factor it out? You should definitely make it more readable.


You should not compare against every candidate word with the full Levenshtein algorithm. Use another, faster metric first to filter out the likely candidates, and only then apply Levenshtein to disambiguate among them. The first sieve can be based on n-gram (often trigram) frequency histograms or on hash functions.
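As a sketch of such a sieve (helper names and the padding choice are assumptions for illustration, not a prescribed design), one can compare trigram multisets with collections.Counter and keep only words sharing at least a minimum number of trigrams with the query; only the survivors are then scored with Levenshtein:

```python
from collections import Counter

def trigrams(word):
    # Pad with two spaces on each side so short words still yield trigrams
    # and boundary characters are weighted.
    padded = "  " + word + "  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def candidates(word, dictionary, min_shared=1):
    """Cheap first sieve: keep only words sharing at least `min_shared`
    trigrams with the query (Counter & Counter takes minimum counts)."""
    query = trigrams(word)
    return [w for w in dictionary
            if sum((query & trigrams(w)).values()) >= min_shared]
```

Tuning min_shared (or normalizing by word length) trades recall for how many expensive Levenshtein calls remain.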

