The Haskell tail recursion question for Levenshtein distances

I've been playing with Levenshtein distance calculation in Haskell, and I'm a little disappointed by the following performance issue. If you implement it the "normal" way for Haskell, as in dist below, everything works fine:

    dist :: (Ord a) => [a] -> [a] -> Int
    dist s1 s2 = ldist s1 s2 (L.length s1, L.length s2)

    ldist :: (Ord a) => [a] -> [a] -> (Int, Int) -> Int
    ldist _ _ (0, 0) = 0
    ldist _ _ (i, 0) = i
    ldist _ _ (0, j) = j
    ldist s1 s2 (i+1, j+1) = output
      where output | s1 !! i == s2 !! j = ldist s1 s2 (i, j)
                   | otherwise = 1 + L.minimum [ ldist s1 s2 (i, j)
                                               , ldist s1 s2 (i+1, j)
                                               , ldist s1 s2 (i, j+1) ]

But if you bend your brain a little and implement it as dist', it performs MUCH faster (about 10x):

    dist' :: (Ord a) => [a] -> [a] -> Int
    dist' o1 o2 = (levenDist o1 o2 [[]]) !! 0 !! 0

    levenDist :: (Ord a) => [a] -> [a] -> [[Int]] -> [[Int]]
    levenDist s1 s2 arr@[[]]    = levenDist s1 s2 [[0]]
    levenDist s1 s2 arr@([]:xs) = levenDist s1 s2 ([L.length arr - 1] : xs)
    levenDist s1 s2 arr@(x:xs)  =
      let n1 = L.length s1
          n2 = L.length s2
          n_i = L.length arr
          n_j = L.length x
          match | s2 !! (n_j - 1) == s1 !! (n_i - 2) = True
                | otherwise                          = False
          minCost = if match
                      then (xs !! 0) !! (n2 - n_j + 1)
                      else L.minimum [ 1 + (xs !! 0) !! (n2 - n_j + 1)
                                     , 1 + (xs !! 0) !! (n2 - n_j + 0)
                                     , 1 + (x !! 0) ]
          dist | n_i > n1 && n_j > n2 = arr
               | n_j > n2  = ([] : arr) `seq` levenDist s1 s2 ([] : arr)
               | n_i == 1  = ((n_j : x) : xs) `seq` levenDist s1 s2 ((n_j : x) : xs)
               | otherwise = ((minCost : x) : xs) `seq` levenDist s1 s2 ((minCost : x) : xs)
      in dist

I tried all the usual seq tricks on the first version, but nothing sped it up. This is a little unsatisfying, because I had expected the first version to be faster: it should not need to evaluate the entire matrix, only the parts it needs.

Does anyone know if it is possible to get these two implementations to perform comparably, or am I just reaping the benefits of tail-recursion optimization in the latter, and so have to live with its unreadability if I want performance?

Thanks, Orion

+6
haskell recursion tail-recursion levenshtein-distance
5 answers

I have not yet worked through your second attempt, but as far as I remember, the idea of the Levenshtein algorithm is to avoid repeated work by memoizing calculations in a matrix. In the first code snippet you do not memoize anything, and so you repeat many calculations. For example, when computing ldist s1 s2 (5,5) you will compute ldist s1 s2 (4,4) at least three separate times (once directly, once via ldist s1 s2 (4,5), and once via ldist s1 s2 (5,4)).
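To make the blow-up concrete, here is a small sketch (the function calls and its name are my own illustration, not code from the question) that counts the nodes in the naive recursion tree, assuming the worst case where no characters ever match:

```haskell
-- Counts how many recursive calls the naive version makes for an
-- i-by-j subproblem in the worst case (no matching characters).
-- Memoization would cut this down to (i+1)*(j+1) distinct subproblems.
calls :: Int -> Int -> Integer
calls 0 _ = 1
calls _ 0 = 1
calls i j = 1 + calls (i - 1) (j - 1) + calls (i - 1) j + calls i (j - 1)
-- calls 2 2 == 19, even though there are only 9 distinct subproblems
```

The count grows exponentially in min(i, j), which is exactly why the un-memoized version falls so far behind.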

What you want is an algorithm that generates the matrix itself (as a list of lists, if you like). I think that is what your second code snippet does, but it seems to focus on computing the matrix top-down rather than building it in a purely inductive style (recursive calls inside the base cases are quite unusual, in my opinion). Unfortunately, I don't have time to write it all out, but fortunately someone else has: look at the first version at this address: http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Haskell
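For reference, a minimal sketch of that inductive, whole-matrix style (my own paraphrase of the idea, not the exact code at that link; the names levenMemo, table, and d are mine):

```haskell
-- Lazily build the whole (n+1) x (m+1) matrix as a list of lists;
-- each cell looks its neighbours up in the matrix instead of
-- recomputing them, so every subproblem is evaluated at most once.
levenMemo :: Eq a => [a] -> [a] -> Int
levenMemo xs ys = table !! length xs !! length ys
  where
    table = [ [ d i j | j <- [0 .. length ys] ] | i <- [0 .. length xs] ]
    d 0 j = j
    d i 0 = i
    d i j
      | xs !! (i - 1) == ys !! (j - 1) = table !! (i - 1) !! (j - 1)
      | otherwise = 1 + minimum [ table !! (i - 1) !! j
                                , table !! i !! (j - 1)
                                , table !! (i - 1) !! (j - 1) ]
```

Laziness does the scheduling here: demanding the bottom-right cell forces exactly the cells it depends on, each once.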

Two more things. First, I'm not sure the Levenshtein algorithm can ever use only part of the matrix, since each entry depends on its diagonal, vertical, and horizontal neighbours; when you need the value at one corner, you will inevitably have to evaluate the matrix all the way to the other corner. Second, a line like match | foo = True | otherwise = False can be replaced simply by match = foo.

+2

In the past I have used this very compact version with foldl and scanl, from Wikibooks:

    distScan :: (Ord a) => [a] -> [a] -> Int
    distScan sa sb = last $ foldl transform [0 .. length sa] sb
      where transform xs@(x:xs') c = scanl compute (x + 1) (zip3 sa xs xs')
              where compute z (c', x, y) = minimum [y + 1, z + 1, x + fromEnum (c' /= c)]

I just ran this simple benchmark using Criterion:

    test :: ([Int] -> [Int] -> Int) -> Int -> Int
    test f n = f up up + f up down + f up half + f down half
      where up   = [1 .. n]
            half = [1 .. div n 2]
            down = reverse up

    main = let n = 20 in defaultMain
      [ bench "Scan" $ nf (test distScan) n
      , bench "Fast" $ nf (test dist') n
      , bench "Slow" $ nf (test dist) n
      ]

And the Wikibooks version beats both of yours handily:

    benchmarking Scan
    collecting 100 samples, 51 iterations each, in estimated 683.7163 ms
    mean: 137.1582 us, lb 136.9858 us, ub 137.3391 us, ci 0.950

    benchmarking Fast
    collecting 100 samples, 11 iterations each, in estimated 732.5262 ms
    mean: 660.6217 us, lb 659.3847 us, ub 661.8530 us, ci 0.950
    ...

Slow was still running after a couple of minutes.

+5

To compute length you have to evaluate the whole spine of the list. That is an expensive O(n) operation, and, more importantly, the list will then be held in memory until you stop referencing it (=> more memory). The rule of thumb is not to use length on lists that are expected to be long. The same applies to (!!): it traverses the list from the head every time, so it is also O(n). Lists are not designed to be a random-access data structure.
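As a small illustration of the alternative (the helper toArr is my own, not from the answer): converting the list to an immutable Data.Array once makes every subsequent lookup O(1):

```haskell
import Data.Array

-- Build the array once (O(n)), then index with (!) in O(1),
-- instead of paying O(n) for every (!!) on a plain list.
toArr :: [a] -> Array Int a
toArr xs = listArray (0, length xs - 1) xs

-- toArr "haskell" ! 3  evaluates to 'k'
```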

A better approach with Haskell lists is to consume them incrementally. Folds are a natural fit for problems like this, and the Levenshtein distance can be computed that way (see the link below). I do not know whether there are better algorithms.

Another approach is to use a different data structure instead of lists. For example, if you need random access, a known length, etc., look at Data.Sequence.Seq.
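A tiny sketch of what that buys you (the function firstAndLast is mine, just for illustration): with Data.Sequence, length is O(1) and index is O(log n), so positional access never re-walks the whole structure:

```haskell
import qualified Data.Sequence as Seq

-- Seq.length is O(1) and Seq.index is O(log n), unlike the O(n)
-- length and (!!) on plain lists.
firstAndLast :: Seq.Seq a -> (a, a)
firstAndLast s = (Seq.index s 0, Seq.index s (Seq.length s - 1))
```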

Existing Implementations

The second approach was used in this implementation of the Levenshtein distance in Haskell (using arrays). A foldl-based implementation can be found in the first comment there. BTW, foldl' is usually better than foldl.

+3

There is an O(N*d) algorithm, where d is the Levenshtein distance. Here is a Lazy ML implementation by Lloyd Allison, which exploits laziness to achieve the improved complexity. It works by computing only part of the matrix, namely a region around the main diagonal whose width is proportional to the Levenshtein distance.
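To sketch the underlying idea (this is my own illustration using the band-doubling trick, not Allison's lazy implementation; the names distBand, widen, and bandedDist are mine): only cells within k of the main diagonal are ever demanded, and if the answer exceeds k the band is doubled, giving roughly O(N*d) work overall:

```haskell
-- Banded Levenshtein distance: cells with |i - j| > k are treated as
-- "infinity" and never computed. If the true distance is <= k, the
-- optimal path stays inside the band, so the banded answer is exact;
-- otherwise we retry with a band twice as wide.
distBand :: Eq a => [a] -> [a] -> Int
distBand xs ys = widen 1
  where
    n = length xs
    m = length ys
    inf = n + m + 1                       -- larger than any real distance
    widen k | d <= k    = d               -- answer fits in the band: exact
            | otherwise = widen (2 * k)   -- too big: double the band width
      where d = bandedDist k
    bandedDist k = get n m
      where
        get i j | abs (i - j) > k = inf   -- outside the band: never computed
                | otherwise       = table !! i !! j
        table = [ [ cell i j | j <- [0 .. m] ] | i <- [0 .. n] ]
        cell 0 j = j
        cell i 0 = i
        cell i j
          | xs !! (i - 1) == ys !! (j - 1) = get (i - 1) (j - 1)
          | otherwise = 1 + minimum [ get (i - 1) j
                                    , get i (j - 1)
                                    , get (i - 1) (j - 1) ]
```

Because the final band width is at most 2d, only O(N*d) cells are ever forced for similar strings, which matches the behaviour described above.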

Edit: I just noticed that this has been translated into Haskell, with a nice picture showing which matrix elements get evaluated. It should be significantly faster than the implementations above when the sequences are very similar. Using the benchmark above:

    benchmarking Scan
    collecting 100 samples, 100 iterations each, in estimated 1.410004 s
    mean: 141.8836 us, lb 141.4112 us, ub 142.5126 us, ci 0.950

    benchmarking LAllison.d
    collecting 100 samples, 169 iterations each, in estimated 1.399984 s
    mean: 82.93505 us, lb 82.75058 us, ub 83.19535 us, ci 0.950
+2

A more intuitive solution, using the data-memocombinators package. Credit goes to this answer. Benchmarks are welcome, since all the solutions presented here seem much slower than python-Levenshtein, which is presumably written in C. Note that I tried substituting arrays for the strings, to no effect.

    import Data.MemoCombinators (memo2, integral)

    levenshtein :: String -> String -> Int
    levenshtein a b = levenshtein' (length a) (length b)
      where
        levenshtein' = memo2 integral integral levenshtein''
        levenshtein'' x y
          -- take x characters from a and y characters from b
          | x == 0 = y
          | y == 0 = x
          | a !! (x - 1) == b !! (y - 1) = levenshtein' (x - 1) (y - 1)
          | otherwise = 1 + minimum [ levenshtein' (x - 1) y
                                    , levenshtein' x (y - 1)
                                    , levenshtein' (x - 1) (y - 1) ]
0
