Is it possible that SequenceMatcher in Python difflib can provide a more efficient way to calculate Levenshtein distance?

Question

Is it possible that SequenceMatcher in Python difflib can provide a more efficient way to calculate Levenshtein distance?

Here is an example of a general algorithm tutorial for calculating Levenshtein distance (I pulled Magnus Hetland from the website ):

def levenshtein(a,b): "Calculates the Levenshtein distance between a and b." n, m = len(a), len(b) if n > m: # Make sure n <= m, to use O(min(n,m)) space a,b = b,a n,m = m,n current = range(n+1) for i in range(1,m+1): previous, current = current, [i]+[0]*n for j in range(1,n+1): add, delete = previous[j]+1, current[j-1]+1 change = previous[j-1] if a[j-1] != b[i-1]: change = change + 1 current[j] = min(add, delete, change) return current[n]

I was wondering, however, if there was a more efficient (and possibly more elegant) pure Python implementation that uses the difflib SequenceManager. After playing with him, this is what I came up with:

 from difflib import SequenceMatcher as sm def lev_using_difflib(s1, s2): a = b = size = distance = 0 for m in sm(a=s1, b=s2).get_matching_blocks(): distance += max(ma-a, mb-b) - size a, b, size = m return distance

I can not come up with a test case where it fails, and the performance seems much better than the standard algorithm.

Here are the results with the Levenshtein algorithm, which relies on difflib:

 >>> from timeit import Timer >>> setup = """ ... from difflib import SequenceMatcher as sm ... ... def lev_using_difflib(s1, s2): ... a = b = size = distance = 0 ... for m in sm(a=s1, b=s2).get_matching_blocks(): ... distance += max(ma-a, mb-b) - size ... a, b, size = m ... return distance ... ... strings = [('sunday','saturday'), ... ('fitting','babysitting'), ... ('rosettacode','raisethysword')] ... """ >>> stmt = """ ... for s in strings: ... lev_using_difflib(*s) ... """ >>> Timer(stmt, setup).timeit(100000) 36.989389181137085

And here is the standard implementation of pure python:

 >>> from timeit import Timer >>> setup2 = """ ... def levenshtein(a,b): ... n, m = len(a), len(b) ... if n > m: ... a,b = b,a ... n,m = m,n ... ... current = range(n+1) ... for i in range(1,m+1): ... previous, current = current, [i]+[0]*n ... for j in range(1,n+1): ... add, delete = previous[j]+1, current[j-1]+1 ... change = previous[j-1] ... if a[j-1] != b[i-1]: ... change = change + 1 ... current[j] = min(add, delete, change) ... ... return current[n] ... ... strings = [('sunday','saturday'), ... ('fitting','babysitting'), ... ('rosettacode','raisethysword')] ... """ >>> stmt2 = """ ... for s in strings: ... levenshtein(*s) ... """ >>> Timer(stmt2, setup2).timeit(100000) 55.594768047332764

Is algorithm performance using difflib SequenceMatcher really better? Or does he rely on the C library, which completely cancels the comparison? If it relies on C extensions, how can I tell by looking at the difflib.py implementation?

Using Python 2.7.3 [GCC 4.2.1 (Apple Inc., build 5666)]

Thanks in advance for your help!

+7

performance python algorithm levenshtein distance difflib

damzam 30 sept '12 at 6:30

source share

1 answer

Gareth rees · Accepted Answer · 2012-10-02T20:35:38+0000

 >>> levenshtein('hello', 'world') 4 >>> lev_using_difflib('hello', 'world') 5

Is it possible that SequenceMatcher in Python difflib can provide a more efficient way to calculate Levenshtein distance?

More articles: