Here is an example of a general algorithm tutorial for calculating Levenshtein distance (I pulled Magnus Hetland from the website ):
def levenshtein(a,b): "Calculates the Levenshtein distance between a and b." n, m = len(a), len(b) if n > m: # Make sure n <= m, to use O(min(n,m)) space a,b = b,a n,m = m,n current = range(n+1) for i in range(1,m+1): previous, current = current, [i]+[0]*n for j in range(1,n+1): add, delete = previous[j]+1, current[j-1]+1 change = previous[j-1] if a[j-1] != b[i-1]: change = change + 1 current[j] = min(add, delete, change) return current[n]
I was wondering, however, if there was a more efficient (and possibly more elegant) pure Python implementation that uses the difflib SequenceManager. After playing with him, this is what I came up with:
from difflib import SequenceMatcher as sm def lev_using_difflib(s1, s2): a = b = size = distance = 0 for m in sm(a=s1, b=s2).get_matching_blocks(): distance += max(ma-a, mb-b) - size a, b, size = m return distance
I can not come up with a test case where it fails, and the performance seems much better than the standard algorithm.
Here are the results with the Levenshtein algorithm, which relies on difflib:
>>> from timeit import Timer >>> setup = """ ... from difflib import SequenceMatcher as sm ... ... def lev_using_difflib(s1, s2): ... a = b = size = distance = 0 ... for m in sm(a=s1, b=s2).get_matching_blocks(): ... distance += max(ma-a, mb-b) - size ... a, b, size = m ... return distance ... ... strings = [('sunday','saturday'), ... ('fitting','babysitting'), ... ('rosettacode','raisethysword')] ... """ >>> stmt = """ ... for s in strings: ... lev_using_difflib(*s) ... """ >>> Timer(stmt, setup).timeit(100000) 36.989389181137085
And here is the standard implementation of pure python:
>>> from timeit import Timer >>> setup2 = """ ... def levenshtein(a,b): ... n, m = len(a), len(b) ... if n > m: ... a,b = b,a ... n,m = m,n ... ... current = range(n+1) ... for i in range(1,m+1): ... previous, current = current, [i]+[0]*n ... for j in range(1,n+1): ... add, delete = previous[j]+1, current[j-1]+1 ... change = previous[j-1] ... if a[j-1] != b[i-1]: ... change = change + 1 ... current[j] = min(add, delete, change) ... ... return current[n] ... ... strings = [('sunday','saturday'), ... ('fitting','babysitting'), ... ('rosettacode','raisethysword')] ... """ >>> stmt2 = """ ... for s in strings: ... levenshtein(*s) ... """ >>> Timer(stmt2, setup2).timeit(100000) 55.594768047332764
Is algorithm performance using difflib SequenceMatcher really better? Or does he rely on the C library, which completely cancels the comparison? If it relies on C extensions, how can I tell by looking at the difflib.py implementation?
Using Python 2.7.3 [GCC 4.2.1 (Apple Inc., build 5666)]
Thanks in advance for your help!