I am trying to quantify the difference between two lines as part of a change monitoring system.
The problem I ran into is that the strings are large - I can often deal with strings with 100K + characters.
I am currently using Levenshtein distance, but calculating levenshtein distance for large strings is very inefficient. Even the best implementations control only O(min(mn)).
Since both lines are approximately the same length, the process of calculating the distance can take many seconds.
I do not need high accuracy. Permission to change 1 in 1000 (e.g. 0.1%) would be great for my application.
What options are there to more efficiently calculate string lengths?
source
share