Fast approximate string difference for large strings

I am trying to quantify the difference between two strings as part of a change-monitoring system.

The problem I ran into is that the strings are large: I often have to deal with strings of 100K+ characters.

I am currently using Levenshtein distance, but calculating Levenshtein distance for large strings is very inefficient: even good implementations take O(m·n) time, which is quadratic when the strings are of similar length.

Since both strings are approximately the same length, computing the distance can take many seconds.
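For reference, the quadratic baseline looks like this (a minimal sketch; the function name is mine, and it keeps only two DP rows to save memory):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: O(len(a) * len(b)) time,
    O(min(len(a), len(b))) extra space."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay small
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

The inner loop runs len(a) * len(b) times, which is what makes 100K-character inputs painful.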

I do not need high accuracy. An error of 1 in 1000 (i.e. 0.1%) would be fine for my application.

What options are there for computing the difference between strings more efficiently?

1 answer

If you can tolerate some error, you can try breaking the strings into smaller chunks and computing their pairwise Levenshtein distances.

The method will obviously give an exact result for substitutions; insertions and deletions will be penalized in accuracy depending on the number of chunks (the worst case gives you a distance of 2 * <number of insertions/deletions> * <number of chunks> instead of <number of insertions/deletions>, because a single shift can misalign every chunk boundary after it).
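A minimal sketch of the chunking idea (names and the default chunk count are mine): both strings are cut at proportional offsets, and the exact distances of aligned pieces are summed. This drops the cost from roughly O(n²) to O(n²/chunks).

```python
def levenshtein(a: str, b: str) -> int:
    # Classic O(len(a) * len(b)) dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def chunked_distance(a: str, b: str, chunks: int = 100) -> int:
    """Approximate distance: split both strings into `chunks` pieces at
    proportional offsets and sum exact distances of aligned pieces."""
    def piece(s: str, k: int) -> str:
        return s[len(s) * k // chunks : len(s) * (k + 1) // chunks]
    return sum(levenshtein(piece(a, k), piece(b, k)) for k in range(chunks))
```

With same-length inputs a pure substitution stays inside one aligned piece, so the result is exact; an insertion shifts chunk boundaries and inflates the estimate as described above.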

The next step could be to make the process adaptive, depending on the expected nature of the changes.
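One concrete way to adapt the chunking (my own sketch, not necessarily the adaptation the answer had in mind): recursively halve both strings, stop early when pieces are identical, and pay the quadratic cost only on small differing pieces. The `cutoff` parameter is an illustrative assumption.

```python
def levenshtein(a: str, b: str) -> int:
    # Exact O(len(a) * len(b)) edit distance, used only on small pieces.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def adaptive_distance(a: str, b: str, cutoff: int = 64) -> int:
    """Recursively halve both strings: identical halves cost 0 immediately,
    so only the small regions that actually differ are compared exactly."""
    if a == b:
        return 0
    if len(a) <= cutoff or len(b) <= cutoff:
        return levenshtein(a, b)
    ma, mb = len(a) // 2, len(b) // 2
    return (adaptive_distance(a[:ma], b[:mb], cutoff)
            + adaptive_distance(a[ma:], b[mb:], cutoff))
```

For mostly-identical inputs this is near-linear (the equality checks dominate), while the worst-case accuracy penalty at piece boundaries is the same as for fixed chunking.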
