Finding a text-diff algorithm that detects and can group similar lines

I am in the process of writing a text tool to compare two similar source code files.

There are many such diff tools already, but mine will improve on them a little:

If it finds that a set of lines is mismatched on both sides (i.e. in both files), it should not only highlight those lines but also highlight the individual changes within them (I call this inline comparison here).

An example of my somewhat working solution:

(Screenshot: http://files.tempel.org/tmp/diff_example.png)

It currently takes such a set of mismatched lines and runs their individual characters through the diff algorithm once more, producing the pink highlighting.

However, the second set of mismatches, the one containing "original 2", needs more work: here, the first two lines on the right side were added ("added line a/b"), while the third line is a modified version of the line on the left. I want my software to detect this difference between a probable change and a probable new line.

Looking at this simple example, I can work out this case fairly easily:

With an algorithm such as Levenshtein's, I could find that of all the right-side lines in the set of lines 3 to 5, right line 5 best matches left line 3. From that I could deduce that lines 3 and 4 on the right were added, and then perform the inline comparison on left line 3 and right line 5.
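To illustrate that matching step, here is a rough Python sketch. It uses `difflib.SequenceMatcher.ratio()` from the standard library as the similarity measure instead of a hand-rolled Levenshtein; the line contents and the 0.5 threshold below are invented for the example:

```python
from difflib import SequenceMatcher

def best_matches(left_lines, right_lines, threshold=0.5):
    # For each line in the left block, find the most similar line in the
    # right block; pairs scoring below the threshold stay unmatched, so
    # unmatched right lines can be classified as additions.
    pairs = []
    for i, left in enumerate(left_lines):
        score, j = max((SequenceMatcher(None, left, right).ratio(), j)
                       for j, right in enumerate(right_lines))
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

pairs = best_matches(["original line 2"],
                     ["added line a", "added line b", "original line 2 changed"])
```

With block contents like these, the lone left line pairs with the third right line, leaving the first two right lines as additions.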

So far so good. But I am still stuck on how to turn this into a more general algorithm.

In a more complex case, a set of differing lines could have added lines on both sides, with several closely matching lines between them. This gets pretty tricky:

I would need to match not just the first line on the left against the first line on the right, but every line on the left against every line on the right, and vice versa. In the worst case this can even produce crossings, so it will not be easy to figure out which lines were newly inserted and which were merely changed. (Note: I do not want to handle possibly moved lines inside such a block, unless that actually simplifies the algorithm.)

Of course this will never be perfect, but I am trying to make it better than it is now. Any suggestions that are not too theoretical but rather practical (I am not good at abstract algorithms) would be appreciated.

Update

I have to admit that I don't even understand how the LCS algorithm works internally. I just feed it two arrays of strings, and it outputs a list of the sequences that do not match. I am mainly using the code from here: http://www.incava.org/projects/java/java-diff

Looking at that code, I found an equal() function that is responsible for telling the algorithm whether two lines match or not. Based on what Paul suggested, I wonder whether that is where I should make a change. But how? This function only returns a boolean, not a relative value that could express the quality of a match. And I can't just use a fixed Levenshtein threshold to decide whether such a line still counts as equal or not: I need something that adjusts itself to the whole set of lines in question.

So basically, I am saying that I still do not understand where I would apply a fuzzy value for the relative similarity of two lines that do not match exactly.

+6
Tags: algorithm, diff, text, levenshtein-distance
3 answers

With an algorithm such as Levenshtein's, I could find that of all the right-side lines in the set of lines 3 to 5, right line 5 best matches left line 3. From that I could deduce that lines 3 and 4 on the right were added, and then perform the inline comparison on left line 3 and right line 5.

Once you have identified these blocks, use the same algorithm to determine which lines within the two blocks correspond to each other, but with a small modification. When you used the algorithm to match equal lines, a pair of lines could either match or not, so you added either 0 or 1 to a cell of the dynamic-programming table.

When comparing the lines within one such block, some of them are "more equal" than others (apologies to Orwell). So you can instead add a real number between 0 and 1 to the cell when working out which pairing fits best so far.
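As a sketch of that modified table, here is a hypothetical Python version of the LCS dynamic program in which the boolean "equal" test is replaced by a real-valued similarity callback (whatever metric you settle on):

```python
def weighted_lcs(left, right, sim):
    # Classic LCS table, except that matching two lines contributes a
    # similarity score in [0, 1] instead of a fixed 1.
    n, m = len(left), len(right)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(table[i - 1][j],
                              table[i][j - 1],
                              table[i - 1][j - 1] + sim(left[i - 1], right[j - 1]))
    return table[n][m]
```

With `sim` returning 1.0 for identical lines and 0.0 otherwise, this degenerates to the ordinary LCS length.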

To compute these scores (from 0 to 1), you can apply the same algorithm to each pair of lines you encounter... that's right, you effectively did this already during the first pass of the Levenshtein/LCS algorithm. This gives the length of the LCS, and its ratio to the average length of the two lines can serve as the score.
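For instance (a sketch, not taken from the java-diff code), the LCS length of two lines and the resulting 0-to-1 score could be computed like this:

```python
def lcs_len(a, b):
    # Rolling one-row dynamic program over the characters of two lines.
    m = len(b)
    prev = [0] * (m + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[m]

def similarity(a, b):
    # LCS length divided by the average length of the two lines.
    if not a and not b:
        return 1.0
    return lcs_len(a, b) / ((len(a) + len(b)) / 2)
```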

Or you can borrow the algorithm from one of the existing diff tools. For example, vimdiff can highlight the matches you need.

0

Levenshtein distance is based on the concept of an "edit script" that transforms one string into another. It is very closely related to the Needleman-Wunsch algorithm, which is used to align DNA sequences by inserting gap characters; there, we look for the alignment that maximizes a score, in O(nm) time using dynamic programming. Exact matches between characters increase the score, while mismatches or inserted gaps decrease it. An example alignment of AACTTGCCA and AATGCGAT:

 AACTTGCCA-
 AA-T-GCGAT
 (6 matches, 1 mismatch, 3 gap characters, 3 gap regions)

We can think of the top row as the "source" sequence, which we transform into the "target" sequence below. Each column containing a gap character - in the bottom row is a deletion, each column with - in the top row is an insertion, and each column with two different characters is a substitution. The alignment above has 2 deletions, 1 insertion and 1 substitution, so the Levenshtein distance is 4.

Here is another alignment of the same strings with the same Levenshtein distance:

 AACTTGCCA-
 AA--TGCGAT
 (6 matches, 1 mismatch, 3 gap characters, 2 gap regions)

But notice that although there is the same number of gap characters, there is one fewer gap region. Since biological processes are more likely to create one wide gap than several separate gaps, biologists prefer this alignment, and so will the users of your program. It is achieved by additionally penalizing the number of gap regions in the scores we compute. An O(nm) algorithm to do this for strings of length n and m was given by Gotoh in 1982 in a paper called "An Improved Algorithm for Matching Biological Sequences." Unfortunately I can't find any links to a free full text of the paper, but there are many useful tutorials to be found by searching for "sequence alignment" and "affine gap penalty".

In general, different choices of weights for matches, mismatches, gap characters and gap regions will produce different alignments, but any negative score for gap regions will make the second alignment above preferable to the first.

What does all this have to do with your problem? If you run the Gotoh algorithm on individual characters with a suitably chosen gap penalty (found with a little empirical testing), you should see a dramatic reduction in the number of ugly-looking alignments like the example you gave.

Efficiency considerations

Ideally you could just run this on characters and ignore lines altogether, since the affine gap penalty will naturally group changes into blocks spanning many lines, wherever they are. But given the longer running time, it may be more realistic to make a first pass over the lines and then rerun the algorithm on characters, using as input only the lines that were not identical. Under this scheme, any shared block of identical lines can be handled by compressing it into a single "character" with an inflated match weight, which helps to ensure that no "crossings" appear.
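One way to realize that compression step, sketched in Python; the private-use-area placeholder scheme is my own assumption, not something from the answer:

```python
def compress_lines(left, right):
    # Map each distinct line to a single Unicode private-use character,
    # so a subsequent character-level alignment treats a whole line as
    # one heavyweight symbol; identical lines get identical symbols.
    table = {}
    def encode(lines):
        out = []
        for line in lines:
            if line not in table:
                table[line] = chr(0xE000 + len(table))
            out.append(table[line])
        return "".join(out)
    return encode(left), encode(right)
```

The alignment can then give these placeholder symbols a much larger match weight than ordinary characters.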

+3

Here is one possible solution that someone else helped me arrive at:

My original approach was this:

  • Split the text into separate lines and use the LCS algorithm to determine where the blocks of mismatched lines are.
  • Use some clever algorithm (the one this question asks about) to figure out which of those lines closely match, i.e. which lines were likely changed between the versions.
  • Compare each pair of closely matching lines with LCS again, character by character, and mark the remaining unmatched lines as entirely new.

While this would improve the visual display of changes when comparing source versions, I have now found that a simpler approach is usually sufficient. It works like this:

  • Same as above.
  • Take the left and right blocks of mismatched lines, concatenate each block's lines and tokenize them (either into language tokens/words, or simply into individual characters).
  • Apply the LCS algorithm to the two token arrays.
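The whole simpler pipeline can be sketched with Python's standard difflib (the names and structure here are my own; the original tool is written in Java, and this version tokenizes into characters):

```python
from difflib import SequenceMatcher

def two_level_diff(left, right):
    # First pass: diff at line level. For each mismatched ("replace")
    # block, join its lines and re-diff the joined text character by
    # character, instead of comparing the block line by line.
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, left, right).get_opcodes():
        if tag == "replace":
            a = "\n".join(left[i1:i2])
            b = "\n".join(right[j1:j2])
            ops.append(("replace", SequenceMatcher(None, a, b).get_opcodes()))
        else:
            ops.append((tag, i1, i2, j1, j2))
    return ops

ops = two_level_diff(["a", "x", "c"], ["a", "y", "c"])
```

The inner opcodes then drive the fine-grained (pink) highlighting within the block.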

Maybe those who answered my original question meant this all along, but I was so focused on a per-line comparison that it did not occur to me to apply LCS to the block of lines as a whole, by joining them, rather than processing them line by line.

So while this approach does not deliver change information as detailed as my original plan would have, it still improves the results over what I had yesterday, when I started writing this question.

I will leave this question open for a while: maybe someone else, having read all this, can still give a complete answer (Paul and random_hacker have offered some suggestions, but it's not a final solution yet; thanks for the helpful comments anyway).

0
