I'm in the process of writing a tool that compares two similar source code files.
There are many such diff tools already, but mine is meant to improve on them in one respect:
If it finds a set of mismatched lines on both sides (i.e. in both files), it should not only highlight those lines but also highlight the individual changes within them (I call this in-line comparison here).
An example of my partially working solution:
http://files.tempel.org/tmp/diff_example.png
It currently takes a set of mismatched lines and runs their individual characters through the diff algorithm once more, which produces the pink highlighting.
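To make that per-character pass concrete, here is a sketch of the classic LCS table applied to the characters of two mismatched lines (class and method names are my own; any character not on the common subsequence would be the one to paint pink):

```java
// Sketch: character-level LCS between two mismatched lines.
// Characters outside the returned subsequence are the changed ones.
public class CharDiff {
    public static String lcs(String a, String b) {
        // len[i][j] = length of the LCS of a[0..i) and b[0..j)
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                len[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? len[i - 1][j - 1] + 1
                        : Math.max(len[i - 1][j], len[i][j - 1]);
        // Walk back through the table to recover the subsequence itself.
        StringBuilder sb = new StringBuilder();
        int i = a.length(), j = b.length();
        while (i > 0 && j > 0) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) {
                sb.append(a.charAt(i - 1)); i--; j--;
            } else if (len[i - 1][j] >= len[i][j - 1]) {
                i--;
            } else {
                j--;
            }
        }
        return sb.reverse().toString();
    }
}
```

For example, lcs("original 1", "original 2") keeps the shared prefix "original " and leaves the trailing digit on each side as the difference to highlight.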
However, the second set of mismatched lines, the one containing "original 2", needs more work: here, the first two right-hand lines were added ("added line a/b"), while the third line is a modified version of the line on the left side. I want my software to detect this difference between a probable change and a probable new line.
Looking at this simple example, I can detect this case rather easily:
With an algorithm such as Levenshtein's, I could find that of all the right-hand lines in the set (lines 3 to 5), line 5 matches left line 3 best. From that I could conclude that lines 3 and 4 on the right were added, and perform the in-line comparison of left line 3 against right line 5.
So far, so good. But I'm still stuck on how to turn this into a more general algorithm for this purpose.
In a more complex situation, a set of mismatched lines could have lines added on both sides, with a few closely matching lines between them. This gets quite tricky:
I'd not only have to match the first left line against every right line, but also the other way around, and so on for all the other lines. Basically, I have to match every line on the left against every line on the right. In the worst case this can even produce crossings, making it hard to tell which lines were newly inserted and which were merely changed (note: I don't want to handle moved lines inside such a block, unless that actually simplifies the algorithm).
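One way such an every-line-against-every-line pairing could be sketched is a greedy, non-crossing alignment over a normalized similarity score. Everything here is made up for illustration, in particular the 0.5 cutoff, which is exactly the kind of arbitrary constant I'd like to avoid:

```java
// Sketch: pair up the lines of two mismatched blocks greedily, left to right,
// never letting a pair cross an earlier one. Unpaired lines count as
// insertions (right) or deletions (left).
public class BlockAligner {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // 1.0 = identical, 0.0 = nothing in common, scaled by the longer line.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    // match[i] = index of the right line paired with left line i, or -1.
    public static int[] align(String[] left, String[] right, double cutoff) {
        int[] match = new int[left.length];
        int next = 0; // right lines before `next` are already consumed
        for (int i = 0; i < left.length; i++) {
            match[i] = -1;
            double best = cutoff;
            for (int j = next; j < right.length; j++) {
                double s = similarity(left[i], right[j]);
                if (s > best) { best = s; match[i] = j; }
            }
            if (match[i] >= 0) next = match[i] + 1;
        }
        return match;
    }
}
```

Because each pair must start after the previous one, crossings simply cannot occur here; the price is that a genuinely moved line will show up as a delete plus an insert, which matches my stated non-goal above.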
Of course, this will never be perfect, but I'm trying to make it better than it is now. Any suggestions that aren't too theoretical but rather practical (I'm not good at understanding abstract algorithms) would be appreciated.
Update
I have to admit that I don't even understand how the LCS algorithm works internally. I just feed it two arrays of strings, and it gives me back a list of the sequences that don't match. I'm mainly using the code from here: http://www.incava.org/projects/java/java-diff
Looking at that code, I find an equal() function that is responsible for telling the algorithm whether two lines match or not. Based on what Paul suggested, I wonder if that's where I'd make the change. But how? This function only returns a boolean, not a relative value that could express the quality of the match. And I can't simply use a fixed Levenshtein cutoff to decide whether such a line still counts as equal or not; I need something that adjusts itself to the whole set of lines in question.
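For illustration, this is the kind of fixed-cutoff fuzzy equality I mean, and exactly what I'd like to avoid: the 0.6 value is arbitrary rather than derived from the block of lines being compared (all names here are my own, not from java-diff):

```java
// Illustration only: a fixed-cutoff fuzzy predicate that could stand in for
// a boolean equal() check. The hard-coded cutoff is the weakness: it does
// not adapt to how similar the surrounding lines happen to be.
public class FuzzyEqual {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // true if the normalized similarity reaches the (arbitrary) cutoff.
    public static boolean fuzzyEquals(String a, String b, double cutoff) {
        int max = Math.max(a.length(), b.length());
        double sim = max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
        return sim >= cutoff;
    }
}
```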
So basically, what I'm saying is that I still don't understand where I would apply such a fuzzy value for the relative similarity of lines that don't match exactly.