I need to compare two XML documents. I have looked at many of the XML diff tools usually mentioned here on Stack Overflow, but my needs are, of course, rather unusual, so none of them is suitable. In short, I need to compare not the documents as a whole but the content of their elements (taking order into account), and I need a very specific output format rather than a traditional diff patch.
Please excuse the volume of text, but I find it hard to explain this more briefly.
First, my constraints
The solution must be Java-based or integrate with a Java command-line application. It must also be free, because I am not allowed to spend “real money” on it, only my working time (but not too much of that, of course; I have a deadline hanging over me) ... sound familiar? Finally, my goal is neither a traditional diff patch nor a simple merge of both source files.
Second, a description of my data
Each document contains nodes of type text or section; text nodes are simple strings, but sections can contain both text nodes and other sections (sections also have a name, given as an attribute). In addition, each node is tagged with revision information.
Here is a sample document. Note that for brevity it looks like a list; in reality it is more like prose, which means the order of the elements is very important.
    <document diff="=" revision="1">
      <text diff="=" revision="1">Apples</text>
      <text diff="=" revision="1">Chxrries</text>
      <section diff="=" revision="1" name="Blue ones">
        <text diff="=" revision="1">Grapes</text>
        <section diff="=" revision="1" name="More">
          <text diff="=" revision="1">Blueberries</text>
        </section>
        <text diff="=" revision="1">Oranges</text>
      </section>
    </document>
This needs to be compared with the new version, which contains changes, but does not contain revision information (for now!). In this example, I fixed a typo in the 2nd element, and I moved another element, but there can be much more extensive changes, such as adding or removing whole sections.
    <document>
      <text>Apples</text>
      <text>Oranges</text>
      <text>Cherries</text>
      <section name="Blue ones">
        <text>Grapes</text>
        <section name="More">
          <text>Blueberries</text>
        </section>
      </section>
    </document>
The goal is to create a third XML document that contains all of this information. Note that the diff tags of the affected elements have been changed ("*" represents a change within the element) and their revision numbers have been bumped; unchanged elements retain their previous revision information.
    <document diff="*" revision="2">
      <text diff="=" revision="1">Apples</text>
      <text diff="+" revision="2">Oranges</text>
      <text diff="-" revision="2">Chxrries</text>
      <text diff="+" revision="2">Cherries</text>
      <section diff="*" revision="1" name="Blue ones">
        <text diff="=" revision="1">Grapes</text>
        <section diff="=" revision="1" name="More">
          <text diff="=" revision="1">Blueberries</text>
        </section>
        <text diff="-" revision="2">Oranges</text>
      </section>
    </document>
So the result is not a diff patch but a complete document with updated revision information.
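The tagging rule for text elements in the sample result can be sketched in Java as follows. This is only an illustration: the class `TaggedText` and its factory methods are hypothetical names, it covers text elements only (not the "*" tag on containers), and it assumes the text needs no XML escaping.

```java
class TaggedText {
    final String text;
    final char diff;      // '=' unchanged, '+' added, '-' removed
    final int revision;

    TaggedText(String text, char diff, int revision) {
        this.text = text;
        this.diff = diff;
        this.revision = revision;
    }

    /** Unchanged text keeps the revision it already had. */
    static TaggedText unchanged(String text, int oldRevision) {
        return new TaggedText(text, '=', oldRevision);
    }

    /** Added text is stamped with the new revision. */
    static TaggedText added(String text, int newRevision) {
        return new TaggedText(text, '+', newRevision);
    }

    /** Removed text stays in the output, stamped with the new revision. */
    static TaggedText removed(String text, int newRevision) {
        return new TaggedText(text, '-', newRevision);
    }

    /** Serializes the element (assumes the text contains no XML special characters). */
    String toXml() {
        return "<text diff=\"" + diff + "\" revision=\"" + revision + "\">" + text + "</text>";
    }

    public static void main(String[] args) {
        int newRevision = 2;
        // Reproduces the four top-level text elements of the sample result.
        System.out.println(unchanged("Apples", 1).toXml());
        System.out.println(added("Oranges", newRevision).toXml());
        System.out.println(removed("Chxrries", newRevision).toXml());
        System.out.println(added("Cherries", newRevision).toXml());
    }
}
```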
Third, what I have working, and my problem
I have most of this working using a custom Java function that performs a linear comparison, except that it fails in one specific use case: when the old version contains a certain text more than once and one of those occurrences is changed in the new version. This “tricks” the comparator into matching the old text with its next occurrence in the new version instead of recognizing a single-text change for what it is. Although the result is technically correct, the added “noise” of unnecessary additions and deletions masks this fact, and for people it's just a mess to look at (and, by the way, this markup is intended for human reading). Precisely because of my linear, step-by-step approach, I find this very difficult to fix.
Here is an example of a use case that deceives my code. First, a simple fruit basket:
    <document diff="=" revision="1">
      <text diff="=" revision="1">Apples</text>
      <text diff="=" revision="1">Oranges</text>
      <text diff="=" revision="1">Apples</text>
      <text diff="=" revision="1">Cherries</text>
      <text diff="=" revision="1">Apples</text>
      <text diff="=" revision="1">Grapes</text>
    </document>
Now change the second Apples element:
    <document>
      <text>Apples</text>
      <text>Oranges</text>
      <text>Bananas</text>    <--- I've only changed this
      <text>Cherries</text>
      <text>Apples</text>
      <text>Grapes</text>
    </document>
The result, incorrectly, will look like this:
    <document diff="*" revision="2">
      <text diff="=" revision="1">Apples</text>
      <text diff="=" revision="1">Oranges</text>
      <text diff="+" revision="2">Bananas</text>     <--- Addition, okay
      <text diff="+" revision="2">Cherries</text>    <--- Incorrectly added
      <text diff="=" revision="1">Apples</text>      <--- Incorrectly matches the next occurrence
      <text diff="-" revision="2">Cherries</text>    <--- Incorrectly removed
      <text diff="-" revision="2">Apples</text>      <--- Incorrectly removed
      <text diff="=" revision="1">Grapes</text>      <--- Back on track, after the next occurrence of the changed element
    </document>
Granted, I could probably mitigate this problem by implementing some form of look-ahead, but I could not say how far ahead to look, so it sounds like a dirty hack rather than a true solution.
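For comparison, one standard technique that needs no look-ahead distance is a diff based on the longest common subsequence (LCS), which considers both lists as a whole. The following is a minimal sketch of that technique, not my current code: the class `LcsDiff` is a hypothetical name, it works on flat lists of strings rather than XML nodes, and the old list is assumed to end with Grapes, as the result above implies. It uses O(n·m) memory, which is only reasonable for short lists.

```java
import java.util.ArrayList;
import java.util.List;

class LcsDiff {

    /** Returns entries such as "=Apples", "-Apples", "+Bananas". */
    static List<String> diff(List<String> left, List<String> right) {
        int n = left.size(), m = right.size();
        // lcs[i][j] = length of the LCS of left[i..] and right[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = left.get(i).equals(right.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        // Walk the table from the start, emitting matches, deletions and additions.
        List<String> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (left.get(i).equals(right.get(j))) {
                result.add("=" + left.get(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                result.add("-" + left.get(i));   // dropping the left element loses nothing
                i++;
            } else {
                result.add("+" + right.get(j));
                j++;
            }
        }
        while (i < n) result.add("-" + left.get(i++));
        while (j < m) result.add("+" + right.get(j++));
        return result;
    }

    public static void main(String[] args) {
        List<String> oldList = List.of("Apples", "Oranges", "Apples", "Cherries", "Apples", "Grapes");
        List<String> newList = List.of("Apples", "Oranges", "Bananas", "Cherries", "Apples", "Grapes");
        // prints [=Apples, =Oranges, -Apples, +Bananas, =Cherries, =Apples, =Grapes]
        System.out.println(diff(oldList, newList));
    }
}
```

On the fruit basket this reports a single deletion and a single addition at the changed position, which is the kind of output I am after.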
... so, in conclusion, I desperately need an XML diff tool that lets me analyze the content of the data and produce this very specific result. Either that, or any advice on how I could avoid this particular error.
If you have any suggestions or follow-up questions, I would really like to hear them.
This is a restatement of a previous question. Unfortunately, I cannot offer a bounty to promote it, but hopefully my new explanation here is better.
For what it's worth, here is my algorithm, which does not seem to be listed on the DiffAlgorithm page that @LarsH linked to.
Compare the two lists; call them lL and lR for the left and right side. Create two “primary” pointers, iL and iR, and set them to the first element of each list. In a loop, use these primary pointers to pick the primary elements eL and eR, so that eL = lL(iL) and eR = lR(iR), and compare eL with eR.

If eL matches eR, copy eL into the result as a match and advance both primary pointers by one slot.

If eL and eR do not match, create a secondary pointer iR2, initialize it to the slot after iR (iR2 = iR + 1), and scan the rest of lR (setting eR2 = lR(iR2) as we go).

If eL has no match in the rest of lR, eL must have been deleted: add eL to the result as a deletion and advance only the primary pointer iL (so that the next comparison checks the next eL against the same eR).

If eL is found to match eR2 (at a position iR2 > iR), then all elements in the range [iR, iR2) must have been added: add each element of lR in that range to the result as an addition and set iR = iR2. Then add eL to the result as a match (because it matched eR2) and repeat the comparison from the new primary positions.

Repeat until one of the two lists is exhausted; then add the remainder of lL as deletions or the remainder of lR as additions.
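The steps above can be sketched in Java roughly as follows. This is a simplified reconstruction on flat lists of strings, not my actual code: the class name `LinearDiff` is made up, I assume that after a match on eR2 both primary pointers move past the matched element, and the old list is assumed to end with Grapes, as the faulty result shown earlier implies.

```java
import java.util.ArrayList;
import java.util.List;

class LinearDiff {

    /** Returns entries such as "=Apples", "-Apples", "+Bananas". */
    static List<String> diff(List<String> lL, List<String> lR) {
        List<String> result = new ArrayList<>();
        int iL = 0, iR = 0;                       // primary pointers
        while (iL < lL.size() && iR < lR.size()) {
            String eL = lL.get(iL);
            if (eL.equals(lR.get(iR))) {          // match: advance both pointers
                result.add("=" + eL);
                iL++;
                iR++;
                continue;
            }
            // Mismatch: scan the rest of lR for eL with a secondary pointer.
            int iR2 = iR + 1;
            while (iR2 < lR.size() && !eL.equals(lR.get(iR2))) {
                iR2++;
            }
            if (iR2 == lR.size()) {               // not found: eL was deleted
                result.add("-" + eL);
                iL++;
            } else {                              // found: [iR, iR2) was added
                for (int k = iR; k < iR2; k++) {
                    result.add("+" + lR.get(k));
                }
                result.add("=" + eL);             // eL matched eR2
                iL++;
                iR = iR2 + 1;                     // assumption: move past the match
            }
        }
        // One list is exhausted: the leftovers are deletions or additions.
        for (; iL < lL.size(); iL++) result.add("-" + lL.get(iL));
        for (; iR < lR.size(); iR++) result.add("+" + lR.get(iR));
        return result;
    }

    public static void main(String[] args) {
        List<String> oldList = List.of("Apples", "Oranges", "Apples", "Cherries", "Apples", "Grapes");
        List<String> newList = List.of("Apples", "Oranges", "Bananas", "Cherries", "Apples", "Grapes");
        // prints [=Apples, =Oranges, +Bananas, +Cherries, =Apples, -Cherries, -Apples, =Grapes]
        System.out.println(diff(oldList, newList));
    }
}
```

Run on the fruit basket, this reproduces the noisy result: the unbounded forward scan matches the changed Apples against the next Apples in the new list, so Cherries ends up both added and removed.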