I have a class library that does this. I will post the link below, but I will also describe how it works so that you can judge whether it fits your content.
Note that everything below is described in terms of characters, but if you treat each character as an element of a collection, the same algorithm works for any type of content: the characters of a line, the lines of a text file, or a collection of ORM objects.
The whole algorithm revolves around the longest common substring (LCS) and is recursive.
First, the algorithm tries to find the LCS of the two versions. This is the longest section that is identical in both. The algorithm then considers these two parts to be “aligned.”
For example, this is how two lines would be aligned:
    This long text has some text in the middle that will be found by LCS
    This extra long text has some text in the middle that should be found by LCS
              ^-------- longest common substring --------^
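To make the first step concrete, here is a minimal Python sketch of finding the longest common substring with a simple dynamic-programming table. It is an illustration only, not code taken from the library; the function name `longest_common_substring` and its return shape are assumptions made for this example.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest run common to a and b.

    Works on any indexable sequences (strings, lists of lines, ...).
    On ties, the first run found while scanning `a` left to right wins.
    """
    best = (0, 0, 0)
    # prev[j] holds the length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one element earlier in both inputs
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


old = "This long text has some text in the middle that will be found by LCS"
new = "This extra long text has some text in the middle that should be found by LCS"
start_old, start_new, length = longest_common_substring(old, new)
print(repr(old[start_old:start_old + length]))
```

On the two example lines this reports a 44-character run, because the space before "long" is shared by both lines as well. The table costs time proportional to the product of the two lengths, which is fine for single lines but worth keeping in mind for large inputs.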
It then recursively applies itself to the parts in front of the aligned section, and then to the parts after it.
The final “result” might look like this (I use underscores to mark where one of the lines has nothing corresponding to the other):
    This ______long text has some text in the middle that ______will be found by LCS
    This extra long text has some text in the middle that should____ be found by LCS
As part of the recursive approach, each level of the recursion returns a collection of “operations”, chosen by whether a section is an LCS or is missing from one of the two versions:
- If it is an LCS, it is a “copy” operation
- If it is missing from the first version, it is an “insert” operation
- If it is missing from the second version, it is a “delete” operation
So the example above becomes the following operations (underscores stand for spaces, which would otherwise be lost in formatting):
- Copy 5 characters (This_)
- Insert extra_
- Copy 43 characters (long text has some text in the middle that_)
- Insert should
- Delete 4 characters (will)
- Copy 16 characters (_be found by LCS)
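The recursion and the three operations can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the library's code: the function name `diff` and the `(operation, items)` tuple format are made up for this example, and the longest match is found with `SequenceMatcher.find_longest_match` from Python's standard `difflib` module (which is unrelated to the class library discussed here and requires hashable elements).

```python
from difflib import SequenceMatcher


def diff(a, b):
    """Return a list of (operation, items) pairs that turn sequence a into b."""
    if not a and not b:
        return []
    # autojunk=False so that frequent elements are never ignored as "junk"
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    if match.size == 0:
        # Nothing in common: all of a is deleted, all of b is inserted
        ops = []
        if a:
            ops.append(("delete", a))
        if b:
            ops.append(("insert", b))
        return ops
    end_a = match.a + match.size
    end_b = match.b + match.size
    # Recurse on the parts before the aligned section, then on the parts after it
    return (
        diff(a[:match.a], b[:match.b])
        + [("copy", a[match.a:end_a])]
        + diff(a[end_a:], b[end_b:])
    )


old = "This long text has some text in the middle that will be found by LCS"
new = "This extra long text has some text in the middle that should be found by LCS"
for operation, items in diff(old, new):
    print(operation, repr(items))
```

A strict implementation like this one also aligns the single "l" shared by "will" and "should", so its output is somewhat more fragmented than the hand-written list above. Because it only slices and compares elements, the same function accepts lists, for example `diff(old_text.splitlines(), new_text.splitlines())` to compare two texts line by line.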
The core of the algorithm is quite simple, and with the above text you should be able to implement it yourself if you want.
My class library has some additional features, in particular for handling content that looks like changed text, so that you get not only delete and insert operations but also change operations. This matters mostly when you are comparing a list of something, such as the lines of text files.
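How the library decides that something is a change is not described here, so the following Python sketch is only a simplistic stand-in: it assumes operations are `(operation, items)` tuples and folds every delete that is immediately followed by an insert into a single hypothetical `("change", old_items, new_items)` operation. The name `merge_changes` is invented for this example.

```python
def merge_changes(ops):
    """Fold each delete directly followed by an insert into one change operation.

    ops is a list of (operation, items) pairs using "copy", "insert", "delete".
    Change operations are returned as ("change", old_items, new_items).
    """
    merged = []
    i = 0
    while i < len(ops):
        operation, items = ops[i]
        next_is_insert = i + 1 < len(ops) and ops[i + 1][0] == "insert"
        if operation == "delete" and next_is_insert:
            merged.append(("change", items, ops[i + 1][1]))
            i += 2  # the insert has been consumed as part of the change
        else:
            merged.append((operation, items))
            i += 1
    return merged


line_ops = [
    ("copy", ["line 1"]),
    ("delete", ["line 2"]),
    ("insert", ["line two"]),
    ("copy", ["line 3"]),
]
for entry in merge_changes(line_ops):
    print(entry)
```

A more careful approach would also check that the deleted and inserted items are actually similar before calling them a change; this sketch skips that step.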
The class library can be found here: DiffLib on CodePlex, and you will also find it on NuGet for easy installation in Visual Studio 2010. It is written in C# for .NET 3.5 and higher, so it works with .NET 3.5 and 4.0, and since it is a binary release (all the source code is on CodePlex), you can also use it from VB.NET.