Compare text and get the difference

Question

Well, I want to compare two strings (version 1 and version 2) and get the differences in a format that I can convert to HTML myself. For example, see how an edited post is shown here on Stack Overflow, or how svn tracks differences between revisions. ...

It must be a fully managed-code library.

Something like this JavaScript one, but I need to do it on the server side.

+7

.net vb.net asp.net

Peter Jul 18 '11 at 11:10

2 answers

Google has something similar that is available in C#, but I have not looked at it in depth. The demo looks pretty cool, though.

http://code.google.com/p/google-diff-match-patch/

+8

Remy Jul 18 '11 at 11:30

Lasse Vågsæther Karlsen · Accepted Answer · 2011-07-18T11:25:05+0000

I have a class library that does this. I will post the link below, but I will also describe how it works so that you can evaluate whether it fits your content.

Note that in everything I say below, if you think of each character as an element of a collection, you can implement the algorithm for any type of content, whether that is the characters of a string, lines of text, or a collection of ORM objects.

The whole algorithm revolves around longest-common-substring (LCS) and is a recursive approach.

First, the algorithm tries to find the LCS between the two versions. This is the longest section that is unchanged, that is, identical in both versions. The algorithm then considers these two parts to be “aligned.”

For example, here is how two strings would be aligned:

    This long text has some text in the middle that will be found by LCS
    This extra long text has some text in the middle that should be found by LCS
              ^-------- longest common substring --------^
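A minimal sketch of this first step, assuming a plain dynamic-programming search. The answer does not say how the library itself finds the LCS, and the name `longest_common_substring` is made up for illustration. Python is used only for brevity. The function works on any indexable sequence, so strings and lists of lines are handled the same way:

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest run of equal
    elements shared by sequences a and b; length is 0 if nothing matches."""
    best = (0, 0, 0)
    # prev[j] holds the length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one element earlier in both inputs.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


old = "This long text has some text in the middle that will be found by LCS"
new = "This extra long text has some text in the middle that should be found by LCS"
start_old, start_new, length = longest_common_substring(old, new)
print(start_old, start_new, length)
print(repr(old[start_old:start_old + length]))
```

For the two example strings this reports a shared run starting at index 4 in the first string and index 10 in the second. A strict search also counts the space before “long”, so the run is 44 characters long.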

Then it recursively applies itself to the parts before the aligned section, and then to the parts after it.

The final “result” might look like this (I use underscores to mark where one of the strings has nothing):

    This ______long text has some text in the middle that ______will be found by LCS
    This extra long text has some text in the middle that should____ be found by LCS

Then, as part of the recursive approach, each level of the recursive call returns a collection of “operations”. Which operation is emitted depends on whether a section is an LCS or is missing from one of the versions:

If it is an LCS, then this is a “copy” operation
If the part is missing from the first version, then this is an “insert” operation
If the part is missing from the second version, then this is a “delete” operation

So for the two example strings above, the operations would look like this:

Copy 5 characters (This_)
Insert extra_ (the underscore stands for a space, which the code formatting here would otherwise strip)
Copy 43 characters (long text has some text in the middle that_)
Insert should
Delete 4 characters (will)
Copy 16 characters (_be found by LCS)
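Since the goal in the question is HTML output, here is a rough sketch of how an operation list like the one above could be rendered. The tuple format and the `render_html` name are assumptions made for this example, not the output format of any particular library. Inserts are wrapped in `<ins>` and deletes in `<del>`:

```python
from html import escape


def render_html(ops):
    """Render (operation, text) tuples as HTML. Copied text is emitted as-is,
    inserted text is wrapped in <ins>, deleted text in <del>."""
    parts = []
    for op, text in ops:
        safe = escape(text)  # never emit raw user text into HTML
        if op == "copy":
            parts.append(safe)
        elif op == "insert":
            parts.append("<ins>" + safe + "</ins>")
        elif op == "delete":
            parts.append("<del>" + safe + "</del>")
        else:
            raise ValueError("unknown operation: " + op)
    return "".join(parts)


# The six operations listed above, written out by hand.
ops = [
    ("copy", "This "),
    ("insert", "extra "),
    ("copy", "long text has some text in the middle that "),
    ("insert", "should"),
    ("delete", "will"),
    ("copy", " be found by LCS"),
]
print(render_html(ops))
```

Styling `ins` and `del` with CSS (for example a green and a red background) then gives a display similar to an edit history view.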

The core of the algorithm is quite simple, and with the above text you should be able to implement it yourself if you want.
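As a starting point, here is a minimal sketch of the recursive approach described above. It is an illustration under assumptions, not the library's actual code: the function names and the `(operation, elements)` tuple format are invented for the example, and the LCS search is a simple dynamic-programming one.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest shared run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


def diff(old, new):
    """Return a list of (operation, elements) tuples that turn old into new."""
    if not old and not new:
        return []
    start_old, start_new, length = longest_common_substring(old, new)
    if length == 0:
        # Nothing in common: all of old is deleted, all of new is inserted.
        ops = []
        if old:
            ops.append(("delete", old))
        if new:
            ops.append(("insert", new))
        return ops
    # Recurse on the parts before and after the aligned section.
    before = diff(old[:start_old], new[:start_new])
    after = diff(old[start_old + length:], new[start_new + length:])
    return before + [("copy", old[start_old:start_old + length])] + after


old = "This long text has some text in the middle that will be found by LCS"
new = "This extra long text has some text in the middle that should be found by LCS"
# Comparing word by word; pass the strings directly to compare characters.
for op, words in diff(old.split(), new.split()):
    print(op, " ".join(words))
```

The usage example compares lists of words, which yields the same copy, insert and delete steps as the list above (with the delete reported before the insert). Comparing character by character gives finer-grained results, for example it also aligns the “l” shared by “will” and “should”.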

My class library has some additional features, in particular for handling content that looks like changed text, so that you get not only delete and insert operations but also change operations. This matters mainly when you are comparing a list of something, such as lines from text files.

The class library can be found here: DiffLib on CodePlex, and you will also find it on NuGet for easy installation in Visual Studio 2010. It is written in C# for .NET 3.5 and up, so it will work with .NET 3.5 and 4.0, and since it is distributed as a binary (all the source code is on CodePlex), you can use it from VB.NET as well.
