What is the best (word or symbol) algorithm based on differences?

Question

What is the best (word or symbol) algorithm based on differences?

So, I want to find the difference between the two lines based on each word (perhaps faster than for each character, although if each character is faster, then I would like to do it this way).

Here is an example of what I want to achieve: Source:

Hello there!

Modified text:

 Helay scere?

Diff:

 Hel[lo](ay) [th](sc)ere[!](?)

parenthesized text is what was deleted, copied text is what was added.

There is a kind of super hacker way to do this using a command line tool such as opendiff , but this requires a newline between each character, since opendiff is line-based.

I use ruby and did not find any tools for this ... but the language is not very important, as the algorithms can be easily ported.

thanks.

+8

string algorithm merge ruby diff

NullVoxPopuli Dec 05 '11 at 20:54

source share

5 answers

You can check this out: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem . It is not difficult to implement.

+2

Victor moroz Dec 05 '11 at 21:53

source share

So you can reuse LCS (as indicated above) to find all the regular lines and remove them from both lines, replacing them with another line - let it just say "*". Then you iterate over both lines at the same time, and then combine the common and distinct images again.

Example

 A) Hello there! B) Helay scere? LCS detection gives us ["Hel"," ","ere"], and after replacement we have A) *lo*th*! B) *ay*sc*? Now you split on the delimiter ("*") giving you A) ["lo","th","!"] B) ["ay","sc","?"]

And from here you just go to make a simple grid. Something key to note that there may be empty entries, for example, if you do this method on “Hell” and “Hel”, you will end up with

 Common LCS) ["Hel"] A) ["l"] B) [""] meaning your result will be Hel[l]()

Hope this is acceptable.

+2

Noxville Dec 6 '11 at 9:26

source share

Take a look at https://github.com/pvande/differ . This stone does what you are looking for.

+2

alex Dec 27 '13 at 10:42

source share

The solution is to find the editing distance between the lines.

0

Bhavana c Dec 6 '11 at 2:32

source share

NullVoxPopuli · Accepted Answer · 2011-12-05T21:32:14+0000

Here is the ruby stone that distinguishes the lines: http://rubydoc.info/gems/diff-lcs/1.1.3/frames

Before hands, I just did (in irb)

 require 'rubygems' require 'diff/lcs' require 'diff/lcs/array' require 'diff/lcs/string'

So, by writing logic for insertion, the built-in deleted and inserted markers become trivial due to this two-dimensional range of changes.

Although I'm not sure if this is the best way.

What is the best (word or symbol) algorithm based on differences?

More articles: