Faster

I am working on various large binaries. I implemented the famous Myers Diff algorithm, which gives minimal diff. However, this is O (ND), so in order to separate two very different 1 MB files, I expect it to take 1 million squares = 1 trillion. This is not good!

What I would like is an algorithm that creates a potentially not minimal diff, but does it much faster. I know that it must exist because Beyond Compare does this. But I dont know how!

Of course: There are tools like xdelta or bdiff, but they create a patch designed to consume computers, which is different from the difference consumed by humans. The patch focuses on converting one file to another, so it can do things like copying from previous parts of a file. Differences in human costs allow you to visually show the differences and can only insert and delete. For example, this conversion:

"puddi" โ†’ "puddipuddipuddi"

will create a small patch with โ€œcopy [0.4] to [5.9] and [10, 14]โ€, but with a big difference โ€œappend" puddipuddi '. "I'm interested in algorithms that create a higher diff.

Thanks!

+7
source share
2 answers

Diffing is basically the same algorithm used in bioinformatics to align DNA sequences. These sequences are often large (millions or billions of nucleotides long), and one strategy that works well there in longer genomes is used by the MUMmer program:

  • Quickly find all Maximum unique matches (substrings that appear in both files and which cannot be expanded in any direction while maintaining this condition) using the suffix tree
  • Quickly find the longest MUM subset that is displayed in sequential order in both files using the dynamic programming algorithm with the highest increase-subsequence
  • Correct this subset of MUM in alignment (i.e. mark these areas as appropriate)
  • If deemed necessary, perform more slowly (e.g. Myers), varying across MUM regions. In your case, you would probably omit this step completely if you found that the length of the longest MUM was below a certain threshold (which you could make proof that these 2 files are not connected).

This usually gives a very good (although not guaranteed-optimal) set of aligned areas (or, whatโ€™s the same, a very small set of differences) when there are not many differences. I am not sure of the exact time limits for each step, but I know that there are no n^2 or higher terms.

I believe that the MUMmer program requires a DNA or protein sequence, so it may not work out of the box for you, but the concepts certainly apply to common strings (for example, files), so if you are ready to redefine it yourself, I would recommend this an approach.

+4
source

In terms of performance, as the file size increases, GNU Diffutils is probably the most reliable option. For your situation, I would probably use side by side a comparative format , which is probably the most human friendly. In addition, you do not draw a conclusion in a different format and do some work to make it beautiful.

A good opponent, whose work has been steadily improving, including numerous accelerations, diff-match-patch . It implements the Myers Diff algorithm in several different languages, including Java and JavaScript. See an online demo for an example of the latter with fairly printed results. If you want the line to be different, check out the wiki for tips on how to use it for this purpose.

+1
source

All Articles