I have two files (f1 and f2) containing some text (or binary data). How can I quickly find their common blocks?
e.g.
f1: ABC DEF
f2: XXABC XEF
output:
common blocks:
length 4: "ABC" at f1 @ 0 and f2 @ 2
length 2: "EF" at f1 @ 5 and f2 @ 8
Wikipedia has pseudo-code for finding the longest common substring of two data sequences. In your case, instead of keeping only the longest entry, you read every common substring out of the table that is not a prefix of another common substring (i.e., the maximal common substrings).
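A minimal Python sketch of that idea, assuming the usual dynamic-programming table in which each cell holds the length of the common run ending at that pair of positions. The function name `common_blocks` and the `min_len` parameter are made up for illustration; they do not come from Wikipedia or any library. It works on `str` or `bytes`, but it needs time and memory proportional to len(f1) * len(f2), so it is only practical for small inputs.

```python
def common_blocks(a, b, min_len=2):
    """Return (length, start_in_a, start_in_b) for each maximal common block.

    `common_blocks` and `min_len` are hypothetical names for this sketch.
    """
    n, m = len(a), len(b)
    # table[i][j] = length of the common run ending at a[i-1] and b[j-1]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1

    blocks = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            length = table[i][j]
            # The run is maximal if the next cell on the diagonal does not extend it.
            extends = i < n and j < m and table[i + 1][j + 1] > 0
            if length >= min_len and not extends:
                blocks.append((length, i - length, j - length))
    # Longest blocks first
    return sorted(blocks, reverse=True)
```

To compare two files, you could read both in binary mode and call `common_blocks(open("f1", "rb").read(), open("f2", "rb").read())`. Raising `min_len` filters out short accidental matches.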
Duplo: http://sourceforge.net/projects/duplo/
The open source PMD project has a cut and paste detection module, which is listed on this page: http://pmd.sourceforge.net/integrations.html .