Compare two documents using regex

Question

I want to compare two documents regardless of line breaks. If the contents are the same, but the positions and the number of line breaks differ, I want to match the lines in one document with the lines in the other.

Given:

Document 1

I went to Paris in July 15, where I met some nice people. And I came back to NY in Aug 15. I am planning to go there soon after I finish what I do.

Document 2

 I went to Paris in July 15, where I met some nice people. And I came back to NY in Aug 15. I am planning to go there soon after I finish what I do.

I want an algorithm capable of determining that line 1 in document 1 contains the same text as lines 1 through 5 in document 2, that lines 2 and 3 in document 1 contain the same text as line 6 in document 2, etc.

    1 = 1,2,3,4,5
    2,3 = 6
    4,5,6 = 7,8

Is there a way to use a regular expression to match each line in one document when it spans multiple lines in the other document?

python algorithm regex

hmghaly Feb 01 '13 at 18:33

3 answers

Terry li · Answer 1 · 2013-02-01T19:44:03+0000

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import org.apache.commons.io.FileUtils;

    public class Compare {
        public static void main(String[] args) throws IOException {
            String doc1 = FileUtils.readFileToString(new File("Doc1.txt"));
            String doc2 = FileUtils.readFileToString(new File("Doc2.txt"));
            String[] array1 = doc1.split("\n");
            String[] array2 = doc2.split("\n");

            // count1[i] / count2[i] hold the cumulative word count through line i (inclusive).
            int[] count1 = new int[array1.length];
            int[] count2 = new int[array2.length];
            int sum1 = 0;
            int sum2 = 0;
            for (int i = 0; i < count1.length; i++) {
                count1[i] = sum1 + array1[i].split(" ").length;
                sum1 = count1[i];
            }
            for (int i = 0; i < count2.length; i++) {
                count2[i] = sum2 + array2[i].split(" ").length;
                sum2 = count2[i];
            }

            ArrayList<Integer> result1 = new ArrayList<Integer>();
            ArrayList<Integer> result2 = new ArrayList<Integer>();
            int j = 0;
            int k = 0;
            // Walk both arrays together; stop as soon as either one is exhausted,
            // so documents with different word totals cannot loop forever.
            while (j < count1.length && k < count2.length) {
                if (count1[j] == count2[k]) {
                    // Both documents end a line at the same word: emit the group.
                    result1.add(j + 1);
                    result2.add(k + 1);
                    System.out.println(result1.toString() + " = " + result2.toString());
                    result1 = new ArrayList<Integer>();
                    result2 = new ArrayList<Integer>();
                    j++;
                    k++;
                } else if (count1[j] > count2[k]) {
                    result2.add(k + 1);
                    k++;
                } else {
                    result1.add(j + 1);
                    j++;
                }
            }
        }
    }

Output Example:

    [1] = [1, 2, 3, 4, 5]
    [2, 3] = [6]
    [4, 5, 6] = [7, 8]

Complete and working Java code. This is not a regular expression solution, so it may not suit your needs.

The idea is to build one array per document, with one element per line. The nth element stores the cumulative number of words up to and including the nth line of that document. Elements that are equal in both arrays mark points where a line ends at the same word in both documents, and the indices between consecutive matches give the output ranges.
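Since the question is tagged python, here is a rough Python sketch of the same cumulative word count idea. The function name `match_line_groups` and the sample line breaks in `DOC1` and `DOC2` are assumptions made for illustration (the original line breaks of the question's documents are not shown here); it is not a port verified against the Java program above.

```python
from itertools import accumulate

# Hypothetical sample documents: same words, different line breaks.
DOC1 = (
    "I went to Paris in July 15, where I met some nice people.\n"
    "And I came back\n"
    "to NY in Aug 15.\n"
    "I am planning\n"
    "to go there soon\n"
    "after I finish what I do."
)
DOC2 = (
    "I went\n"
    "to Paris\n"
    "in July 15,\n"
    "where I met\n"
    "some nice people.\n"
    "And I came back to NY in Aug 15.\n"
    "I am planning to go there\n"
    "soon after I finish what I do."
)


def match_line_groups(doc1, doc2):
    """Pair up groups of 1-based line numbers that cover the same words."""
    # totals[i] = number of words in lines 0..i (inclusive).
    totals1 = list(accumulate(len(line.split()) for line in doc1.splitlines()))
    totals2 = list(accumulate(len(line.split()) for line in doc2.splitlines()))

    groups = []
    group1, group2 = [], []
    j = k = 0
    while j < len(totals1) and k < len(totals2):
        if totals1[j] == totals2[k]:
            # Both documents end a line at the same word: close the group.
            group1.append(j + 1)
            group2.append(k + 1)
            groups.append((group1, group2))
            group1, group2 = [], []
            j += 1
            k += 1
        elif totals1[j] > totals2[k]:
            group2.append(k + 1)
            k += 1
        else:
            group1.append(j + 1)
            j += 1
    return groups


for left, right in match_line_groups(DOC1, DOC2):
    print(left, "=", right)
```

Note that this only compares word counts, not the words themselves, so it assumes the two documents are already known to contain the same text.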

Jdb · Answer 2 · 2013-02-01T19:23:32+0000

I am not a python programmer, but this does not seem to be a problem that can be solved with regex.

Instead, first compare the documents to make sure the content is the same (temporarily remove all newlines beforehand). I do not know what you want to do if it is not, so I am not going to address that case.

Create a collection of integer collections called linemappings.

Begin a loop. The loop will go through each character in each document at the same time. You will need four counters. charindex1 will contain the current character index in Document 1 and charindex2 will contain the current character index in Document 2. lineindex1 will contain the current line index in Document 1 and lineindex2 will contain the current line index in Document 2.

Start with the character index variables initialized to 0 and the line index variables initialized to 1.

Start:
Get the current character from each document: char1 from document 1 and char2 from document 2.
If char1 and char2 are BOTH newlines or NEITHER is a newline, advance both charindex1 and charindex2 by 1.
Else if char1 is a newline, advance charindex1 by 1.
Else if char2 is a newline, advance charindex2 by 1.
If EITHER char1 or char2 is a newline, insert a new entry into the linemappings collection (the result at the end will be something like [[1,1],[1,2],[1,3],[1,4],[1,5],[2,6],[3,6],[4,7],[5,7],[6,7],[6,8]]).
If char1 is a newline, advance lineindex1 by 1.
If char2 is a newline, advance lineindex2 by 1.
Loop until the end of input is reached.

(I could not verify this, since I am not a python programmer, but I hope you get the gist and can change it to suit your needs.)
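One possible way to sketch the gist of this character walk in Python is shown below. It is a simplified variant, not a literal transcription of the steps: instead of juggling four counters, it records the line number of every non-whitespace character in each document and then walks the two lists in parallel. The helper names and the sample line breaks are assumptions for illustration, and all whitespace (not only newlines) is ignored when aligning characters.

```python
# Hypothetical sample documents: same words, different line breaks.
DOC1 = (
    "I went to Paris in July 15, where I met some nice people.\n"
    "And I came back\n"
    "to NY in Aug 15.\n"
    "I am planning\n"
    "to go there soon\n"
    "after I finish what I do."
)
DOC2 = (
    "I went\n"
    "to Paris\n"
    "in July 15,\n"
    "where I met\n"
    "some nice people.\n"
    "And I came back to NY in Aug 15.\n"
    "I am planning to go there\n"
    "soon after I finish what I do."
)


def char_line_numbers(doc):
    """Return the 1-based line number of every non-whitespace character."""
    return [
        line_no
        for line_no, line in enumerate(doc.splitlines(), start=1)
        for ch in line
        if not ch.isspace()
    ]


def line_mappings(doc1, doc2):
    """Return ordered (line in doc1, line in doc2) pairs that share text."""
    # First make sure the documents differ only in whitespace.
    if "".join(doc1.split()) != "".join(doc2.split()):
        raise ValueError("documents differ in more than whitespace")

    mappings = []
    # Walk both documents one non-whitespace character at a time.
    for pair in zip(char_line_numbers(doc1), char_line_numbers(doc2)):
        # Line numbers only ever increase, so comparing with the last
        # recorded pair is enough to avoid duplicates.
        if not mappings or mappings[-1] != pair:
            mappings.append(pair)
    return mappings


print(line_mappings(DOC1, DOC2))
```

Each pair says that some text from the first line number also appears on the second line number, which is the same shape as the linemappings collection described above.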

Samantha · Answer 3 · 2013-02-01T18:59:43+0000

You can iterate over each line of doc1 and do something like this:

searchstring = '[ \n]'.join(re.escape(word) for word in line.split(' '))

Then do a search on doc2 using this search string.

match = re.search(searchstring, contents)

If match is None, then there was no match. Otherwise, match.group(0) will give you the corresponding text from doc2, for example:

'I went\nto Paris\nin July 15,\nwhere I met\nsome nice people.'

Then it is a simple exercise of splitting that text on '\n' and figuring out which lines in doc2 the pieces came from.
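Putting these pieces together, a minimal sketch might look like the following. The function name `lines_spanned` and the sample line breaks are assumptions for illustration. Two choices here go beyond what the answer states: words are joined with `\s+` so that any run of whitespace matches, and each search starts where the previous match ended so that repeated phrases are matched in order.

```python
import re

# Hypothetical sample documents: same words, different line breaks.
DOC1 = (
    "I went to Paris in July 15, where I met some nice people.\n"
    "And I came back\n"
    "to NY in Aug 15.\n"
    "I am planning\n"
    "to go there soon\n"
    "after I finish what I do."
)
DOC2 = (
    "I went\n"
    "to Paris\n"
    "in July 15,\n"
    "where I met\n"
    "some nice people.\n"
    "And I came back to NY in Aug 15.\n"
    "I am planning to go there\n"
    "soon after I finish what I do."
)


def lines_spanned(doc1, doc2):
    """For each line of doc1, list the 1-based lines of doc2 it spans."""
    results = []
    pos = 0  # resume each search where the previous match ended
    for line_no, line in enumerate(doc1.splitlines(), start=1):
        words = line.split()
        if not words:
            continue  # skip blank lines
        # Escape each word so punctuation is matched literally.
        pattern = re.compile(r"\s+".join(re.escape(word) for word in words))
        match = pattern.search(doc2, pos)
        if match is None:
            results.append((line_no, []))
            continue
        # Line number = number of newlines before the position, plus one.
        first = doc2.count("\n", 0, match.start()) + 1
        last = doc2.count("\n", 0, match.end()) + 1
        results.append((line_no, list(range(first, last + 1))))
        pos = match.end()
    return results


for line_no, spanned in lines_spanned(DOC1, DOC2):
    print(line_no, "->", spanned)
```

Counting newlines before `match.start()` and `match.end()` replaces the manual splitting step; an empty list means the line was not found in the second document.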
