Calculating the percentage difference between two HTML files

Question

I am writing a tool in PHP that compares HTML files and shows the differences. Now I'm looking for an efficient way to calculate the percentage difference between two HTML files. The files can be arbitrarily large (mine can be up to 300,000 characters long).

After some research, I stumbled upon the Levenshtein distance, which is an O(n * m) algorithm and requires O(n * m) space. PHP's built-in levenshtein() only supports strings of up to 255 characters, and my own O(n) implementation was too slow. After that I tried PHP's similar_text() function, but it is also too slow for very large HTML files.

So now I'm looking for another, more efficient algorithm for comparing HTML files. An approximation algorithm would also be fine. Can someone give me some tips on how to do this?

html php diff levenshtein distance

Devos50 Mar 25 '13 at 14:28

1 answer

Aziz saleh · Answer 1 · 2014-03-03T20:30:13+0000

You can install the xdiff extension and use its diff functions:

http://www.php.net/manual/en/function.xdiff-file-diff.php

Then compute the diff of the two files and derive the percentage from the size of that diff.

Example:

First file A: 400 words
Second file B: 400 words

Diff Results: 200 words diff from A to B

Since 200 of the 400 words differ, this gives you a 50% resemblance.
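As a rough sketch of how that could look in code: xdiff produces a line-based unified diff, so the example below counts changed lines rather than words. The helper names percentChanged() and percentDifference() are made up for illustration and are not part of xdiff, and the metric (changed lines divided by the total number of lines in both inputs) is only one possible definition of "percentage difference".

```php
<?php
// Sketch only: percentChanged() and percentDifference() are hypothetical
// helpers, not part of the xdiff extension.

/**
 * Estimate the percentage of lines that differ, given a unified diff.
 * Assumed metric: (removed lines + added lines) / (old lines + new lines).
 */
function percentChanged(string $unifiedDiff, int $oldLines, int $newLines): float
{
    $changed = 0;
    $inHunk = false;
    foreach (explode("\n", $unifiedDiff) as $line) {
        if (strncmp($line, '@@', 2) === 0) {
            // Hunk header: everything after it is diff content.
            $inHunk = true;
            continue;
        }
        if (!$inHunk || $line === '') {
            // Skip any file headers before the first hunk and empty lines.
            continue;
        }
        if ($line[0] === '-' || $line[0] === '+') {
            $changed++;
        }
    }
    $total = $oldLines + $newLines;
    return $total === 0 ? 0.0 : 100.0 * $changed / $total;
}

/**
 * Compare two strings with xdiff. Returns null if the extension is
 * missing or the diff could not be computed.
 */
function percentDifference(string $a, string $b): ?float
{
    if (!function_exists('xdiff_string_diff')) {
        return null;
    }
    $diff = xdiff_string_diff($a, $b);
    if ($diff === false) {
        return null;
    }
    // Approximate line counts; a trailing newline is not counted as a line.
    $count = fn(string $s): int =>
        $s === '' ? 0 : substr_count(rtrim($s, "\n"), "\n") + 1;
    return percentChanged($diff, $count($a), $count($b));
}
```

With this metric, replacing 200 of 400 lines in each file yields 400 changed lines out of 800, i.e. 50%. Because the diff is line-based, HTML that sits on one or a few very long lines will give coarse results, so it may help to put each tag on its own line before diffing.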
