I am writing a tool in PHP that compares HTML files and shows the differences. Now I'm looking for an efficient way to calculate the percentage difference between two HTML files. The files can be arbitrary, and mine can be up to 300,000 characters long.
After some research, I came across the Levenshtein distance, which takes O(n * m) time and, in its basic form, O(n * m) space. PHP's built-in levenshtein() function only supports strings of up to 255 characters, and my own O(n)-space implementation was too slow. After that I tried PHP's similar_text function, but it is also too slow for very large HTML files.
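For illustration, an O(n)-space Levenshtein usually keeps only two rows of the dynamic-programming table instead of the full matrix. The following is a minimal sketch of that idea, not the exact implementation referred to above; the function names `levenshteinTwoRow` and `percentDifference` are hypothetical, and the sketch assumes byte-wise comparison and normalizes by the length of the longer input.

```php
<?php
// Levenshtein distance using two rolling rows:
// O(n * m) time, O(min(n, m)) extra space.
// Assumption: strings are compared byte by byte, not by multibyte character.
function levenshteinTwoRow(string $a, string $b): int
{
    // Keep the shorter string on the inner axis so the rows stay small.
    if (strlen($a) < strlen($b)) {
        [$a, $b] = [$b, $a];
    }
    $n = strlen($a);
    $m = strlen($b);

    // Distances from the empty prefix of $a to every prefix of $b.
    $prev = range(0, $m);

    for ($i = 1; $i <= $n; $i++) {
        $curr = [$i];
        $ca = $a[$i - 1];
        for ($j = 1; $j <= $m; $j++) {
            $cost = ($ca === $b[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution or match
            );
        }
        $prev = $curr;
    }
    return $prev[$m];
}

// Edit distance expressed as a percentage of the longer input's length
// (one possible normalization, chosen here as an assumption).
function percentDifference(string $a, string $b): float
{
    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 0.0;
    }
    return 100.0 * levenshteinTwoRow($a, $b) / $longest;
}
```

The memory use is small, but the running time is still quadratic: for two 300,000-character inputs the inner loop runs about 9 × 10^10 times, which is why this approach is too slow at that size.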
So now I'm looking for another, more efficient algorithm for comparing HTML files. An approximation algorithm would also be fine. Can someone give me some tips on how to do this?