How to determine if a webpage has been modified

I have snapshots of several web pages, each taken at two different times. What is a reliable method for determining which web pages have been modified?

I cannot rely on something like an RSS feed, and I need to ignore slight noise, such as date text.

Ideally, I'm looking for a Python solution, but an intuitive algorithm would also be great.

Thanks!

+6
python diff snapshot webpage
4 answers

Well, first you need to decide what counts as noise and what does not. You can use an HTML parser like BeautifulSoup to remove the noise, render the result as text, and compare the two versions as strings.
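
A minimal sketch of that idea, using the standard library's html.parser in place of BeautifulSoup so it runs without extra packages. The set of tags treated as noise (script, style, time) is only an assumption; adjust it to whatever counts as noise on your pages. The names TextExtractor, visible_text and pages_differ are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping everything inside the noise tags."""

    # Assumption: these tags hold noise (scripts, styles, timestamps).
    NOISE_TAGS = {"script", "style", "time"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with noise tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def pages_differ(old_html, new_html):
    """Compare two snapshots as cleaned-up strings."""
    return visible_text(old_html) != visible_text(new_html)
```

With BeautifulSoup the same approach would remove the unwanted elements from the parsed tree and then compare the extracted text.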

If you are looking for an automatic solution, you can use difflib.SequenceMatcher to compute a similarity ratio between the two pages and compare it with a threshold.
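
Roughly like this. The 0.95 threshold is an arbitrary assumption and needs tuning against real snapshots; similarity and is_modified are just illustrative names.

```python
from difflib import SequenceMatcher

# Assumption: pages at least 95% similar are treated as unchanged.
SIMILARITY_THRESHOLD = 0.95


def similarity(old_text, new_text):
    """Return a ratio in [0, 1]; 1.0 means the texts are identical."""
    # autojunk=False turns off the heuristic that ignores frequent
    # characters in long sequences, which can distort the ratio.
    return SequenceMatcher(None, old_text, new_text, autojunk=False).ratio()


def is_modified(old_text, new_text, threshold=SIMILARITY_THRESHOLD):
    """Flag the page as modified when similarity drops below the threshold."""
    return similarity(old_text, new_text) < threshold
```

SequenceMatcher can be slow on large pages, so it may be worth comparing lists of words or lines instead of raw characters.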

+8

The solution really depends on whether you are scraping one specific site or trying to write a program that will work on any site.

You can see which areas change frequently by doing something like this:

  diff <(curl -s http://stackoverflow.com/questions/) <(sleep 15; curl -s http://stackoverflow.com/questions/)

If you are worried about only one site, you can create several sed expressions to filter out material such as timestamps. Repeat until the diff no longer shows the small, noisy fields.
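
Since the question asks for Python, here is a sketch of the same filter-and-diff loop with re and difflib instead of sed. The two patterns are hypothetical examples of timestamp noise; you would add patterns until the remaining diff only shows changes you care about.

```python
import re
from difflib import unified_diff

# Hypothetical noise filters; extend this list for the site in question.
NOISE_PATTERNS = [
    re.compile(r"\d{1,2}:\d{2}(:\d{2})?"),        # clock times, e.g. 14:05:33
    re.compile(r"\d+ (secs?|mins?|hours?) ago"),  # relative timestamps
]


def filter_noise(text):
    """Delete every match of every noise pattern."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    return text


def remaining_diff(old_text, new_text):
    """Return unified-diff lines left after filtering; empty if none."""
    old_lines = filter_noise(old_text).splitlines()
    new_lines = filter_noise(new_text).splitlines()
    return list(unified_diff(old_lines, new_lines, lineterm=""))
```

An empty result means the two snapshots differ only in the filtered fields.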

The general problem is much more complicated, and I would suggest comparing the total number of words on the page for starters.
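
A rough sketch of the word-count comparison. The 2% tolerance is an assumption, and the function names are made up here. Note that this check misses edits that leave the number of words unchanged.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def word_count_changed(old_text, new_text, tolerance=0.02):
    """Return True when the word count moved by more than `tolerance`.

    `tolerance` is a fraction of the old word count (assumed value: 2%).
    """
    old_count = word_count(old_text)
    new_count = word_count(new_text)
    if old_count == 0:
        return new_count != 0
    return abs(new_count - old_count) / old_count > tolerance
```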

+3

Something like Levenshtein Distance might come in handy if you set the threshold for changes to a distance that ignores the right amount of noise for you.
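
For illustration, a plain-Python sketch of Levenshtein distance with a threshold. The max_distance default of 10 is an arbitrary assumption. The running time grows with the product of the two lengths, so for whole pages you would probably run it on extracted text or use an optimized library.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_modified(old_text, new_text, max_distance=10):
    """Treat edits up to `max_distance` (assumed value) as noise."""
    return levenshtein(old_text, new_text) > max_distance
```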

0

Just hash each snapshot with MD5 or SHA1. If the values are different the next time you check, then the page has changed.
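
A minimal sketch with hashlib; the function names are invented for the example. Keep in mind that any single changed byte produces a different digest, so on its own this does not ignore noise such as date text. The snapshot would have to be cleaned before hashing.

```python
import hashlib


def snapshot_digest(content):
    """Return the SHA-1 hex digest of a snapshot given as bytes."""
    return hashlib.sha1(content).hexdigest()


def has_changed(old_content, new_content):
    """True when the two snapshots are not byte-for-byte identical."""
    return snapshot_digest(old_content) != snapshot_digest(new_content)
```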

-1