Determine if a webpage has changed

In my Python application, I have to read many web pages to collect data. To reduce HTTP requests, I would like to fetch only the pages that have been modified. My problem is that my code always tells me the pages have changed (status code 200), even when they actually have not.

This is my code:

from models import mytab
import re
import urllib2
from wsgiref.handlers import format_date_time
from datetime import datetime
from time import mktime

def url_change():
    urls = mytab.objects.all()
    # this is some urls:
    # http://www.venere.com/it/pensioni/venezia/pensione-palazzo-guardi/#reviews
    # http://www.zoover.it/italia/sardegna/cala-gonone/san-francisco/hotel
    # http://www.orbitz.com/hotel/Italy/Venice/Palazzo_Guardi.h161844/#reviews
    # http://it.hotels.com/ho292636/casa-del-miele-susegana-italia/
    # http://www.expedia.it/Venezia-Hotel-Palazzo-Guardi.h1040663.Hotel-Information#reviews
    # ...
    for url in urls:
        request = urllib2.Request(url.url)
        if url.last_date == None:
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()

        request.add_header("If-Modified-Since", url.last_date)

        try:
            response = urllib2.urlopen(request)  # Make the request
            # some actions
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()
        except urllib2.HTTPError, err:
            if err.code == 304:
                print "nothing...."
            else:
                print "Error code:", err.code
            pass

I don't understand what went wrong. Can anybody help me?

2 answers

A web server is not required to respond with 304 Not Modified when you send an If-Modified-Since header. It is free to respond with HTTP 200 and send the whole page again.

Sending "If-Modified-Since" or "If-None-Match" advises the server that you would like a cached response, if one is available. It's like sending the "Accept-Encoding: gzip, deflate" header: you are just telling the server that you will accept something, not requiring it.
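To make this concrete, here is a minimal sketch in Python 3 (`urllib.request` replaces `urllib2` there). It sends both validators, treats 304 as "unchanged", and falls back to comparing a hash of the body when the server answers 200 anyway. The function names and the `state` dictionary are my own, not part of any library. Storing the server's own Last-Modified and ETag values, rather than your local clock time, is an assumption about what works best and not something every server will honor.

```python
import hashlib
import urllib.error
import urllib.request


def conditional_headers(last_modified=None, etag=None):
    """Build the validator headers for a conditional GET (hypothetical helper)."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers


def body_digest(body):
    """Hash the response body so two downloads can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def fetch_if_changed(url, last_modified=None, etag=None, old_digest=None):
    """Return (changed, state); pass the state values back in on the next call."""
    request = urllib.request.Request(
        url, headers=conditional_headers(last_modified, etag)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = response.read()
            state = {
                # Validators as reported by the server, if it sent any.
                "last_modified": response.headers.get("Last-Modified"),
                "etag": response.headers.get("ETag"),
                "digest": body_digest(body),
            }
            # The server sent 200; only the hash can tell us if anything changed.
            return state["digest"] != old_digest, state
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # The server confirmed that nothing changed; keep the old state.
            return False, {
                "last_modified": last_modified,
                "etag": etag,
                "digest": old_digest,
            }
        raise
```

Note that the hash fallback is only as good as the page is stable: a page that embeds a timestamp or a session token in its HTML will probably hash differently on every request, so you may need to hash only the part you care about.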


A good way to check whether a site returns 304 is to use Google Chrome's dev tools. For example, load the BLS website with the dev tools open. Keep refreshing, and you will see that the server continues to return 304. If you force a refresh with Ctrl + F5 (Windows), you will see that it returns a status code of 200 instead.

You can use this technique on your own URLs to find out whether the server returned 304 or whether you formatted your request headers incorrectly. Sometimes a web page imports a resource that does not respect the If-* headers, and so it returns 200 no matter what you do (if any resource on the page does not return 304, the whole page will return 200). But sometimes you are only interested in a specific part of the website, and you can cheat by downloading that resource directly and bypassing the entire document.
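If you would rather run the same check from a script than from the browser, something like the following Python 3 sketch may help. It validates the If-Modified-Since value against the usual HTTP date layout before sending it, then reports the status code the server returns. The helper names are hypothetical, and the strict format check assumes the default C/English locale for day and month names.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.error
import urllib.request

# The preferred HTTP date layout, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
HTTP_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"


def http_date(moment):
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def is_http_date(value):
    """Return True if value matches the HTTP date layout above."""
    try:
        datetime.strptime(value, HTTP_DATE_FORMAT)
    except ValueError:
        return False
    return True


def status_for(url, if_modified_since):
    """Send one conditional GET and return the HTTP status code."""
    if not is_http_date(if_modified_since):
        raise ValueError("not an HTTP date: %r" % if_modified_since)
    request = urllib.request.Request(
        url, headers={"If-Modified-Since": if_modified_since}
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib reports 304 (and other non-2xx codes) as an HTTPError.
        return err.code
```

If `status_for` keeps returning 200 for a date that passes `is_http_date`, the header is probably fine and the server is simply ignoring it.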
