I am writing a Scrapy spider that crawls a set of URLs once a day. However, some of these websites are very large, so I can't crawl the full site daily, nor do I want to generate the massive traffic that would require.
An old question ( here ) asked something similar. However, the accepted answer simply points to a code snippet ( here ) that seems to require something from the request instance, although this is not explained in the answer or on the page containing the snippet.
I am trying to figure this out, but I find the middleware a bit confusing. A complete example of a scraper that can be run several times without revisiting URLs would be very useful, whether or not it uses that middleware.
I wrote the code below to get the ball rolling, but I'm not attached to this middleware. Any Scrapy spider that can run daily and only crawl new URLs will do. Obviously, one solution would be to keep a dictionary of already-scraped URLs on disk and check each new URL against it, but that seems very slow / inefficient.
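To illustrate, this is roughly what I mean by keeping a record of scraped URLs between runs (untested sketch; the file name and the JSON format are just placeholders I made up):

# Roughly what I mean by "a dictionary of scraped URLs" -- untested sketch
import json
import os

SEEN_URLS_FILE = "seen_urls.json"  # made-up file name

def load_seen_urls():
    """Return the set of URLs scraped on previous runs."""
    if os.path.exists(SEEN_URLS_FILE):
        with open(SEEN_URLS_FILE) as f:
            return set(json.load(f))
    return set()

def save_seen_urls(urls):
    """Persist the full set of scraped URLs for the next run."""
    with open(SEEN_URLS_FILE, "w") as f:
        json.dump(sorted(urls), f)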
Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from cnn_scrapy.items import NewspaperItem


class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = [
        "http://www.cnn.com/"
    ]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        self.log("Scraping: " + response.url)
        item = NewspaperItem()
        item["url"] = response.url
        yield item
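As far as I can tell, the snippet only filters requests that have meta['filter_visited'] set, so I'm guessing the rule would need a process_request hook along these lines (untested; the helper name is mine, and I believe newer Scrapy versions also pass the response to process_request):

def flag_for_filtering(request):
    # Mark every extracted request so IgnoreVisitedItems will consider it.
    request.meta['filter_visited'] = True
    return request

# would replace the rules tuple in the spider above
rules = (
    Rule(LinkExtractor(), callback="parse_item", follow=True,
         process_request=flag_for_filtering),
)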
Items
import scrapy


class NewspaperItem(scrapy.Item):
    url = scrapy.Field()
    visit_id = scrapy.Field()
    visit_status = scrapy.Field()
Middlewares (ignore.py)
from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from cnn_scrapy.items import NewspaperItem


class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they were already visited
    before. The requests to be filtered by have a meta['filter_visited'] flag
    enabled and optionally define an id to use for identifying them, which
    defaults the request fingerprint, although you'd want to use the item id,
    if you already have it beforehand to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                ret.append(NewspaperItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
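For completeness, here is how I think the middleware would be enabled in settings.py; the dotted path and the 543 priority are guesses on my part, based on where ignore.py sits in my project:

# settings.py (excerpt)
SPIDER_MIDDLEWARES = {
    # assumes ignore.py lives directly in the cnn_scrapy package
    'cnn_scrapy.ignore.IgnoreVisitedItems': 543,
}

I also suspect that spider.context (where visited_ids is kept) only lives for the duration of a single run, so it would need to be saved and reloaded somehow for daily crawls, which is part of what confuses me.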