Scrapy spider that only scans URLs once

I am writing a Scrapy spider that scans a set of URLs once a day. However, some of these websites are very large, so I can't crawl the full site every day, and I don't want to generate the massive traffic that would require.

An old question (here) asked something similar. However, the answer there simply points to a code snippet (here) that seems to require something to be set on the request instance, although this is not explained in the answer or on the page containing the snippet.

I am trying to work this out, but I find the middleware a bit confusing. A complete example of a scraper that can be run several times without revisiting URLs would be very useful, whether or not it uses that middleware.

I wrote the code below to get the ball rolling, but I don't necessarily have to use this middleware. Any Scrapy spider that can run daily and only scrape new URLs will do. Obviously, one solution is to simply keep a dictionary of already-scraped URLs and check each new URL against it, but that seems very slow/inefficient (a minimal sketch of that approach is shown below).
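
For reference, a minimal sketch of the dictionary/set approach with the seen URLs persisted to a JSON file between runs. The seen_urls.json file name and location are my own choices, not anything Scrapy-specific, and lookups in a Python set are O(1), so persistence rather than lookup speed is the real question:

import json
import os

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from cnn_scrapy.items import NewspaperItem

SEEN_FILE = "seen_urls.json"  # hypothetical path; any persistent store works


class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://www.cnn.com/"]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def __init__(self, *args, **kwargs):
        super(NewspaperSpider, self).__init__(*args, **kwargs)
        # Load the URLs collected on previous runs.
        if os.path.exists(SEEN_FILE):
            with open(SEEN_FILE) as f:
                self.seen_urls = set(json.load(f))
        else:
            self.seen_urls = set()

    def parse_item(self, response):
        # Skip anything already scraped on an earlier run.
        if response.url in self.seen_urls:
            return
        self.seen_urls.add(response.url)
        item = NewspaperItem()
        item["url"] = response.url
        yield item

    def closed(self, reason):
        # closed() is Scrapy's spider_closed shortcut; save the set for the next run.
        with open(SEEN_FILE, "w") as f:
            json.dump(sorted(self.seen_urls), f)

Note that this only stops duplicate items from being emitted; the pages are still downloaded, so it does not by itself reduce traffic. To avoid re-downloading as well, the same check has to happen before a request is scheduled, which is what the middleware below (or Scrapy's own duplicate filter with a persistent JOBDIR) is for.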

Spider

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from cnn_scrapy.items import NewspaperItem



class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = [
        "http://www.cnn.com/"
    ]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        self.log("Scraping: " + response.url)
        item = NewspaperItem()
        item["url"] = response.url
        yield item

Items

import scrapy


class NewspaperItem(scrapy.Item):
    url = scrapy.Field()
    visit_id = scrapy.Field()
    visit_status = scrapy.Field()

Middlewares (ignore.py)

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from cnn_scrapy.items import NewspaperItem

class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they were already visited
    before. The requests to be filtered by have a meta['filter_visited'] flag
    enabled and optionally define an id to use for identifying them, which
    defaults the request fingerprint, although you'd want to use the item id,
    if you already have it beforehand to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                ret.append(NewspaperItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
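
Two pieces of wiring are needed for this middleware to do anything, and neither is shown in the snippet: it has to be enabled as a spider middleware, and the requests it should filter must carry meta['filter_visited'] (the "something from the request instance" mentioned above). A rough sketch, assuming ignore.py lives directly inside the cnn_scrapy package (adjust the dotted path; the priority 543 is arbitrary):

# settings.py -- register the spider middleware
SPIDER_MIDDLEWARES = {
    "cnn_scrapy.ignore.IgnoreVisitedItems": 543,
}

# spider: tag every request produced by the crawl rule and give the
# middleware a place to keep its state (it reads/writes spider.context).
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


def tag_request(request):
    # In the scrapy.contrib-era API the Rule's process_request callable
    # receives only the request; newer Scrapy versions also pass the response.
    request.meta["filter_visited"] = True
    return request


class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://www.cnn.com/"]

    # Without a context attribute the middleware falls back to a throwaway
    # dict and never remembers which ids it has seen.
    context = {}

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True,
             process_request=tag_request),
    )

Even with this in place, visited_ids only lives in memory for the duration of one run; for the daily runs to skip each other's URLs, spider.context still has to be persisted somewhere (a file, a database, etc.) between runs.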

Answer

Here's the thing: what you really want is to be able to run this from cron. Whether or not you use dupflier.middleware, you would still be crawling the entire site on every run, and that seems like WAY too much.

So, given how CNN's article URLs are structured, couldn't you avoid crawling altogether and use CNN's RSS feeds to discover the new articles instead?

I'd also suggest looking at Scrapinghub and Scrapinghub's Python API client for scheduling and running the daily jobs.

Rather than a CrawlSpider, use an XMLFeedSpider (an RSS spider) and keep a db of the articles you've already seen, so each run only processes what's new.
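
If the RSS route sounds right, a rough sketch with Scrapy's XMLFeedSpider could look like the following. The feed URL is just one example of a CNN RSS feed, and de-duplication against previously seen articles still has to be added on top (same idea as above), since XMLFeedSpider does not do that by itself:

from scrapy.contrib.spiders import XMLFeedSpider

from cnn_scrapy.items import NewspaperItem


class CnnFeedSpider(XMLFeedSpider):
    name = "cnn_feed"
    allowed_domains = ["cnn.com"]
    # Example feed; add whichever section feeds you care about.
    start_urls = ["http://rss.cnn.com/rss/cnn_topstories.rss"]
    iterator = "iternodes"
    itertag = "item"

    def parse_node(self, response, node):
        # Each <item> node is one article; only the feed itself is downloaded,
        # so the daily traffic stays tiny compared with crawling the site.
        item = NewspaperItem()
        item["url"] = node.xpath("link/text()").extract()[0]
        yield item

Each run then only has to de-duplicate against the handful of URLs in the feed, which keeps the "db of seen articles" cheap to maintain.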
