Scrapy: CrawlSpider Rules process_links vs process_request vs download middleware

This is less of a "how can I use them?" question and more of a "when / why should I use them?" question.

EDIT: This question is almost a duplicate of this question, which suggests using downloader middleware to filter such requests. I've updated my question below to reflect this.

In the Scrapy CrawlSpider documentation, rules accept two callables, process_links and process_request (the documentation is quoted below for easier reference).

By default, Scrapy filters duplicate URLs, but I'm looking for additional filtering of requests, because I get duplicate pages that have several different URLs associated with them, such as:

 URL1 = "http://example.com/somePage.php?id=XYZ&otherParam=fluffyKittens"
 URL2 = "http://example.com/somePage.php?id=XYZ&otherParam=scruffyPuppies"

These URLs share a common element in the query string; as shown above, it is the id parameter.
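To make the dedup key concrete, here is a minimal sketch of extracting that shared id parameter with the standard library; the helper name page_id is my own, not from Scrapy:

```python
from urllib.parse import urlparse, parse_qs

def page_id(url):
    # Pull the "id" query parameter out of a URL so requests can be
    # deduplicated on it rather than on the full URL string.
    # Returns None if the URL has no "id" parameter.
    return parse_qs(urlparse(url).query).get("id", [None])[0]

url1 = "http://example.com/somePage.php?id=XYZ&otherParam=fluffyKittens"
url2 = "http://example.com/somePage.php?id=XYZ&otherParam=scruffyPuppies"
# Both URLs yield the same key, "XYZ", despite differing otherParam values.
```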

I think it would be wise to use the process_links callable in my spider to filter out repeated requests.

Questions:

  • Is there any reason why process_request would be better suited for this task?
  • If not, can you give an example of when process_request would be more applicable?
  • Is downloader middleware more suitable than process_links or process_request? If so, can you give an example of a case where process_links or process_request would still be the best solution?

Documentation:

process_links is a callable, or a string (in which case a method from the spider object with that name will be used), that will be called for each list of links extracted from each response using the specified link_extractor. It is mainly used for filtering.

process_request is a callable, or a string (in which case a method from the spider object with that name will be used), that will be called with every request extracted by this rule and must return a request or None (to filter out the request).

Answer:
  • No, process_links is your best bet, as you just filter the URLs and avoid the overhead of creating a Request in process_request only to discard it.

  • process_request is useful if you want to massage the Request a bit before sending it, say to add a meta argument, or to add or remove headers.

  • You do not need any middleware in your case, because the necessary functionality is built directly into Rule. If process_links were not built into the rules, you would need to create your own middleware.
