This is less, "how can I use them?" and more "when / why should I use them?" type of question.
EDIT: This question is nearly a duplicate of this question, which suggests using downloader middleware to filter such requests. I have updated my question below to reflect this.
In the Scrapy CrawlSpider documentation, rules accept two callables, process_links and process_request (the documentation is quoted below for easier reference).
By default, Scrapy filters duplicate URLs, but I'm looking for additional filtering of requests because I get duplicate pages that have several different URLs associated with them, such as:
URL1 = "http://example.com/somePage.php?id=XYZ&otherParam=fluffyKittens"
URL2 = "http://example.com/somePage.php?id=XYZ&otherParam=scruffyPuppies"
These URLs will, however, share a common element in their query strings; in the example above it is the id parameter.
I think it would be wise to use the process_links callable in my spider to filter out these repeated requests.
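For concreteness, here is roughly what I have in mind: a minimal sketch, not a working spider, where the spider name, URLs, and the dedupe_links / parse_item names are placeholders of my own, and only the first link seen for each id value is kept:

```python
import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    rules = (
        Rule(LinkExtractor(allow=r"somePage\.php"),
             process_links="dedupe_links",   # filter the extracted links
             callback="parse_item",
             follow=True),
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_ids = set()  # id values already scheduled

    def dedupe_links(self, links):
        """Keep only the first extracted link for each id= value."""
        unique = []
        for link in links:
            match = re.search(r"[?&]id=([^&]+)", link.url)
            key = match.group(1) if match else link.url
            if key not in self.seen_ids:
                self.seen_ids.add(key)
                unique.append(link)
        return unique

    def parse_item(self, response):
        # placeholder item extraction
        yield {"url": response.url}
```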
Questions:
- Is there any reason why process_request would be better suited for this task?
- If not, can you give an example of when process_request would be more applicable?
- Is downloader middleware more suitable than process_links or process_request? If so, can you give an example of when process_links or process_request would be the better solution? (A rough sketch of what I mean by the middleware approach follows this list.)
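My understanding of the middleware approach from the linked question is something like the sketch below. The class name and the settings path are made up; the real Scrapy pieces are scrapy.exceptions.IgnoreRequest and the DOWNLOADER_MIDDLEWARES setting:

```python
import re

from scrapy.exceptions import IgnoreRequest


class DuplicateIdMiddleware:
    """Downloader middleware that drops requests whose id= value was already seen."""

    def __init__(self):
        self.seen_ids = set()

    def process_request(self, request, spider):
        match = re.search(r"[?&]id=([^&]+)", request.url)
        if match:
            key = match.group(1)
            if key in self.seen_ids:
                raise IgnoreRequest(f"duplicate id {key}: {request.url}")
            self.seen_ids.add(key)
        return None  # let the request continue through the middleware chain


# In settings.py (module path and priority are placeholders):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.DuplicateIdMiddleware": 543,
# }
```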
Documentation:
process_links is a callable, or a string (in which case a method from the spider object with that name will be used), which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
process_request is a callable, or a string (in which case a method from the spider object with that name will be used), which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
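If I read this correctly, a Rule-level process_request filter would look roughly like the sketch below. The exact signature of the callable has varied between Scrapy versions (newer releases also pass the originating response), and the spider name and drop_duplicate_ids helper are hypothetical:

```python
import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

seen_ids = set()  # module-level for brevity; a spider attribute would also work


def drop_duplicate_ids(request, response=None):
    """Return None to drop requests whose id= value has already been seen."""
    match = re.search(r"[?&]id=([^&]+)", request.url)
    if match:
        key = match.group(1)
        if key in seen_ids:
            return None  # request is filtered out
        seen_ids.add(key)
    return request


class IdFilterSpider(CrawlSpider):
    name = "id_filter_example"
    start_urls = ["http://example.com/"]

    rules = (
        Rule(LinkExtractor(allow=r"somePage\.php"),
             process_request=drop_duplicate_ids,
             callback="parse_item"),
    )

    def parse_item(self, response):
        yield {"url": response.url}
```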