How can I use different pipelines for different spiders in one Scrapy project

I have a Scrapy project that contains several spiders. Is there any way to define which pipelines to use for which spider? Not all of the pipelines I have defined are applicable to every spider.

Thanks.

+66
python web-crawler scrapy
Dec 04 '11 at 2:08
9 answers

Based on the solution from Pablo Hoffman, you can apply the following decorator to the process_item method of a Pipeline object so that it checks the spider's pipeline attribute to decide whether it should run. For example:

import functools

from scrapy import log  # legacy Scrapy logging API used by this snippet


def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if this class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=log.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return item

    return wrapper

For this decorator to work correctly, the spider must have a pipeline attribute containing the pipeline classes you want it to be processed by, for example:

class MySpider(BaseSpider):

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item

And then in the pipelines.py file:

class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item


class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item

All of the pipeline classes still need to be listed in ITEM_PIPELINES in the settings, and in the correct order (it would be nice if the order could be specified on the spider as well).
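For reference, a minimal sketch of what that settings.py entry might look like; the package name myproject and the priority numbers are assumptions, not part of the original answer:

# settings.py (sketch; "myproject" and the priorities are assumed)
ITEM_PIPELINES = {
    'myproject.pipelines.Save': 100,
    'myproject.pipelines.Validate': 200,
}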

+32
Jan 04 '13 at 22:13

Just remove all pipelines from the project-wide settings and define them inside the spider instead.

This sets the pipeline on a per-spider basis:

class testSpider(InitSpider):
    name = 'test'

    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }
+88
Jan 07 '16 at 3:53

The other solutions given here are good, but they can be slow, because we are not really disabling the pipeline per spider; instead, we check whether the pipeline should run every time an item is returned (and in some cases this can reach millions of items).

A good way to completely disable (or enable) a feature per spider is to use custom_settings and from_crawler; this works for all extensions, like this:

pipelines.py

from scrapy.exceptions import NotConfigured


class SomePipeline(object):

    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('SOMEPIPELINE_ENABLED'):
            # if this isn't specified in settings, the pipeline will be completely disabled
            raise NotConfigured
        return cls()

    def process_item(self, item, spider):
        # change my item
        return item

settings.py

ITEM_PIPELINES = {
    'myproject.pipelines.SomePipeline': 300,
}

SOMEPIPELINE_ENABLED = True  # you could have the pipeline enabled by default

spider1.py

class Spider1(Spider):
    name = 'spider1'
    start_urls = ["http://example.com"]

    custom_settings = {
        'SOMEPIPELINE_ENABLED': False
    }

As you can see, we specified custom_settings, which overrides the values in settings.py, and we disabled SOMEPIPELINE_ENABLED for this spider.

Now when you run this spider, check the log for something like:

 [scrapy] INFO: Enabled item pipelines: [] 

Now Scrapy has turned the pipeline off completely, without touching it for the entire run. Note that this also works for Scrapy extensions and middlewares.
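As a rough illustration of that last point, the same pattern applied to a downloader middleware could look like the sketch below; the setting name SOMEMIDDLEWARE_ENABLED and the class are made up for the example:

# middlewares.py (sketch; SOMEMIDDLEWARE_ENABLED is a hypothetical setting)
from scrapy.exceptions import NotConfigured


class SomeDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('SOMEMIDDLEWARE_ENABLED'):
            # raising NotConfigured removes this middleware for the whole crawl
            raise NotConfigured
        return cls()

    def process_request(self, request, spider):
        # returning None lets the request continue through the rest of the chain
        return None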

+12
Oct 30 '15 at 22:46

I can think of at least four approaches:

  • Use a different Scrapy project per set of spiders + pipelines (this may be appropriate if your spiders are different enough to belong in different projects).
  • On the scrapy tool command line, change the pipeline setting with scrapy settings between each invocation of your spider.
  • Isolate your spiders into their own scrapy tool commands and define default_settings['ITEM_PIPELINES'] on each command class with the pipeline list you want for that command. See line 6 of this example (a minimal sketch follows this list).
  • In the pipeline classes themselves, have process_item() check which spider it is running for and do nothing if it should be ignored for that spider. See the example using per-spider resources to get started. (This seems like an ugly solution because it tightly couples spiders and item pipelines; you should probably not use this one.)
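For the third option, a minimal sketch of a custom scrapy tool command; the module path, command name, and pipeline entry are assumptions for illustration:

# myproject/commands/crawl_products.py (sketch; all names here are hypothetical)
from scrapy.commands.crawl import Command as CrawlCommand


class Command(CrawlCommand):
    # these settings are applied only when this command is used
    default_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.Save': 100,
        },
    }

You would also point Scrapy at the commands package with COMMANDS_MODULE = 'myproject.commands' in settings.py.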
+10
Dec 04

You can use the spider's name attribute in your pipeline:

class CustomPipeline(object):

    def process_item(self, item, spider):
        if spider.name == 'spider1':
            # do something
            return item
        return item

Defining all pipelines in this way can accomplish what you want.
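Note that with this approach all of the pipelines still have to be registered in ITEM_PIPELINES. A minimal sketch, assuming a project package named myproject:

# settings.py (sketch; the package name and priority are assumptions)
ITEM_PIPELINES = {
    'myproject.pipelines.CustomPipeline': 300,
}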

+8
Dec 27 '14 at 15:52

You can simply set the item pipeline settings inside the spider, like this:

class CustomSpider(Spider):
    name = 'custom_spider'

    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.PagePipeline': 400,
            '__main__.ProductPipeline': 300,
        },
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2
    }

Then I can split the pipelines (or even use several pipelines) by adding a value to the loader / returned item that identifies which part of the spider sent the item. This way I won't get any KeyError exceptions and I know which items should be handled where.

    ...

    def scrape_stuff(self, response):
        pageloader = PageLoader(PageItem(), response=response)

        pageloader.add_xpath('entire_page', '/html//text()')
        pageloader.add_value('item_type', 'page')

        yield pageloader.load_item()

        productloader = ProductLoader(ProductItem(), response=response)

        productloader.add_xpath('product_name', '//span[contains(text(), "Example")]')
        productloader.add_value('item_type', 'product')

        yield productloader.load_item()


class PagePipeline:
    def process_item(self, item, spider):
        if item['item_type'] == 'product':
            # do product stuff
            pass

        if item['item_type'] == 'page':
            # do page stuff
            pass

        return item
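A ProductPipeline written in the same style might look like this sketch, with the product handling left as a placeholder:

class ProductPipeline:
    def process_item(self, item, spider):
        # only act on items that the loader tagged as products
        if item['item_type'] == 'product':
            # do product stuff
            pass
        return item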
+2
Feb 02 '19 at 8:55

I am using two pipelines, one for image download (MyImagesPipeline) and one for saving data to MongoDB (MongoPipeline).

Suppose we have many spiders (spider1, spider2, ...); in my example, spider1 and spider5 should not use MyImagesPipeline.

settings.py

ITEM_PIPELINES = {
    'scrapycrawler.pipelines.MyImagesPipeline': 1,
    'scrapycrawler.pipelines.MongoPipeline': 2,
}

IMAGES_STORE = '/var/www/scrapycrawler/dowload'

And the complete pipeline code:

import scrapy
import string
import pymongo

from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):

    def process_item(self, item, spider):
        if spider.name not in ['spider1', 'spider5']:
            return super(ImagesPipeline, self).process_item(item, spider)
        else:
            return item

    def file_path(self, request, response=None, info=None):
        image_name = string.split(request.url, '/')[-1]
        dir1 = image_name[0]
        dir2 = image_name[1]
        return dir1 + '/' + dir2 + '/' + image_name


class MongoPipeline(object):

    collection_name = 'scrapy_items'
    collection_url = 'snapdeal_urls'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'scraping')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # self.db[self.collection_name].insert(dict(item))
        collection_name = item.get('collection_name', self.collection_name)
        self.db[collection_name].insert(dict(item))
        data = {}
        data['base_id'] = item['base_id']
        self.db[self.collection_url].update(
            {'base_id': item['base_id']},
            {'$set': {'image_download': 1}},
            upsert=False, multi=True)
        return item
0
Jun 30 '16 at 13:33

We can use conditions in the pipeline, like this:

# -*- coding: utf-8 -*-
from scrapy_app.items import x


class SaveItemPipeline(object):

    def process_item(self, item, spider):
        if isinstance(item, x):
            item.save()
        return item
0
Oct 23 '18 at 9:42

A simple but useful solution.

Spider code

def parse(self, response):
    item = {}

    # ... do parse stuff

    item['info'] = {'spider': 'Spider2'}

    yield item

Pipeline code

import logging


def process_item(self, item, spider):
    if item['info']['spider'] == 'Spider1':
        logging.error('Spider1 pipeline works')
    elif item['info']['spider'] == 'Spider2':
        logging.error('Spider2 pipeline works')
    elif item['info']['spider'] == 'Spider3':
        logging.error('Spider3 pipeline works')
    return item

Hope this saves time for someone!

0
Jun 21 '19