Python, Scrapy, Pipeline: process_item function is never called

I have the very simple code shown below. The scraper itself works fine; I can see all the print statements producing the correct data, and the pipeline's initialization also runs. However, process_item is never called: the print statement at the beginning of that function never executes.

Spider: comosham.py

    import scrapy
    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from scrapy.http import Request
    from activityadvisor.items import ComoShamLocation
    from activityadvisor.items import ComoShamActivity
    from activityadvisor.items import ComoShamRates
    import re


    class ComoSham(Spider):
        name = "comosham"
        allowed_domains = ["www.comoshambhala.com"]
        start_urls = [
            "http://www.comoshambhala.com/singapore/classes/schedules",
            "http://www.comoshambhala.com/singapore/about/location-contact",
            "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes",
            "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes/rates-private-classes"
        ]

        def parse(self, response):
            category = response.url[39:44]
            print 'in parse'
            if category == 'class':
                pass
                """self.gen_req_class(response)"""
            elif category == 'about':
                print 'about to call parse_location'
                self.parse_location(response)
            elif category == 'rates':
                pass
                """self.parse_rates(response)"""
            else:
                print 'Cant find appropriate category! check check check!! Am raising Level 5 ALARM - You are a MORON :D'

        def parse_location(self, response):
            print 'in parse_location'
            item = ComoShamLocation()
            item['category'] = 'location'
            loc = Selector(response).xpath('((//div[@id = "node-2266"]/div/div/div)[1]/div/div/p//text())').extract()
            item['address'] = loc[2] + loc[3] + loc[4] + loc[5][1:11]
            item['pin'] = loc[5][11:18]
            item['phone'] = loc[9][6:20]
            item['fax'] = loc[10][6:20]
            item['email'] = loc[12]
            print item['address'], item['pin'], item['phone'], item['fax'], item['email']
            return item

Data File:

    import scrapy
    from scrapy.item import Item, Field


    class ComoShamLocation(Item):
        address = Field()
        pin = Field()
        phone = Field()
        fax = Field()
        email = Field()
        category = Field()

Pipeline file:

    import csv


    class ComoShamPipeline(object):
        def __init__(self):
            self.locationdump = csv.writer(open('./scraped data/ComoSham/ComoshamLocation.csv', 'wb'))
            self.locationdump.writerow(['Address', 'Pin', 'Phone', 'Fax', 'Email'])

        def process_item(self, item, spider):
            print 'processing item now'
            if item['category'] == 'location':
                print item['address'], item['pin'], item['phone'], item['fax'], item['email']
                self.locationdump.writerow([item['address'], item['pin'], item['phone'], item['fax'], item['email']])
            else:
                pass
4 answers

Your problem is that you never yield the item. parse_location returns an item, but parse never yields that item.

The solution would be to replace:

 self.parse_location(response) 

with

 yield self.parse_location(response) 

More specifically, process_item will never be called if no items are yielded.
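
As a sketch, the relevant branch of the question's parse callback would then look roughly like this (only the 'about' branch shown):

    def parse(self, response):
        category = response.url[39:44]
        if category == 'about':
            # yield hands the item built by parse_location back to Scrapy,
            # which then routes it through the enabled item pipelines
            yield self.parse_location(response)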


Use ITEM_PIPELINES in settings.py:

 ITEM_PIPELINES = ['project_name.pipelines.pipeline_class'] 
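
For what it's worth, this list form comes from older Scrapy releases; in newer versions ITEM_PIPELINES is a dict that maps each pipeline's import path to an integer order (0-1000, lower runs first). Using the project and class names from the question, it would look roughly like:

    # settings.py (dict form used by newer Scrapy versions)
    ITEM_PIPELINES = {
        'activityadvisor.pipelines.ComoShamPipeline': 300,  # lower number = runs earlier
    }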

Adding to the answers above:

1. Remember to add the following line to settings.py: ITEM_PIPELINES = {'[YOUR_PROJECT_NAME].pipelines.[YOUR_PIPELINE_CLASS]': 300}
2. Yield the item from your spider: yield my_item


This solved my problem: all the items were being dropped before this pipeline was called, so its process_item() was never invoked, although open_spider and close_spider were. So my solution was simply to change the order, putting this pipeline before the other pipeline that drops items.

Scrapy Pipeline Documentation

Just remember that Scrapy calls Pipeline.process_item() only if there is an item to process!
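
To illustrate the ordering issue described above (the class names here are hypothetical, not from the question): once a pipeline raises DropItem, pipelines with a higher order number never receive that item, so their process_item() is never called for it.

    # Hypothetical pipelines illustrating the ordering issue described above.
    from scrapy.exceptions import DropItem

    class DropNonLocationPipeline(object):
        def process_item(self, item, spider):
            if item.get('category') != 'location':
                raise DropItem('not a location item')  # later pipelines never see it
            return item

    class CsvWriterPipeline(object):
        def process_item(self, item, spider):
            # only reached for items that were not dropped earlier
            return item

    # settings.py: the order number decides who runs first (lower = earlier).
    # With 100 < 200 the CSV writer sees every item; swapping the numbers
    # would let DropNonLocationPipeline discard items before the writer runs.
    # ITEM_PIPELINES = {
    #     'activityadvisor.pipelines.CsvWriterPipeline': 100,
    #     'activityadvisor.pipelines.DropNonLocationPipeline': 200,
    # }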

