Understanding Callbacks in Scrapy

I am new to Python and Scrapy, and I have not used callback functions before. I am trying to use one now in the code below. The first request will be executed, and its response will be passed to the callback function given as the second argument:

    def parse_page1(self, response):
        item = MyItem()
        item['main_url'] = response.url
        request = Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)
        request.meta['item'] = item
        return request

    def parse_page2(self, response):
        item = response.meta['item']
        item['other_url'] = response.url
        return item

I cannot understand the following:

  • How is the item populated?
  • Does the request.meta line run before the response.meta line in parse_page2?
  • Where does the item returned from parse_page2 end up?
  • Why is return request needed in parse_page1? I thought the extracted items were supposed to be returned here.
python callback scrapy

3 answers

Read the docs:

For spiders, the scraping cycle goes through something like this:

  • You start by generating the initial requests to crawl the first URLs, and you specify a callback function to be called with the response downloaded from those requests.

    The first requests to perform are obtained by calling start_requests(), which (by default) generates a Request for each URL in start_urls, with the parse method as the callback for those requests.

  • In the callback function, you parse the response (web page) and return Item objects, Request objects, or an iterable of both. Those requests will also contain a callback (possibly the same one), will then be downloaded by Scrapy, and their responses will be handled by the specified callback.

  • In the callback functions, you parse the page contents, typically with Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer), and generate items with the parsed data.

  • Finally, the items returned from the spider are typically persisted to a database (in some Item Pipeline) or written to a file using feed exports.
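To make that cycle concrete, here is a minimal spider sketch; the site, selectors, and field names below are invented for illustration:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        # start_requests() will, by default, build Requests for these URLs
        # and use self.parse as their callback
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            # parse the downloaded page, typically with Selectors
            for title in response.css("h2::text").getall():
                # yielded items are handed to item pipelines / feed exports
                yield {"title": title}
            # yielded Requests are scheduled, and their responses are
            # processed by the callback given here
            next_page = response.css("a.next::attr(href)").get()
            if next_page is not None:
                yield scrapy.Request(response.urljoin(next_page),
                                     callback=self.parse)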

Answers:

How is 'item' populated? Does the request.meta line run before the response.meta line in parse_page2?

Spiders are driven by the Scrapy engine. First, it fires requests for the URLs specified in start_urls and passes them to the downloader. When downloading finishes, the callback specified in the request is called. If the callback returns another request, the same cycle repeats. If the callback returns an Item, the item is passed to a pipeline to save the scraped data.
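For that last step, here is a minimal sketch of an Item Pipeline; the class name and file format are my own choices, but open_spider, close_spider, and process_item are the methods Scrapy calls on a pipeline, and it would still need to be enabled in the ITEM_PIPELINES setting:

    # pipelines.py
    import json

    class SaveToFilePipeline:
        def open_spider(self, spider):
            self.file = open("items.jl", "w")

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # persist the scraped data, here as one JSON object per line
            self.file.write(json.dumps(dict(item)) + "\n")
            return item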

Where does the item returned from parse_page2 end up?

What is the need for return request in parse_page1? I thought the extracted items were supposed to be returned here?

As indicated in the docs, each callback (both parse_page1 and parse_page2) can return either a Request or an Item (or an iterable of them). parse_page1 returns a Request, not an Item, because additional information needs to be scraped from an additional URL. The second callback, parse_page2, returns an item because all the information has been scraped and is ready to be sent to a pipeline.
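For completeness, the "iterable of them" case looks like this: a callback may yield an item and a follow-up request in the same pass. This is a sketch of a slightly different scenario than the question's code, reusing its example URLs:

    def parse_page1(self, response):
        # an item that is already complete can be yielded right away...
        yield {"main_url": response.url}
        # ...while a Request is yielded for data that lives on another page
        yield scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)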

  • Yes: Scrapy uses a Twisted reactor to invoke spider functions, so a single event loop on a single thread ensures that (i.e., request.meta has been set before parse_page2 reads response.meta).
  • The caller of a spider function expects to receive either item(s) or request(s) in return; requests are queued for future processing, and items are sent to the configured pipelines.
  • Saving an item (or any other data) in request.meta makes sense only if it is needed for further processing after a later response is received; otherwise it is obviously better to simply return it from parse_page1 and avoid the extra HTTP request (see the sketch after this list).
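That is, if everything you need is already on page 1, the meta round-trip is unnecessary; a sketch under that assumption:

    def parse_page1(self, response):
        item = MyItem()
        item['main_url'] = response.url
        # nothing left to fetch, so hand the item straight to the pipelines
        return item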

In scrapy: understanding how items and requests work between callbacks, eLRuLL's answer is wonderful.

I want to add the item-passing part. First, let's make clear that a callback function is only called once the response to its request has been downloaded.

The code in the Scrapy docs does not declare the URL and request of page 1. Let's say the URL of page 1 is "http://www.example.com.html".

parse_page1 is the callback of:

    scrapy.Request("http://www.example.com.html", callback=self.parse_page1)

parse_page2 is the callback of:

    scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2)

When the response of page1 is downloaded, parse_page1 is called to generate the page2 request:

    item['main_url'] = response.url   # save "http://www.example.com.html" in the item
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item       # store the item in request.meta

After the response of page2 is downloaded, parse_page2 is called to complete the item:

    item = response.meta['item']      # response.meta equals the request.meta set above,
                                      # so item['main_url'] is "http://www.example.com.html"
    item['other_url'] = response.url  # response.url = "http://www.example.com/some_page.html"
    return item                       # finally, we get an item recording the URLs of page1 and page2
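As a side note, Scrapy 1.7+ also offers cb_kwargs as a cleaner way to pass data between callbacks, avoiding meta; a sketch of the same two-page flow using it:

    def parse_page1(self, response):
        item = MyItem()
        item['main_url'] = response.url
        yield scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2,
                             cb_kwargs={'item': item})  # delivered as a keyword argument

    def parse_page2(self, response, item):
        item['other_url'] = response.url
        yield item  # both URLs recorded; the item goes on to the pipelines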