Scrapy: Understanding How Items and Queries Work Between Callbacks

I am struggling with Scrapy, and I do not understand how exactly items are passed between callbacks. Maybe someone can help me.

I am looking at the example at http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions

    def parse_page1(self, response):
        item = MyItem()
        item['main_url'] = response.url
        request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
        request.meta['item'] = item
        return request

    def parse_page2(self, response):
        item = response.meta['item']
        item['other_url'] = response.url
        return item

I am trying to understand the sequence of actions there, step by step:

[parse_page1]

  1. item = MyItem() <- an item object is created
  2. item['main_url'] = response.url <- we set the item's main_url field
  3. request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2) <- we request a new page and run parse_page2 to parse it

[parse_page2]

  4. item = response.meta['item'] <- This is where I get lost. Are we creating a new item object, or is this the same item created in [parse_page1]? And what does response.meta['item'] mean? In step 3 we passed the request only a URL and a callback; we did not add any extra arguments that we could refer to here...
  5. item['other_url'] = response.url <- we set the item's other_url field
  6. return item <- we return the item as the result of the request

[parse_page1]

  7. request.meta['item'] = item <- Are we attaching the item to the request? But the request is already finished; the callback already returned the item in step 6????
  8. return request <- we return the request as the result, so the item is the one from step 6, am I right?

I looked through all the documentation on scrapy and request/response/meta, but I still do not understand what happens in steps 4 and 7.
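One way to untangle steps 4 and 7 is to trace the actual order of execution. The sketch below is plain Python, not Scrapy (Request, Response, and the little "engine" at the bottom are simplified stand-ins); it logs each step as it runs, showing that step 3 downloads nothing, steps 7 and 8 run immediately after it, and steps 4-6 only run later, once the second page has been downloaded:

```python
# Plain-Python trace of the execution order. Request/Response here are
# simplified stand-ins, NOT the real scrapy classes.
log = []

class Request:
    def __init__(self, url, callback, meta=None):
        self.url, self.callback = url, callback
        self.meta = {} if meta is None else meta

class Response:
    def __init__(self, request):
        # Scrapy exposes the originating request's meta on the response.
        self.url, self.meta = request.url, request.meta

def parse_page1(response):
    item = {}                                   # step 1
    log.append('step 1: item created')
    item['main_url'] = response.url             # step 2
    log.append('step 2: main_url set')
    request = Request('http://www.example.com/some_page.html',
                      callback=parse_page2)     # step 3: nothing downloaded yet!
    log.append('step 3: request built')
    request.meta['item'] = item                 # step 7 in the question's order
    log.append('step 7: item attached to request.meta')
    log.append('step 8: request returned')
    return request                              # step 8

def parse_page2(response):
    item = response.meta['item']                # step 4: the SAME item object
    log.append('step 4: item taken from response.meta')
    item['other_url'] = response.url            # step 5
    log.append('step 5: other_url set')
    log.append('step 6: item returned')
    return item                                 # step 6

# Mini "engine": download page1, run its callback, then download the
# request it returned and run that request's callback.
req = parse_page1(Response(Request('http://www.example.com.html', parse_page1)))
log.append('engine: downloading ' + req.url)
item = parse_page2(Response(req))
print('\n'.join(log))
```

The printed log shows the order 1, 2, 3, 7, 8, then the download, then 4, 5, 6, which is exactly the part the numbered walkthrough above gets wrong.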

2 answers
    line 4: request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2)
    line 5: request.meta['item'] = item
    line 6: return request

You are confused by this part of the code, so let me explain it (I have listed it above):

  • On line 4 you create an instance of the scrapy.Request object. It does not work like other request libraries: at this point the URL is not being fetched, and the callback function is not being executed yet.

  • On line 5 you attach extra data to the scrapy.Request object. For example, you could instead declare the scrapy.Request object like this:

     request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2, meta={'item': item})

    and skip line 5 entirely.

  • On line 6 you return the scrapy.Request object. When scrapy processes it, i.e. downloads the specified URL, it calls the specified callback and passes meta along with it. You could also have avoided lines 5 and 6 as written by returning the request directly:

     return scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2, meta={'item': item})

So the idea is that your callback methods should return (preferably yield ) a Request or an Item ; scrapy will output the Item and schedule the Request to continue crawling.
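To make that "yield a Request or an Item" idea concrete, here is a minimal plain-Python simulation of what the engine loop does with each yielded value. FakeRequest, FakeResponse, and the run() function are invented stand-ins, not Scrapy classes:

```python
# Sketch of the engine loop: items get collected, requests get followed.
# FakeRequest/FakeResponse/run are simplified stand-ins, not real Scrapy.

class FakeRequest:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = {} if meta is None else meta

class FakeResponse:
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta  # meta travels from request to response

def parse_page1(response):
    item = {'main_url': response.url}
    # Yield a request: the engine will download it and call parse_page2.
    yield FakeRequest("http://www.example.com/some_page.html",
                      callback=parse_page2, meta={'item': item})

def parse_page2(response):
    item = response.meta['item']     # the same dict built in parse_page1
    item['other_url'] = response.url
    yield item                       # yield an item: collected, not crawled

def run(start_request):
    """Tiny engine: follow every yielded request, collect every item."""
    items, queue = [], [start_request]
    while queue:
        request = queue.pop(0)
        response = FakeResponse(request)  # pretend we downloaded the page
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)      # schedule another download
            else:
                items.append(result)      # output the item
    return items

items = run(FakeRequest("http://www.example.com.html", callback=parse_page1))
print(items)
# [{'main_url': 'http://www.example.com.html',
#   'other_url': 'http://www.example.com/some_page.html'}]
```

The real engine also deduplicates, throttles, and downloads concurrently, but the branching on "is this a request or an item?" is the core of how your callbacks interact with it.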


@eLRuLL's answer is great. I want to add the item-passing part. First, let's make it clear that a callback function is only invoked after the response to its request has been downloaded.

The code in the scrapy docs does not declare the URL or the request of page1. Let's assume the page1 URL is "http://www.example.com.html".

[parse_page1] is the callback of

    scrapy.Request("http://www.example.com.html", callback=parse_page1)

[parse_page2] is the callback of

    scrapy.Request("http://www.example.com/some_page.html", callback=parse_page2)

When the response of page1 has been downloaded, parse_page1 is called to generate the page2 request:

    item['main_url'] = response.url  # store "http://www.example.com.html" in the item
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item      # store the item in request.meta

After the response of page2 has been downloaded, parse_page2 is called to finish the item:

    item = response.meta['item']      # response.meta equals request.meta, so here
                                      # item['main_url'] == "http://www.example.com.html"
    item['other_url'] = response.url  # response.url == "http://www.example.com/some_page.html"
    return item                       # finally, we get an item recording the urls of page1 and page2
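To answer the original question about step 4 directly: no new item is created there, because response.meta is the very same dict as request.meta, so response.meta['item'] is the very same object built in parse_page1. The small plain-Python simulation below demonstrates the identity (Request and Response here are simplified stand-ins, not the real Scrapy classes):

```python
# Simplified stand-ins for scrapy.Request/Response, just to show that meta
# travels with the request and comes back on the response unchanged.

class Request:
    def __init__(self, url, callback=None, meta=None):
        self.url = url
        self.callback = callback
        self.meta = {} if meta is None else meta

class Response:
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta  # the SAME dict object, not a copy

# --- what parse_page1 does ---
item = {'main_url': 'http://www.example.com.html'}
request = Request('http://www.example.com/some_page.html',
                  meta={'item': item})

# --- the engine "downloads" the page and builds a response ---
response = Response(request)

# --- what parse_page2 does ---
item2 = response.meta['item']
item2['other_url'] = response.url

print(item2 is item)   # True: the very same object created in parse_page1
print(item)
# {'main_url': 'http://www.example.com.html',
#  'other_url': 'http://www.example.com/some_page.html'}
```

Because it is the same object, mutating it in parse_page2 (adding other_url) is what completes the item before it is returned.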
