How can I group data scraped from multiple pages with Scrapy into one item?

I am trying to collect several pieces of information about a bunch of different websites. I want to produce one Item per site that summarizes everything I found about that site, regardless of which page I found it on.

It seems to me that this should be an item pipeline, like the duplicates filter example, except that I need the final contents of the Item, not the results from the first page the crawler visited.

So I tried using request.meta to pass a single partially-filled Item through the successive Requests for a site. To make that work, my parse callback has to return exactly one new Request per call until there are no more pages to visit, and then finally return the finished Item. That is a pain if I find several links I want to follow, and it breaks completely if the scheduler drops one of the requests because of a link cycle.
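This is roughly the pattern I mean (a minimal sketch with a recent Scrapy; the item fields, selectors, and URL are placeholders, not my real code):

    import scrapy

    class SiteItem(scrapy.Item):      # hypothetical item summarizing one site
        site = scrapy.Field()
        emails = scrapy.Field()

    class SiteSpider(scrapy.Spider):
        name = 'site'
        start_urls = ['http://example.com/']  # made-up URL

        def parse(self, response):
            # Reuse the partially filled Item if one was passed along, else start one
            item = response.meta.get('item') or SiteItem(site=response.url, emails=[])
            item['emails'] += response.css('a[href^="mailto:"]::attr(href)').getall()

            links = response.css('a.internal::attr(href)').getall()  # made-up selector
            if links:
                # Forced to follow exactly ONE link per callback so the Item travels
                # down a single chain of requests; this is the painful part
                yield response.follow(links[0], self.parse, meta={'item': item})
            else:
                yield item  # no more pages: the Item is finally complete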

The only other approach I can see is to dump the spider output to JSON Lines and post-process it with an external tool. But I would rather fold this into the spider, ideally in a middleware or an item pipeline. How can I do this?

3 Answers

How about this ugly solution?

Define a dict (a defaultdict(list)) in a pipeline to hold the data for each site. In process_item you can simply append dict(item) to the per-site list and raise DropItem. Later, in the close_spider method, you can dump the data however you like.

It should work in theory, though I'm not sure this solution is the best one.
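A minimal sketch of such a pipeline (the 'site' key is a hypothetical item field identifying which website an item belongs to, and spider.logger assumes a recent Scrapy):

    from collections import defaultdict

    from scrapy.exceptions import DropItem

    class SiteSummaryPipeline(object):
        """Accumulate partial items per site; emit nothing until the spider closes."""

        def __init__(self):
            self.sites = defaultdict(list)

        def process_item(self, item, spider):
            # Stash a plain-dict copy under the site it belongs to
            self.sites[item['site']].append(dict(item))
            raise DropItem('accumulated; merged in close_spider')

        def close_spider(self, spider):
            for site, items in self.sites.items():
                # Dump or merge the collected data however you like here
                spider.logger.info('%s: %d partial items collected', site, len(items))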


If all you need is a summary, collecting statistics would be another approach: http://doc.scrapy.org/en/0.16/topics/stats.html

For example, to get the total number of pages crawled on each website, use the following code:

    stats.inc_value('pages_crawled:%s' % socket.gethostname())
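For context, with current Scrapy versions the stats collector is reached through the crawler; here is a minimal sketch (the spider name and URL are made up, and note that socket.gethostname() returns the name of the machine running the crawl, so for a per-website count you would likely key on the response's host instead):

    from urllib.parse import urlparse

    import scrapy

    class StatsSpider(scrapy.Spider):
        name = 'stats_example'                # hypothetical spider
        start_urls = ['http://example.com/']  # made-up URL

        def parse(self, response):
            # Count pages per crawled site by keying on the response's host
            site = urlparse(response.url).netloc
            self.crawler.stats.inc_value('pages_crawled:%s' % site)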

I had the same problem when I was writing my crawler.

I solved it by passing a list of URLs through the request's meta and chaining the requests together.

See the detailed tutorial I wrote here.
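As a rough sketch of that idea (the URLs and selector are invented; yielding a plain dict as the final record works in recent Scrapy):

    import scrapy

    class ChainSpider(scrapy.Spider):
        name = 'chain'

        def start_requests(self):
            urls = ['http://example.com/a', 'http://example.com/b']  # made-up URLs
            yield scrapy.Request(urls[0], callback=self.parse,
                                 meta={'pending': urls[1:], 'data': {}})

        def parse(self, response):
            data = response.meta['data']
            data[response.url] = response.css('title::text').get()  # collect something
            pending = response.meta['pending']
            if pending:
                # Chain to the next URL, carrying the accumulated data along;
                # dont_filter keeps the dupe filter from breaking the chain
                yield scrapy.Request(pending[0], callback=self.parse,
                                     meta={'pending': pending[1:], 'data': data},
                                     dont_filter=True)
            else:
                yield data  # all pages visited: emit the combined record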

