I'm trying to scrape several pieces of information about a bunch of different websites. I want to produce one Item per site that summarizes everything I found about that site, regardless of which page I found it on.
It feels like this should be a job for an item pipeline, like the duplicates filter example, except that I need the final contents of the Item, not the results from the first page the spider looked at.
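To make that concrete, here is a minimal sketch of the kind of pipeline I have in mind, modeled loosely on the DuplicatesPipeline example from the Scrapy docs (the `site` key, the field names, and the merge logic are made-up placeholders). The problem is exactly what I described: it exports the first version of each item, while the merged version just sits in memory:

```python
from scrapy.exceptions import DropItem

class SiteMergePipeline:
    """Hypothetical pipeline: accumulate one merged item per site."""

    def __init__(self):
        self.items_by_site = {}  # keyed by a hypothetical 'site' field

    def process_item(self, item, spider):
        site = item["site"]
        if site in self.items_by_site:
            # Fold any newly discovered fields into the stored item,
            # then drop this partial one.
            for field, value in item.items():
                self.items_by_site[site].setdefault(field, value)
            raise DropItem(f"merged into existing item for {site}")
        self.items_by_site[site] = item
        # This exports the *first* version seen for the site,
        # not the final merged contents -- which is my problem.
        return item
```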
So I tried using request.meta to pass a single partially-filled Item through the successive Requests for a given site. To make that work, my parse callback has to return exactly one new Request per call until it runs out of pages to visit, and then finally return the finished Item. That is a pain if I find several links I want to follow, and it breaks completely if the scheduler drops one of the Requests because of a link cycle.
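Here is roughly what that attempt looks like, stripped down (the URLs, selectors, and field names are all hypothetical):

```python
import scrapy

class SiteSpider(scrapy.Spider):
    name = "site"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        item = {"site": response.url, "title": response.css("title::text").get()}
        pages = response.css("a::attr(href)").getall()
        # I can only safely follow ONE link per callback; otherwise the
        # copies of the item carried in meta diverge from each other.
        if pages:
            yield response.follow(
                pages[0],
                callback=self.parse_next,
                meta={"item": item, "pending": pages[1:]},
            )
        else:
            yield item

    def parse_next(self, response):
        item = response.meta["item"]
        item.setdefault("contact", response.css("a.contact::text").get())
        pending = response.meta["pending"]
        if pending:
            # If the scheduler drops this request (e.g. as a duplicate
            # in a link cycle), the partially-filled item is lost.
            yield response.follow(
                pending[0],
                callback=self.parse_next,
                meta={"item": item, "pending": pending[1:]},
            )
        else:
            yield item  # only now is the Item complete
```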
The only other approach I can see is to dump the spider output to json-lines and post-process it with an external tool (sketched below). But I'd rather fold this into the spider itself, preferably in a spider middleware or an item pipeline. How can I do that?
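For completeness, this is the kind of external merge step I mean and would like to avoid (file paths and field names are hypothetical):

```python
import json

# Merge the json-lines crawl output into one record per site.
merged = {}
with open("output.jl") as f:
    for line in f:
        partial = json.loads(line)
        site = partial.pop("site")
        merged.setdefault(site, {"site": site}).update(
            {k: v for k, v in partial.items() if v is not None}
        )

with open("merged.jl", "w") as f:
    for item in merged.values():
        f.write(json.dumps(item) + "\n")
```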