How to collect data from multiple pages into a single data structure using Scrapy

I am trying to scrape data from a site. The data is structured as several objects, each of which has a set of fields: for example, people with names, ages, and occupations.

My problem is that this data is split across two levels of the website.
The first page is, for example, a list of names and ages, with a link to each person's profile page.
The profile page lists that person's occupation.

I already have a spider written in Python with Scrapy that collects the data from the top level and follows the pagination.
But how can I collect the data from the inner pages while keeping it linked to the corresponding object? My current spider looks roughly like the sketch below.
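Simplified sketch of the current spider (the start URL, selectors, and field names here are placeholders, not the real site's):

    import scrapy

    class PeopleSpider(scrapy.Spider):
        name = 'people'
        start_urls = ['http://example.com/people?page=1']  # placeholder URL

        def parse(self, response):
            # top level: one record per person in the listing
            for row in response.css('div.person'):  # placeholder selector
                yield {
                    'name': row.css('span.name::text').get(),
                    'age': row.css('span.age::text').get(),
                }
            # follow the pagination link, if there is one
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)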

I currently have output structured as JSON along these lines:

    [
        {"name": "name", "age": "age", "occupation": "occupation"},
        {"name": "name", "age": "age", "occupation": "occupation"}
    ]

Can the parse callback reach those inner pages?

1 answer

Here is the way to do it: yield the item only once it has all of its attributes filled in. You can pass the partially built item from one callback to the next through the request's meta dict:

    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector

    # in the callback for the listing page:
    yield Request(page1, callback=self.page1_data)

    def page1_data(self, response):
        hxs = HtmlXPathSelector(response)
        i = TestItem()
        i['name'] = 'name'
        i['age'] = 'age'
        url_profile_page = 'url to the profile page'
        # hand the half-filled item to the next callback via meta
        yield Request(url_profile_page, meta={'item': i}, callback=self.profile_page)

    def profile_page(self, response):
        hxs = HtmlXPathSelector(response)
        old_item = response.request.meta['item']
        # parse the remaining fields and assign them to old_item
        yield old_item
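For completeness, here is a sketch of how the whole spider could look with this technique combined with the pagination from the question. It uses the current scrapy.Spider / response.follow API rather than the deprecated HtmlXPathSelector, and the URL, selectors, and field names are placeholders you would replace with the real ones:

    import scrapy

    class PeopleSpider(scrapy.Spider):
        name = 'people'
        start_urls = ['http://example.com/people?page=1']  # placeholder URL

        def parse(self, response):
            # top level: name and age, plus a link to each profile page
            for row in response.css('div.person'):  # placeholder selector
                item = {
                    'name': row.css('span.name::text').get(),
                    'age': row.css('span.age::text').get(),
                }
                profile_url = row.css('a.profile::attr(href)').get()
                if profile_url:
                    # carry the partial item along to the profile page
                    yield response.follow(profile_url, callback=self.parse_profile,
                                          meta={'item': item})
            # follow the pagination link, if there is one
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

        def parse_profile(self, response):
            # second level: complete the item with the occupation, then yield it
            item = response.meta['item']
            item['occupation'] = response.css('span.occupation::text').get()  # placeholder
            yield item

Each partially built item travels with its own request, so even with many profile requests in flight at once, every occupation ends up attached to the right person.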
