I am trying to scrape data from a site. The data consists of several objects, each with a set of fields: for example, people with names, ages, and occupations.
My problem is that this data is split across two levels of the website.
The first page is, for example, a list of names and ages with a link to each person’s profile page.
Each profile page lists that person's occupation.
I already have a spider written with Scrapy in Python that collects the data from the top level and follows the pagination across several listing pages.
But how can I collect the data from the inner profile pages while keeping it linked to the corresponding object?
I currently want my output structured as JSON like this:
[{"name": "name", "age": "age", "occupation": "occupation"}, {"name": "name", "age": "age", "occupation": "occupation"}] etc.
Can a parsing function reach such inner pages and add their data to the same item?
user2071236