Scraping pages paginated via __doPostBack JavaScript links with Python Scrapy

I am crawling some directories on an ASP.NET site with Scrapy.

The pagination links are encoded like this:

javascript:__doPostBack('ctl00$MainContent$List','Page$X')

where X is an integer between 1 and 180. The MainContent argument is always the same, and I have no idea how to follow these links. I would love to add something to my SgmlLinkExtractor (SLE) rules as simple as allow=('Page$') or attrs='__doPostBack', but I assume something more involved is needed to pull the information out of the JavaScript link.

Alternatively, if it is easier to extract each of the targets from the JavaScript links and save them to a CSV, then feed that CSV into a new scraper as requests, that works for me too.
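
To illustrate what I mean, here is a rough sketch of that fallback idea (the regex and helper names are made up, and it assumes the raw HTML of a listing page is already available): it pulls the target/argument pairs out of the javascript:__doPostBack(...) hrefs and dumps them to a CSV.

 import csv
 import re

 # Hypothetical helper: pull the __doPostBack target/argument pairs,
 # e.g. ('ctl00$MainContent$List', 'Page$42'), out of raw page HTML.
 POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")

 def dump_postback_args(html, path='pages.csv'):
     with open(path, 'w') as f:
         writer = csv.writer(f)
         writer.writerow(['event_target', 'event_argument'])
         writer.writerows(POSTBACK_RE.findall(html))

Of course these target/argument pairs are not real URLs, so something would still have to turn them into requests.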

1 answer

This type of pagination is not as trivial as it might seem; it was an interesting problem to solve. A few important points about the solution below:

  • The idea is to keep track of the current page number by passing it along in the Request.meta dictionary
  • Use a regular BaseSpider rather than a CrawlSpider, since there is custom logic involved in the pagination
  • It is important to send headers that mimic a real browser; the X-MicrosoftAjax: Delta=true header in particular makes the server answer with a partial "delta" response (more on this right after the list)
  • It is important to issue the FormRequest with dont_filter=True, since we are essentially making POST requests to the same URL over and over, just with different parameters
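
About that X-MicrosoftAjax header: ASP.NET AJAX partial postbacks come back as pipe-delimited "delta" segments, with hidden fields such as __VIEWSTATE and __EVENTVALIDATION embedded as ...|__VIEWSTATE|<value>|... . The code below extracts them inline with re.search; factored out as a standalone helper (my naming, not part of the spider below), that logic would look like:

 import re

 def extract_hidden_field(body, name):
     # Hidden fields in an ASP.NET AJAX "delta" response appear as
     # pipe-delimited segments: ...|__VIEWSTATE|<value>|...
     match = re.search(r"%s\|(.*?)\|" % re.escape(name), body, re.MULTILINE)
     return match.group(1) if match else None

Usage would be, for example, extract_hidden_field(response.body, '__EVENTVALIDATION').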

The code:

 import re

 from scrapy.http import FormRequest
 from scrapy.spider import BaseSpider

 HEADERS = {
     'X-MicrosoftAjax': 'Delta=true',
     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
 }
 URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'


 class ExitRealtySpider(BaseSpider):
     name = "exit_realty"

     allowed_domains = ["exitrealty.com"]
     start_urls = [URL]

     def parse(self, response):
         # submit a form (first page)
         self.data = {}
         for form_input in response.css('form#aspnetForm input'):
             name = form_input.xpath('@name').extract()[0]
             try:
                 value = form_input.xpath('@value').extract()[0]
             except IndexError:
                 value = ""
             self.data[name] = value

         self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
         self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
         self.data['__EVENTARGUMENT'] = 'Page$1'

         return FormRequest(url=URL,
                            method='POST',
                            callback=self.parse_page,
                            formdata=self.data,
                            meta={'page': 1},
                            dont_filter=True,
                            headers=HEADERS)

     def parse_page(self, response):
         current_page = response.meta['page'] + 1

         # parse agents (TODO: yield items instead of printing)
         for agent in response.xpath('//a[@class="regtext"]/text()'):
             print agent.extract()
         print "------"

         # request the next page
         data = {
             '__EVENTARGUMENT': 'Page$%d' % current_page,
             '__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", response.body, re.MULTILINE).group(1),
             '__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", response.body, re.MULTILINE).group(1),
             '__ASYNCPOST': 'true',
             '__EVENTTARGET': 'ctl00$MainContent$agentList',
             'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
             '': ''
         }

         return FormRequest(url=URL,
                            method='POST',
                            formdata=data,
                            callback=self.parse_page,
                            meta={'page': current_page},
                            dont_filter=True,
                            headers=HEADERS)
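
One more thing: as written, the spider keeps requesting the next page indefinitely. Since the question says X runs from 1 to 180, a stop condition is worth adding. A minimal sketch of a revised parse_page (the LAST_PAGE constant comes from the question, the empty-page check is my own safeguard, and the omitted part is identical to parse_page above):

     LAST_PAGE = 180  # per the question; the empty-page check is a fallback

     def parse_page(self, response):
         current_page = response.meta['page'] + 1
         agents = response.xpath('//a[@class="regtext"]/text()').extract()
         for agent in agents:
             print agent
         # stop paginating past the known last page or on an empty page
         if current_page > self.LAST_PAGE or not agents:
             return
         # ... build `data` and return the next FormRequest exactly as above ...

To try it out, save the spider to a file (e.g. exit_realty.py) and start it with scrapy runspider exit_realty.py.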

Source: https://habr.com/ru/post/1215095/

