Scraping pages paginated via __doPostBack JavaScript links with Python Scrapy

I am crawling some directories on an ASP.NET site with Scrapy.

The pagination links are encoded like this:

javascript:__doPostBack('ctl00$MainContent$List','Page$X')

where X is an integer between 1 and 180. The MainContent argument is always the same, and I have no idea how to follow these links. I would love to add something to my SgmlLinkExtractor (SLE) rules as simple as allow=('Page$') or attrs='__doPostBack', but I assume something more involved is needed to pull the information out of the JavaScript link.

Alternatively, if it is easier to extract each of the targets from the JavaScript links and save them to a CSV, then feed that CSV into a new scraper as requests, that works for me too.
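
To illustrate what I mean, here is a rough sketch of that fallback idea (the regex and helper names are made up, and it assumes the raw HTML of a listing page is already available): it pulls the target/argument pairs out of the javascript:__doPostBack(...) hrefs and dumps them to a CSV.

 import csv
 import re

 # Hypothetical helper: pull the __doPostBack target/argument pairs,
 # e.g. ('ctl00$MainContent$List', 'Page$42'), out of raw page HTML.
 POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")

 def dump_postback_args(html, path='pages.csv'):
     with open(path, 'w') as f:
         writer = csv.writer(f)
         writer.writerow(['event_target', 'event_argument'])
         writer.writerows(POSTBACK_RE.findall(html))

Of course these target/argument pairs are not real URLs, so something would still have to turn them into requests.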

1 answer

This type of pagination is not as trivial as it might seem; it was an interesting problem to solve. A few important points about the solution below:

  • The idea is to keep track of the current page number by passing it along in the Request.meta dictionary
  • Use a regular BaseSpider rather than a CrawlSpider, since there is custom logic involved in the pagination
  • It is important to send headers that mimic a real browser; the X-MicrosoftAjax: Delta=true header in particular makes the server answer with a partial "delta" response (more on this right after the list)
  • It is important to issue the FormRequest with dont_filter=True, since we are essentially making POST requests to the same URL over and over, just with different parameters
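
About that X-MicrosoftAjax header: ASP.NET AJAX partial postbacks come back as pipe-delimited "delta" segments, with hidden fields such as __VIEWSTATE and __EVENTVALIDATION embedded as ...|__VIEWSTATE|<value>|... . The code below extracts them inline with re.search; factored out as a standalone helper (my naming, not part of the spider below), that logic would look like:

 import re

 def extract_hidden_field(body, name):
     # Hidden fields in an ASP.NET AJAX "delta" response appear as
     # pipe-delimited segments: ...|__VIEWSTATE|<value>|...
     match = re.search(r"%s\|(.*?)\|" % re.escape(name), body, re.MULTILINE)
     return match.group(1) if match else None

Usage would be, for example, extract_hidden_field(response.body, '__EVENTVALIDATION').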

The code:

 import re

 from scrapy.http import FormRequest
 from scrapy.spider import BaseSpider

 HEADERS = {
     'X-MicrosoftAjax': 'Delta=true',
     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
 }
 URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'


 class ExitRealtySpider(BaseSpider):
     name = "exit_realty"

     allowed_domains = ["exitrealty.com"]
     start_urls = [URL]

     def parse(self, response):
         # submit a form (first page)
         self.data = {}
         for form_input in response.css('form#aspnetForm input'):
             name = form_input.xpath('@name').extract()[0]
             try:
                 value = form_input.xpath('@value').extract()[0]
             except IndexError:
                 value = ""
             self.data[name] = value

         self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
         self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
         self.data['__EVENTARGUMENT'] = 'Page$1'

         return FormRequest(url=URL,
                            method='POST',
                            callback=self.parse_page,
                            formdata=self.data,
                            meta={'page': 1},
                            dont_filter=True,
                            headers=HEADERS)

     def parse_page(self, response):
         current_page = response.meta['page'] + 1

         # parse agents (TODO: yield items instead of printing)
         for agent in response.xpath('//a[@class="regtext"]/text()'):
             print agent.extract()
         print "------"

         # request the next page
         data = {
             '__EVENTARGUMENT': 'Page$%d' % current_page,
             '__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", response.body, re.MULTILINE).group(1),
             '__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", response.body, re.MULTILINE).group(1),
             '__ASYNCPOST': 'true',
             '__EVENTTARGET': 'ctl00$MainContent$agentList',
             'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
             '': ''
         }

         return FormRequest(url=URL,
                            method='POST',
                            formdata=data,
                            callback=self.parse_page,
                            meta={'page': current_page},
                            dont_filter=True,
                            headers=HEADERS)
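
One more thing: as written, the spider keeps requesting the next page indefinitely. Since the question says X runs from 1 to 180, a stop condition is worth adding. A minimal sketch of a revised parse_page (the LAST_PAGE constant comes from the question, the empty-page check is my own safeguard, and the omitted part is identical to parse_page above):

     LAST_PAGE = 180  # per the question; the empty-page check is a fallback

     def parse_page(self, response):
         current_page = response.meta['page'] + 1
         agents = response.xpath('//a[@class="regtext"]/text()').extract()
         for agent in agents:
             print agent
         # stop paginating past the known last page or on an empty page
         if current_page > self.LAST_PAGE or not agents:
             return
         # ... build `data` and return the next FormRequest exactly as above ...

To try it out, save the spider to a file (e.g. exit_realty.py) and start it with scrapy runspider exit_realty.py.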

Source: https://habr.com/ru/post/1215095/

