Finding the web address behind a JavaScript link

I am trying to use RCurl in R to download data from a website, but I am having trouble working out which URL to use. Here is the site:

http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX

In the upper right corner, above the displayed table, there is a link to download the data as a .csv file. Is there a way to find a plain HTTP address for this .csv file? RCurl cannot handle JavaScript commands.

+6
javascript r
4 answers

Clicking on the Download link executes this JavaScript snippet:

 __doPostBack('ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1','') 

This __doPostBack function simply fills a couple of hidden form fields on this page, and then sends a POST request.

A quick Google search shows that RCurl can send a POST request. So what you need to do is look in the source of this page, find the form called "aspnetForm", take all the fields from this form, and create your own POST request that submits those fields to the action URL (http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX).

I can't guarantee that this will work. There appears to be a hidden form field called __VIEWSTATE that encodes some information, and I don't know how it is generated.
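For what it's worth, the request that __doPostBack ends up sending can be sketched in base R. This is only a sketch under assumptions, not the site's actual logic: `encode_one` and `build_postback_body` are hypothetical helper names, and the `__VIEWSTATE` value is a made-up placeholder rather than a real value from the page. It copies the hidden fields, sets __EVENTTARGET and __EVENTARGUMENT the way the snippet above would, and URL-encodes everything into a form body.

```r
# Percent-encode each element of a character vector for use in a form body.
encode_one <- function(x) {
  vapply(x, function(s) utils::URLencode(s, reserved = TRUE, repeated = TRUE),
         character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: take the hidden fields scraped from the form, set the
# two fields that __doPostBack fills in, and build the urlencoded body.
build_postback_body <- function(hidden_fields, event_target, event_argument = "") {
  fields <- as.list(hidden_fields)
  fields[["__EVENTTARGET"]] <- event_target
  fields[["__EVENTARGUMENT"]] <- event_argument
  values <- unlist(fields, use.names = FALSE)
  paste(encode_one(names(fields)), encode_one(values), sep = "=", collapse = "&")
}

# Placeholder value, NOT a real __VIEWSTATE from the page.
hidden <- list("__VIEWSTATE" = "/wEPDw==")
body <- build_postback_body(
  hidden,
  "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
)
print(body)
```

The resulting string is mainly useful for comparing against what the browser actually sends.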

+7

Here is a quick and dirty way to get the data. First, you can use Fiddler2 (http://www.fiddler2.com/fiddler2/) to inspect the POST that your browser sends when you click the link. It looks like this:

 POST http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX HTTP/1.1
 Host: www.invescopowershares.com
 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0
 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
 Accept-Language: en-us,en;q=0.5
 Accept-Encoding: gzip, deflate
 DNT: 1
 Connection: keep-alive
 Referer: http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX
 Content-Type: application/x-www-form-urlencoded
 Content-Length: 70669

 __EVENTTARGET=ctl00%24MainPageLeft%24MainPageContent%24ExportHoldings1%24LinkButton1&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKLTE1OTcxNjYzNw9kFgJmD2QWBAIDD2QWBAIDD2QWCAIBDw9kFgQeC2........

So the three parameters that matter are __EVENTTARGET, __EVENTVALIDATION and __VIEWSTATE. (__EVENTARGUMENT is also sent, but it is empty.)
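To read a captured body like this in R, you can split it on `&` and percent-decode each value. A minimal base-R sketch follows; `parse_form_body` is a hypothetical helper name, and the `__VIEWSTATE` value is deliberately cut short here, just as it is in the capture.

```r
# Hypothetical helper: split an application/x-www-form-urlencoded body into
# a named list of decoded values.
parse_form_body <- function(body) {
  pairs <- strsplit(body, "&", fixed = TRUE)[[1]]
  keys <- sub("=.*$", "", pairs)             # text before the first "="
  raw_values <- sub("^[^=]*=", "", pairs)    # text after the first "="
  values <- vapply(raw_values, utils::URLdecode, character(1), USE.NAMES = FALSE)
  stats::setNames(as.list(values), keys)
}

captured <- paste0(
  "__EVENTTARGET=ctl00%24MainPageLeft%24MainPageContent%24ExportHoldings1%24LinkButton1",
  "&__EVENTARGUMENT=",
  "&__VIEWSTATE=%2FwEPDwUKLTE1OTcxNjYzNw9k"  # truncated on purpose
)

params <- parse_form_body(captured)
print(names(params))
print(params[["__EVENTTARGET"]])
```

Decoding makes it clear that the `%24`-laden __EVENTTARGET is the same `ctl00$...` string that appears in the page's __doPostBack call.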

The postForm call then needs to look like this:

 postForm(ftarget,
          "form name" = "aspnetForm", "method" = "POST",
          "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm",
          "__EVENTTARGET" = event.target,
          "__EVENTVALIDATION" = event.val,
          "__VIEWSTATE" = view.state)

Now comes the quick and dirty bit. I would just open a browser and read the relevant parameters from the page it loads, as follows:

 library(rcom)
 library(RCurl)  # for postForm
 # Drive Internet Explorer over COM and wait for the page to load
 ie = comCreateObject('InternetExplorer.Application')
 ie[["visible"]]=TRUE # true for debugging
 ie$Navigate2("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX")
 while(comGetProperty(ie,"busy")||comGetProperty(ie,"ReadyState")<4){
   Sys.sleep(1)
   print(comGetProperty(ie,"ReadyState"))
 }
 # Copy the hidden form values into JavaScript variables, then read them from R
 myDoc<-comGetProperty(ie,"Document")
 myPW<-comGetProperty(myDoc,"parentWindow")
 comInvoke(myPW,"execScript","var dumVar1=theForm.__EVENTVALIDATION.value;var dumVar2=theForm.__VIEWSTATE.value;","JavaScript")
 event.val<-myPW[["dumVar1"]]
 view.state<-myPW[["dumVar2"]]
 event.target<-"ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
 ie$Quit()
 # Send the POST and read the returned CSV
 ftarget<-"http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
 web.data<-postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)
 write(web.data[1],'temp.csv')
 fin.data<-read.csv('temp.csv')

 > fin.data[1,]
   ticker SecurityNum                      Name CouponRate maturitydate
 1    PGX   949746879 WELLS FARGO & COMPANY PFD       0.08
      rating  Shares PercentageOfFund PositionDate
 1 BBB+/Baa3 2538656       0.04442112   06/11/2012

__EVENTVALIDATION and __VIEWSTATE may always be the same, or they may behave like session cookies. You could probably get them using RCurl, but, as I said, this is a quick and dirty solution, so we just take the ones Internet Explorer provides. Things to note:

1). The rcom part requires Windows with IE installed.

2). If you are using IE9, you may need to add invescopowershares.com to your Compatibility View settings (Microsoft seems to have blocked calls such as event.val<-myPW[["dumVar1"]] over COM).

EDIT (UPDATE)

Looking at the website in more detail, __EVENTVALIDATION and __VIEWSTATE are already set as hidden field values in the source of the initial page. We can simply parse them out in a quick and dirty way, without resorting to driving a browser:

 library(RCurl)
 # Fetch the page source
 dum<-getURL("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX")
 event.target<-"ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
 # Cut the two hidden values out of the HTML (depends on the page's exact markup)
 event.val<-unlist(strsplit(dum,"__EVENTVALIDATION\" value=\""))[2]
 event.val<-unlist(strsplit(event.val,"\" />\r\n\r\n<script"))[1]
 view.state<-unlist(strsplit(dum,"id=\"__VIEWSTATE\" value=\""))[2]
 view.state<-unlist(strsplit(view.state,"\" />\r\n\r\n\r\n<script"))[1]
 # Send the POST and read the returned CSV
 ftarget<-"http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
 web.data<-postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)
 write(web.data[1],'temp.csv')
 fin.data<-read.csv('temp.csv')

The above should work cross-platform.
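If the strsplit calls ever break because the page's whitespace changes, a regular expression is a little more forgiving. This is a minimal base-R sketch, assuming the hidden inputs keep the `id="..." value="..."` attribute order that the splitting code relies on. `extract_hidden_field` is a hypothetical helper, and `sample_html` is a made-up stand-in for the real page source returned by getURL.

```r
# Hypothetical helper: return the value attribute of the input with the given
# id, or NA if it is not found. Assumes id comes before value in the tag.
extract_hidden_field <- function(html, field) {
  pattern <- paste0('id="', field, '"[^>]*value="([^"]*)"')
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Made-up markup standing in for the real page; the values are placeholders.
sample_html <- paste0(
  '<form name="aspnetForm" method="post" action="holdings.aspx?ticker=PGX" id="aspnetForm">\n',
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUK" />\n',
  '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAgLx" />\n',
  '</form>'
)

view.state <- extract_hidden_field(sample_html, "__VIEWSTATE")
event.val <- extract_hidden_field(sample_html, "__EVENTVALIDATION")
print(view.state)
print(event.val)
```

With the real page you would pass the string returned by getURL instead of `sample_html`; whether the real markup matches this attribute order is something to check against the page source.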

+10

There is definitely a way to get the CSV file with RCurl, but I can't figure out which form fields I need to pass to getForm to make it work. Should I use the fields from the __doPostBack command attached to the "Download" link on the page, or the fields from aspnetForm in the page source? For reference, the aspnetForm element in question is:

"form name="aspnetForm" method="post" action="holdings.aspx?ticker=PGX" id="aspnetForm" style="margin: 0px""

... and the postForm request that I just tried that didn't work was

postForm("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX", "form name" = "aspnetForm", "method" = "post", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm", "style" = "margin: 0px")

Thanks for the help!

+1

The qmao package now has a function that does this for you. (It is based on the code from the now-deleted answer to this question.)

You can use dlPowerShares as follows:

 require("qmao")
 Symbol <- "PGX"
 dat <- qmao:::dlPowerShares(event.target = "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1",
                             action = paste0("holdings.aspx?ticker=", Symbol))

 > head(dat)
   ticker SecurityNum                           Name CouponRate maturitydate    rating  Shares PercentageOfFund PositionDate
 1    PGX   173080201         CITIGROUP CAPITAL XIII    0.07875   10/30/2040    BB/Ba2 2998647       0.04274939   08/31/2012
 2    PGX   949746879      WELLS FARGO & COMPANY PFD    0.08000              BBB+/Baa3 2549992       0.03935854   08/31/2012
 3    PGX   06739H362                BARCLAYS BK PLC    0.08125                A-/Baa3 2757635       0.03644835   08/31/2012
 4    PGX   46625H621                 JPMORGAN CHASE    0.08625              BBB+/Baa1 2416021       0.03310707   08/31/2012
 5    PGX   060505765   BANK OF AMERICA CORP PFD 8.2    0.08200                 BB+/B1 2345508       0.03128002   08/31/2012
 6    PGX   060505559 BANC OF AMERICA CORP PFD 8.625    0.08625                 BB+/B1 2259484       0.03001599   08/31/2012

In the above code, event.target is the first string inside the javascript:__doPostBack() call, which you can get by right-clicking the Download link and choosing "Copy link address".

action is the product-specific part of the action URL.
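If you would rather not pick the target out by hand, the copied link address can be taken apart with a regular expression. A minimal base-R sketch; `parse_postback_link` is a hypothetical helper and is not part of qmao, and it assumes the link has the two-argument `__doPostBack('...','...')` shape shown on this page.

```r
# Hypothetical helper: pull the two quoted arguments out of a copied
# javascript:__doPostBack('target','argument') link address.
parse_postback_link <- function(link) {
  pattern <- "__doPostBack\\('([^']*)','([^']*)'\\)"
  m <- regmatches(link, regexec(pattern, link))[[1]]
  if (length(m) < 3) stop("not a __doPostBack link: ", link)
  list(event_target = m[2], event_argument = m[3])
}

link <- "javascript:__doPostBack('ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1','')"
parsed <- parse_postback_link(link)
print(parsed$event_target)
```

The `event_target` element is what you would pass as event.target.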

Internally, the code follows Jeff's suggestion in his answer and searches the page source for the field values of "aspnetForm". It then uses these values when calling postForm (from the RCurl package).

In the qmao package, dlPowerShares is used by getHoldings.powershares. In addition, getHoldings will call getHoldings.powershares if any of the Symbols passed to it is a PowerShares ETF symbol.


P.S. If qmao:::dlPowerShares is called with its default values, it will download a list of PowerShares products from http://www.invescopowershares.com/products/

+1
