Using Scrapy with JavaScript and iframes, and alternatives

I am trying to use Scrapy to scrape the U.S. government's regulations website (www.regulations.gov). It has a ton of information, but it is a terrible website filled with JavaScript and iframes. I tried to run some simple Scrapy spiders, but I can't parse anything because everything is loaded through JavaScript and iframes.

For example, on the main search page, this code block actually loads the results table:

<script type="text/javascript" src="Regs/Regs.nocache.js?REGS211-b3"></script>
<title>Regulations.gov</title>
<link rel="stylesheet" type="text/css" href="css/print.css" media="print" />
</head>
<body class="bodyLoading">
  <!-- this is required for GWT history support -->
  <iframe src="javascript:''" id="__gwt_historyFrame" tabIndex='-1' style="position:absolute;width:0;height:0;border:0"></iframe>
  <!-- For printing window contents -->
  <iframe id="__printingFrame" style="width:0;height:0;border:0;" ></iframe>

And individual results pages have the same problem. For example, this page has the same source as above.
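
To make the problem concrete, here is a minimal sketch that parses a trimmed copy of the source quoted above with Python's standard `html.parser` and counts the tags a non-JavaScript client would see. The names `STATIC_HTML`, `TagCounter`, and `summarize` are made up for this illustration and are not part of Scrapy.

```python
from html.parser import HTMLParser

# Trimmed copy of the static source quoted above. This is all that a client
# which does not execute JavaScript (such as a plain Scrapy spider) receives.
STATIC_HTML = """
<script type="text/javascript" src="Regs/Regs.nocache.js?REGS211-b3"></script>
<title>Regulations.gov</title>
</head>
<body class="bodyLoading">
<iframe src="javascript:''" id="__gwt_historyFrame" tabIndex='-1'
        style="position:absolute;width:0;height:0;border:0"></iframe>
<iframe id="__printingFrame" style="width:0;height:0;border:0;"></iframe>
"""


class TagCounter(HTMLParser):
    """Counts opening tags and records the src of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.script_srcs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        self.counts[tag] = self.counts.get(tag, 0) + 1
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_srcs.append(src)


def summarize(html):
    """Return (tag counts, script URLs) for a chunk of static HTML."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts, parser.script_srcs


counts, script_srcs = summarize(STATIC_HTML)
print("tables:", counts.get("table", 0))
print("iframes:", counts.get("iframe", 0))
print("scripts:", script_srcs)
```

The static markup contains no table at all, only a script reference and two zero-sized iframes, so there is nothing for a selector to match until that script has run in a browser.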

Can Scrapy handle this problem at all? Are there any alternatives that may be available?

1 answer

Alternatives to try:

1) Selenium

2) iMacros

3) PhantomJS with CasperJS
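
As a rough sketch of the first option, assuming the `selenium` Python package and a local Firefox install: let a real browser run the JavaScript, then read back the rendered DOM. The helper names `fetch_rendered_html` and `make_firefox_driver` are hypothetical, and the fixed five-second wait is a guess rather than a value tuned for this site.

```python
import time


def fetch_rendered_html(url, driver, wait_seconds=5.0, sleep=time.sleep):
    """Load url in a browser driver, wait for scripts, return the rendered DOM.

    driver is expected to behave like a Selenium WebDriver: it needs get(),
    a page_source attribute, and quit().
    """
    try:
        driver.get(url)
        # Crude fixed wait so the page's scripts have time to build the DOM.
        # Assumption: a few seconds is enough; an explicit wait on a known
        # element would be more robust.
        sleep(wait_seconds)
        return driver.page_source
    finally:
        # Always close the browser, even if loading fails.
        driver.quit()


def make_firefox_driver():
    """Create a Firefox WebDriver (assumes selenium and Firefox are installed)."""
    # Imported lazily so the helper above can be used without selenium present.
    from selenium import webdriver

    return webdriver.Firefox()
```

Calling `fetch_rendered_html("http://www.regulations.gov/", make_firefox_driver())` should return the page as the browser sees it after the script has run. That string can then be handed to Scrapy's `Selector(text=html)` so the usual XPath or CSS selectors still apply.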
