I am trying to use Scrapy to scrape the U.S. government regulations website (www.regulations.gov). It has a ton of information, but it is a terrible website filled with JavaScript and iframes. I tried running some simple Scrapy spiders, but I can't parse anything because everything is loaded through JavaScript and iframes.
For example, on the main search page, this code block actually loads the results table:
<script type="text/javascript" src="Regs/Regs.nocache.js?REGS211-b3"></script>
<title>Regulations.gov</title>
<link rel="stylesheet" type="text/css" href="css/print.css" media="print" />
</head>
<body class="bodyLoading">
<iframe src="javascript:''" id="__gwt_historyFrame" tabIndex='-1' style="position:absolute;width:0;height:0;border:0"></iframe>
<iframe id="__printingFrame" style="width:0;height:0;border:0;" ></iframe>
And individual results pages have the same problem. For example, this page has the same source as above.
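To make the problem concrete, here is a minimal sketch of what a spider effectively receives. It uses only Python's standard library and parses the markup quoted above, without executing any JavaScript. The names `TagCounter` and `summarize` are made up for illustration and are not part of Scrapy.

```python
from collections import Counter
from html.parser import HTMLParser

# Static markup copied from the search page source quoted in the question.
STATIC_HTML = """
<script type="text/javascript" src="Regs/Regs.nocache.js?REGS211-b3"></script>
<title>Regulations.gov</title>
<link rel="stylesheet" type="text/css" href="css/print.css" media="print" />
</head>
<body class="bodyLoading">
<iframe src="javascript:''" id="__gwt_historyFrame" tabIndex='-1' style="position:absolute;width:0;height:0;border:0"></iframe>
<iframe id="__printingFrame" style="width:0;height:0;border:0;" ></iframe>
"""


class TagCounter(HTMLParser):
    """Hypothetical helper: counts start tags and records external script URLs."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.script_srcs = []

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_srcs.append(src)


def summarize(html):
    """Parse the HTML without running any JavaScript, as a plain HTTP fetch would."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.script_srcs


tags, script_srcs = summarize(STATIC_HTML)
print("tables in static HTML:", tags["table"])
print("iframes:", tags["iframe"])
print("external scripts:", script_srcs)
```

The static source contains no table at all, only the script reference and two empty iframes, so any selector aimed at the results table has nothing to match until the script has run in a browser.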
Can Scrapy handle this problem at all? Are there any alternatives that may be available?