Scrapy Shell and Scrapy Splash

We have been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine, which runs inside a Docker container.

To use Splash in a spider, we configure several required project settings and yield a Request with specific meta arguments:

    yield Request(url, self.parse_result, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,
                # 'url' is prefilled from request url
            },
            # optional parameters
            'endpoint': 'render.json',  # optional; default is render.json
            'splash_url': '<url>',      # overrides SPLASH_URL
            'slot_policy': scrapyjs.SlotPolicy.PER_DOMAIN,
        }
    })
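For context, the "required project settings" mentioned above are the ones documented by scrapy-splash: the address of the Splash instance plus a few middleware registrations in settings.py. The URL below is a placeholder for wherever your Splash container is listening:

    SPLASH_URL = 'http://localhost:8050'  # address of your Splash container

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'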

This works as documented. But how can we use scrapy-splash inside the Scrapy Shell?

3 answers

Just wrap the URL you want to fetch in the Splash HTTP API.

So you need something like:

 scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5' 

where:

  - localhost:8050 is where your Splash service is running;
  - url is the page you want Splash to render, and don't forget to urlquote it!
  - render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page;
  - timeout is the timeout in seconds;
  - wait is the time in seconds to wait for the JavaScript to execute before reading/saving the HTML.
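Urlquoting matters whenever the target URL carries its own query string; otherwise its & and = characters get parsed as parameters of the Splash request itself. A minimal sketch in Python using the standard library's urllib.parse.quote (the target URL is a placeholder):

    from urllib.parse import quote

    # Placeholder target page; note it has its own query string.
    target = 'http://domain.com/search?q=javascript&page=2'

    # Percent-encode the target so its '&' and '=' are not interpreted
    # as parameters of the render.html request.
    splash_url = (
        'http://localhost:8050/render.html?url='
        + quote(target, safe='')
        + '&timeout=10&wait=0.5'
    )
    print(splash_url)

Pass the printed URL to scrapy shell in quotes, exactly as in the command above.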


You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).
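A sketch of such a session, assuming the shell is started from a project with the scrapy-splash settings configured (the URL and args are placeholders):

    $ scrapy shell
    >>> from scrapy_splash import SplashRequest
    >>> req = SplashRequest('http://domain.com/page-with-javascript.html',
    ...                     endpoint='render.html',
    ...                     args={'wait': 0.5, 'timeout': 10})
    >>> fetch(req)
    >>> response.css('title::text').get()

Because the shell loads the project settings, the scrapy-splash middleware processes the request, and response contains the rendered page rather than the raw source.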


For Windows users running Docker Toolbox:

  1. Change the single quotes to double quotes to prevent the "invalid hostname:http" error.

  2. Change localhost to the Docker machine IP shown under the whale logo; for me it was 192.168.99.100.

Finally I got this:

scrapy shell "http://192.168.99.100:8050/render.html?url=https://samplewebsite.com/category/banking-insurance-financial-services/"
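If the IP is not visible in the Docker Toolbox window, it can also be printed from the command line; this assumes the machine uses Docker Toolbox's default name:

    docker-machine ip default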

