Scrapy Shell and Scrapy Splash

We have been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine, which runs inside a Docker container.

To use Splash in a spider, we configure several required project settings and yield a Request with specific meta arguments:

    yield Request(url, self.parse_result, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,
                # 'url' is prefilled from request url
            },
            # optional parameters
            'endpoint': 'render.json',  # optional; default is render.json
            'splash_url': '<url>',      # overrides SPLASH_URL
            'slot_policy': scrapyjs.SlotPolicy.PER_DOMAIN,
        }
    })
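For context, the "required project settings" mentioned above are the ones documented by scrapy-splash: the address of the Splash instance plus a few middleware registrations in settings.py. The URL below is a placeholder for wherever your Splash container is listening:

    SPLASH_URL = 'http://localhost:8050'  # address of your Splash container

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'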

This works as documented. But how can we use scrapy-splash inside the Scrapy Shell?

3 answers

Just wrap the URL you want to fetch in the Splash HTTP API.

So you need something like:

 scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5' 

where:

  - localhost:8050 is where your Splash service is running;
  - url is the page you want Splash to render, and don't forget to urlquote it!
  - render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page;
  - timeout is the timeout in seconds;
  - wait is the time in seconds to wait for the JavaScript to execute before reading/saving the HTML.
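Urlquoting matters whenever the target URL carries its own query string; otherwise its & and = characters get parsed as parameters of the Splash request itself. A minimal sketch in Python using the standard library's urllib.parse.quote (the target URL is a placeholder):

    from urllib.parse import quote

    # Placeholder target page; note it has its own query string.
    target = 'http://domain.com/search?q=javascript&page=2'

    # Percent-encode the target so its '&' and '=' are not interpreted
    # as parameters of the render.html request.
    splash_url = (
        'http://localhost:8050/render.html?url='
        + quote(target, safe='')
        + '&timeout=10&wait=0.5'
    )
    print(splash_url)

Pass the printed URL to scrapy shell in quotes, exactly as in the command above.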


You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).
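A sketch of such a session, assuming the shell is started from a project with the scrapy-splash settings configured (the URL and args are placeholders):

    $ scrapy shell
    >>> from scrapy_splash import SplashRequest
    >>> req = SplashRequest('http://domain.com/page-with-javascript.html',
    ...                     endpoint='render.html',
    ...                     args={'wait': 0.5, 'timeout': 10})
    >>> fetch(req)
    >>> response.css('title::text').get()

Because the shell loads the project settings, the scrapy-splash middleware processes the request, and response contains the rendered page rather than the raw source.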


For Windows users running Docker Toolbox:

  1. Change the single quotes to double quotes to prevent the "invalid hostname:http" error.

  2. Change localhost to the Docker machine IP shown under the whale logo; for me it was 192.168.99.100.

Finally I got this:

scrapy shell "http://192.168.99.100:8050/render.html?url=https://samplewebsite.com/category/banking-insurance-financial-services/"
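If the IP is not visible in the Docker Toolbox window, it can also be printed from the command line; this assumes the machine uses Docker Toolbox's default name:

    docker-machine ip default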

