How to bypass bot / ddos cloud protection in Scrapy?

Question

I used to occasionally scrape an e-commerce web page to get product pricing information. I had not used the scraper, which is built with Scrapy, for a while, and when I tried to run it yesterday I ran into a problem with bot protection.

The site uses CloudFlare's DDoS protection, which relies mainly on JavaScript evaluation to filter out browsers (and therefore scrapers) that have JS disabled. Once the JS function is evaluated, a response containing a calculated number is generated. In return, the service sends back two authentication cookies which, when attached to each request, allow you to crawl the site normally. Here is a description of how this works.

I also found the cloudflare-scrape Python module, which uses an external JS evaluation engine to calculate the number and send the request back to the server. I am not sure how to integrate it into Scrapy, though. Or maybe there is a smarter way that does not require JS execution? In the end, it is a form ...

I would appreciate any help.

+7

javascript python cookies scrapy

Cloudide Oct 20 '15 at 10:07

3 answers

If you are willing to sacrifice a bit of speed in the scraping process, you can combine Scrapy with Selenium to emulate a real user's interaction with the browser. I wrote a short tutorial here: http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python .

It does not target your specific problem with CloudFlare, but it may help, as I had similar problems loading data that required JS execution.
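As a rough illustration of the Scrapy plus Selenium combination, here is a minimal sketch of a downloader middleware that lets a real browser load the page and hands the rendered HTML back to Scrapy. The class name `SeleniumRenderMiddleware` and the `driver_factory` parameter are hypothetical, Firefox is an arbitrary choice of browser, and this is not taken from the linked tutorial.

```python
class SeleniumRenderMiddleware:
    """Hypothetical downloader middleware that renders pages in a browser."""

    def __init__(self, driver_factory=None):
        # driver_factory is an assumption of this sketch: any callable
        # returning a Selenium-like driver (get, page_source, quit).
        self._driver_factory = driver_factory or self._default_driver
        self._driver = None

    @staticmethod
    def _default_driver():
        # Imported lazily so the module loads without Selenium installed.
        from selenium import webdriver
        return webdriver.Firefox()

    def render(self, url):
        """Load the URL in the browser and return the rendered HTML."""
        if self._driver is None:
            self._driver = self._driver_factory()
        self._driver.get(url)
        return self._driver.page_source

    def process_request(self, request, spider):
        # Returning a response here makes Scrapy skip its own download.
        from scrapy.http import HtmlResponse
        html = self.render(request.url)
        return HtmlResponse(url=request.url, body=html,
                            encoding="utf-8", request=request)

    def close(self):
        """Shut the browser down; call this when the crawl ends."""
        if self._driver is not None:
            self._driver.quit()
            self._driver = None
```

To try it, the class would be listed in the project's `DOWNLOADER_MIDDLEWARES` setting. Every request then goes through one browser instance, which is where the speed penalty mentioned above comes from.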

+2

narko Oct 21 '15 at 7:24

Obviously, the best way to do this is to have your IP address whitelisted in CloudFlare; if that is not an option, I recommend the cloudflare-scrape library. You can use it to obtain the cookie token and then supply that cookie token in the Scrapy request to the server.
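A minimal sketch of that idea follows, assuming cloudflare-scrape is importable as `cfscrape` and that its `get_tokens` function returns a dict of cookies together with the user agent string it used. The helper names `cloudflare_request_kwargs` and `build_requests` are made up for illustration.

```python
def cloudflare_request_kwargs(url, tokens, user_agent):
    """Build keyword arguments for scrapy.Request from the token cookies."""
    return {
        "url": url,
        "cookies": dict(tokens),  # copy so the caller's dict is not shared
        "headers": {"User-Agent": user_agent},
    }


def build_requests(urls, user_agent=None):
    """Yield one Scrapy request per URL, carrying the CloudFlare cookies."""
    # Third-party imports stay local so the helper above has no dependencies.
    import cfscrape
    from scrapy import Request

    for url in urls:
        # Assumed return shape: (cookie dict, user agent string).
        tokens, agent = cfscrape.get_tokens(url, user_agent=user_agent)
        yield Request(**cloudflare_request_kwargs(url, tokens, agent))
```

A spider could return `build_requests(self.start_urls)` from its `start_requests` method. The sketch sends the same user agent that was used to obtain the tokens, on the assumption that the cookies are tied to it.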

+1

mjsa Oct 21 '15 at 6:43

Cloudide · Accepted Answer · 2015-10-22T21:01:10+0000

So, I ended up executing the JavaScript from Python with the help of cloudflare-scrape.

In your scraper you need to add the following code:

    import cfscrape
    from scrapy import Request

    # Inside your spider class:
    def start_requests(self):
        cf_requests = []
        for url in self.start_urls:
            # Solve the JS challenge; returns the cookies and the user agent used
            token, agent = cfscrape.get_tokens(url, 'Your preferred user agent, _optional_')
            cf_requests.append(Request(url=url,
                                       cookies={'__cfduid': token['__cfduid']},
                                       headers={'User-Agent': agent}))
        return cf_requests

along with your parsing functions. That's it!

Of course, you need to install cloudflare-scrape first and import it into your spider. You also need a JS runtime engine. I already had Node.js installed, so no complaints.
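Since a missing JS runtime only shows up once the challenge has to be solved, a quick check before starting the crawl can save some debugging. This is a small sketch using only the standard library; the function name `find_js_runtime` and the candidate binary names `node` and `nodejs` are assumptions, not something cloudflare-scrape prescribes.

```python
import shutil


def find_js_runtime(candidates=("node", "nodejs")):
    """Return the path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    runtime = find_js_runtime()
    if runtime is None:
        print("No JS runtime found on PATH; install Node.js first.")
    else:
        print("Using JS runtime at", runtime)
```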
