Scrapy: allow all domains

I saw this post on making Scrapy crawl any site without the allowed_domains restriction.

Is there a better way to do this, for example by using a wildcard or regular expression in the allowed_domains variable, like this:

allowed_domains = ["*"]

I hope there is a way to do this other than hacking the Scrapy framework.

2 answers

Do not set allowed_domains at all. When a spider has no allowed_domains, the offsite middleware does not filter out any requests.

Take a look at the get_host_regex() function in this file:

https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spidermiddleware/offsite.py
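
To illustrate the idea, here is a minimal standalone sketch of the kind of check that function performs. It is a paraphrase for illustration, not Scrapy's actual code: the names build_host_regex and is_allowed are hypothetical, and the exact pattern in the library may differ between versions.

```python
import re


def build_host_regex(allowed_domains):
    """Return a compiled pattern matching hosts that count as on-site.

    Hypothetical helper that mimics the idea of an offsite filter.
    """
    if not allowed_domains:
        # Nothing configured: the empty pattern matches every host.
        return re.compile("")
    escaped = "|".join(re.escape(d) for d in allowed_domains if d is not None)
    # Match each listed domain itself or any subdomain of it.
    return re.compile(r"^(.*\.)?(%s)$" % escaped)


def is_allowed(host, allowed_domains):
    """True if requests to `host` would pass the filter."""
    return bool(build_host_regex(allowed_domains).search(host))


if __name__ == "__main__":
    print(is_allowed("anything.org", None))                 # no restriction set
    print(is_allowed("sub.example.com", ["example.com"]))   # subdomain of a listed domain
    print(is_allowed("other.org", ["example.com"]))         # unlisted domain
    print(is_allowed("example.com", ["*"]))                 # "*" is escaped, not a wildcard
```

In this sketch each entry is passed through re.escape, so "*" is treated as a literal character rather than a wildcard. That is why leaving the list unset, rather than writing ["*"], is what opens the filter to every host.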
