Python Scrapy: allowed_domains adding new domains from the database

I need to add more domains to allowed_domains so that my requests do not get filtered as offsite ("Filtered offsite request" in the log).

My application gets the URLs to retrieve from the database, so I cannot add them manually.

I tried to override the spider's __init__ like this:

    def __init__(self):
        super(CrawlSpider, self).__init__()
        self.start_urls = []
        for destination in Phpbb.objects.filter(disable=False):
            self.start_urls.append(destination.forum_link)
            self.allowed_domains.append(destination.link)

start_urls works fine (that was the first problem I had to solve), but appending to allowed_domains has no effect.
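
Edit: one thing I started to wonder is whether allowed_domains expects bare host names rather than full links, which would explain why my append has no effect. Here is a minimal sketch of what I mean, reducing each stored URL to its host with urlparse (ForumSpider and the myapp.models path are made up for illustration; forum_link is the full URL on my Phpbb model):

    from urlparse import urlparse  # Python 2, as used by scrapy.contrib-era Scrapy

    from scrapy.contrib.spiders import CrawlSpider

    from myapp.models import Phpbb  # my Django model

    class ForumSpider(CrawlSpider):
        name = 'forums'

        def __init__(self, *args, **kwargs):
            super(ForumSpider, self).__init__(*args, **kwargs)
            self.start_urls = []
            self.allowed_domains = []
            for destination in Phpbb.objects.filter(disable=False):
                self.start_urls.append(destination.forum_link)
                # "http://forum.example.com/index.php" -> "forum.example.com"
                self.allowed_domains.append(urlparse(destination.forum_link).netloc)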

Do I need to change some configuration to disable domain verification? I would rather not turn it off completely, since I only want the domains that come from the database, but knowing how to disable it could still help.
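
From what I understand (my assumption, not something I have verified), this check is done by Scrapy's offsite spider middleware, so something like this in settings.py would switch domain verification off entirely:

    SPIDER_MIDDLEWARES = {
        # None removes a middleware; this path matches scrapy.contrib-era Scrapy
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
    }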

Thanks!!

1 answer
  • 'allowed_domains' is optional. If it is not set (or is left empty), no offsite filtering is applied and requests to any domain are allowed, so simply not setting it disables domain verification.
  • You can see this in get_host_regex in scrapy/contrib/spidermiddleware/offsite.py:

    def get_host_regex(self, spider):
        """Override this method to implement a different offsite policy"""
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('') # allow all by default
        domains = [d.replace('.', r'\.') for d in allowed_domains]
        regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
        return re.compile(regex)
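
The docstring marks get_host_regex as the supported extension point ("Override this method to implement a different offsite policy"). So, as a middle ground between hard-coding domains and disabling the check, you could subclass the stock middleware and build the regex from the database instead of the spider attribute. A minimal sketch (untested; DatabaseOffsiteMiddleware, the myproject/myapp paths, and the use of forum_link are my assumptions, not the asker's code):

    # myproject/middlewares.py
    import re
    from urlparse import urlparse  # Python 2

    from scrapy.contrib.spidermiddleware.offsite import OffsiteMiddleware

    from myapp.models import Phpbb  # the Django model from the question

    class DatabaseOffsiteMiddleware(OffsiteMiddleware):
        def get_host_regex(self, spider):
            # Same regex shape as the stock implementation above, but fed
            # from the database rather than from spider.allowed_domains.
            domains = [d.replace('.', r'\.') for d in self.get_domains()]
            if not domains:
                return re.compile('')  # allow all, as in the default
            return re.compile(r'^(.*\.)?(%s)$' % '|'.join(domains))

        def get_domains(self):
            # Reduce each stored forum URL to a bare host name.
            return [urlparse(row.forum_link).netloc
                    for row in Phpbb.objects.filter(disable=False)]

Register it in settings.py in place of the stock middleware:

    SPIDER_MIDDLEWARES = {
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
        'myproject.middlewares.DatabaseOffsiteMiddleware': 500,
    }

Note that the middleware still calls get_host_regex only once, when the spider is opened, so rows added to the database mid-crawl will not be picked up.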
    
