Combining base URL with resulting href in Scrapy

Below is my spider code:

    class Blurb2Spider(BaseSpider):
        name = "blurb2"
        allowed_domains = ["www.domain.com"]

        def start_requests(self):
            yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
            for i in urls:
                yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

        def parse_url(self, response):
            hxs = HtmlXPathSelector(response)
            print response, '------->'

Here I try to combine the href link with the base link, but I get the following error:

 exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do 

Can someone tell me why I get this error, and how to join the base URL with the href link and yield a request?

python url scrapy
2 answers

This is because you have not included the scheme, for example http://, in your base URL.

Try: urlparse.urljoin('http://www.domain.com/', i[1:])

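Or even simpler: urlparse.urljoin(response.url, i), since urlparse.urljoin takes the scheme and host from the base URL. Pass the href unsliced here: with i[1:], a root-relative href such as /bookstore/detail/... loses its leading slash and is resolved against the current page's directory (/bookstore/) instead of the site root.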
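To see the difference between these variants, here is a minimal sketch using only the standard library. It assumes Python 3, where urlparse.urljoin lives at urllib.parse.urljoin, and it assumes the extracted href is root-relative (starts with a slash); the question does not show the actual href value.

```python
from urllib.parse import urljoin, urlparse  # Python 2: from urlparse import urljoin, urlparse

# Assumed values for illustration; the real href comes from the scraped page.
href = "/bookstore/detail/3271993?alt=Something+I+Had+To+Do"
page_url = "http://www.domain.com/bookstore/new"

# Without a scheme, the "base" is parsed as a plain path, not as a host.
no_scheme = urlparse("www.domain.com/")
print(repr(no_scheme.scheme), repr(no_scheme.netloc))

# Joining onto a scheme-less base still yields a URL with no scheme,
# which is what Scrapy's Request rejects.
broken = urljoin("www.domain.com/", href[1:])
print(repr(urlparse(broken).scheme))

# Fix 1: give the base URL a scheme.
fixed = urljoin("http://www.domain.com/", href[1:])

# Fix 2: use the page URL as the base and pass the href unchanged.
from_page = urljoin(page_url, href)

# Slicing off the leading slash with a page URL as base resolves the href
# relative to /bookstore/, which duplicates the path segment.
sliced = urljoin(page_url, href[1:])

print(fixed)
print(from_page)
print(sliced)
```

Both fixes produce the same absolute URL here. The sliced variant only works when the base ends at the site root.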


An alternative solution if you do not want to use urlparse:

response.urljoin(i)

This solution goes a step further: Scrapy works out the base URL for the join itself from the response, so you do not need to supply the obvious http://www.example.com. Pass the href as extracted, without slicing off the leading slash, so that root-relative links resolve against the site root.

This keeps your code reusable if you later change the domain you are crawling.
