Combining base url with resulting href in scrapy

Question


Below is my spider code:

    class Blurb2Spider(BaseSpider):
        name = "blurb2"
        allowed_domains = ["www.domain.com"]

        def start_requests(self):
            yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
            for i in urls:
                yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

        def parse_url(self, response):
            hxs = HtmlXPathSelector(response)
            print response, '------->'

Here I try to combine the href link with the base link, but I get the following error:

 exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do

Can someone tell me why I get this error, and how to join the base URL with the href link and make a request from the result?


python url scrapy

shiva krishna May 29 '12 at 11:20


2 answers

An alternative solution if you do not want to use urlparse:

response.urljoin(i[1:])

This solution goes a step further: Scrapy works out the base URL for the join itself, so you do not need to supply the obvious http://www.example.com yourself.

It also makes your code easier to reuse if you later want to change the domain you are crawling.
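One thing worth checking when the page URL is the base: whether the i[1:] slice is still wanted. The sketch below uses Python 3's urllib.parse.urljoin as a stand-in for the join against the page URL (Scrapy documents response.urljoin as a wrapper around urljoin that uses the response's base URL). The href value is a hypothetical root-relative link, since the question does not show the raw hrefs.

```python
from urllib.parse import urljoin

# The listing page from the question, used as the base for the join.
page_url = "http://www.domain.com/bookstore/new"

# Hypothetical href (assumption: the site emits root-relative links).
href = "/bookstore/detail/3271993"

# Keeping the leading slash resolves the href against the site root.
kept = urljoin(page_url, href)

# Stripping it resolves the href against the current directory, /bookstore/.
stripped = urljoin(page_url, href[1:])

print(kept)      # http://www.domain.com/bookstore/detail/3271993
print(stripped)  # http://www.domain.com/bookstore/bookstore/detail/3271993
```

If the hrefs on the real page look like this, passing them unsliced is probably the safer choice; if they have some other form, the slice may still be needed.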


Ghajba Oct 14 '17 at 15:33


Sjaak trekhaak · Accepted Answer · 2012-05-29T12:07:50+0000

This is because you have not added the scheme, for example http://, to your base URL.

Try: urlparse.urljoin('http://www.domain.com/', i[1:])

Or even simpler: urlparse.urljoin(response.url, i[1:]), since urlparse.urljoin will sort out the base URL itself.
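A minimal sketch of the difference the scheme makes, written for Python 3, where the urlparse module used in this thread is available as urllib.parse. The helper has_scheme is a hypothetical name introduced only for this illustration, and the href is the path from the error message with its leading slash removed.

```python
from urllib.parse import urljoin, urlparse


def has_scheme(url):
    """Return True if the URL carries a scheme such as http or https."""
    return bool(urlparse(url).scheme)


# Relative href, taken from the path in the error message.
href = "bookstore/detail/3271993?alt=Something+I+Had+To+Do"

# Base without a scheme: the joined result has no scheme either,
# which is what Scrapy's Request rejects with "Missing scheme".
no_scheme = urljoin("www.domain.com/", href)

# Base with a scheme: the joined result is a complete absolute URL.
with_scheme = urljoin("http://www.domain.com/", href)

print(has_scheme(no_scheme))    # False
print(has_scheme(with_scheme))  # True
print(with_scheme)
```

The same reasoning explains why response.url works as a base: it is the URL the page was fetched from, so it already includes the scheme.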
