Converting a relative URL to an absolute URL in Scrapy

Question

I need help converting a relative URL to an absolute URL in a Scrapy spider. I need to convert the links on the start pages to absolute URLs so that I can get the images of the scraped items that appear on the start pages. I have tried several approaches without success, and I'm stuck. Any suggestions?

import scrapy
from scrapy import Request


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/billboard",
        "http://www.example.com/billboard?page=1"
    ]

    def parse(self, response):
        image_urls = response.xpath('//div[@class="content"]/section[2]/div[2]/div/div/div/a/article/img/@src').extract()
        relative_url = response.xpath(u'''//div[contains(concat(" ", normalize-space(@class), " "), " content ")]/a/@href''').extract()

        # absolute_urls is not defined yet: it should hold the entries of
        # relative_url converted to absolute URLs, which is the missing step.
        for image_url, url in zip(image_urls, absolute_urls):
            item = ExampleItem()
            item['image_urls'] = image_urls

        request = Request(url, callback=self.parse_dir_contents)
        request.meta['item'] = item
        yield request

scrapy

jacquesseite Mar 18 '16 at 13:38

1 answer

Paulo Romeira · Answer 1 · 2017-08-08T04:03:18+0000

There are three main ways to do this:

Using the urljoin function from urllib:

from urllib.parse import urljoin
# Same as: from w3lib.url import urljoin

url = urljoin(base_url, relative_url)
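As a rough illustration of how this could fill in the missing `absolute_urls` list from the question, here is a minimal sketch using only the standard library. The helper name `to_absolute` and the sample hrefs are made up for the example; only the base URL comes from the question's `start_urls`.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url; already-absolute hrefs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


# The base URL is one of the question's start_urls; the hrefs are hypothetical.
base_url = "http://www.example.com/billboard?page=1"
relative_urls = ["/item/42", "item/43", "http://cdn.example.com/pic.jpg"]

absolute_urls = to_absolute(base_url, relative_urls)
print(absolute_urls)
# ['http://www.example.com/item/42', 'http://www.example.com/item/43', 'http://cdn.example.com/pic.jpg']
```

Note that a relative path replaces the last path segment of the base URL, and the base URL's query string is dropped, so `item/43` resolves next to `billboard` rather than below it.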

Using the response's urljoin wrapper method, as mentioned by Steve:
```
url = response.urljoin(relative_url)
```

If you also want to yield a request from that link, use the response's follow method:

# It will create a new request using the above "urljoin" method
yield response.follow(relative_url, callback=self.parse)
