XPath to select image links: the parent a href if it exists, otherwise the img src

I ran into a somewhat complicated XPath problem. Consider this HTML code for part of a web page (I used Imgur and replaced the text):

<a href="//i.imgur.com/ahreflink.jpg" class="zoom"> <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg"> </img> </a> 

First I want to find all the img tags in the document and their corresponding src values. Then I want to check whether the img src link contains an image file extension (.jpeg, .jpg, .gif, .png). If it does not, it should not be captured. In this case it does have an image extension, so the next question is which link to capture. Since the parent href exists, we must grab that link.

Desired result: //i.imgur.com/ahreflink.jpg

But now suppose the href does not exist:

 <a name="missing! oh no!"> <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg"> </img> </a> 

Desired result: //i.imgur.com/imgsrclink.jpg

How do I build this XPath? In case it helps, I am also using Python (Scrapy) with XPath, so if the problem needs to be split into steps, Python can be used as well.

2 answers

This is very simple to do in a single XPath expression:

 //a[not(@href)]/img/@src | //a[img]/@href 
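The expression above does not check the image extension. XPath 1.0 has no ends-with() function, so one possible approach is a contains() test per extension, which matches the substring anywhere in the src rather than only at the end. The sketch below builds such an expression in Python; the helper name `build_image_link_xpath` and the `IMAGE_EXTENSIONS` tuple are illustrative assumptions, not part of the original answer.

```python
# Hypothetical helper: builds an XPath 1.0 string; the names are illustrative only.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".gif", ".png")


def build_image_link_xpath(extensions=IMAGE_EXTENSIONS):
    """Return an XPath that applies the extension test to the img src."""
    # One contains() test per extension, joined with "or".
    is_image = " or ".join(f"contains(@src, '{ext}')" for ext in extensions)
    # Left branch: the img src when the parent a has no href.
    # Right branch: the href of any a that has an image-like img child.
    return f"//a[not(@href)]/img[{is_image}]/@src | //a[img[{is_image}]]/@href"
```

In Scrapy, the resulting string could be passed to response.xpath(...) in place of the shorter expression.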

You do not need to do this in a single XPath expression. Here is a Scrapy-specific implementation that omits the image extension check (judging by the comments, you have already figured that part out):

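 # Select every img that is a direct child of an a element
 images = response.xpath("//a/img")
 for image in images:
     a_link = image.xpath("../@href").extract_first()
     image_link = image.xpath("@src").extract_first()
     # Prefer the parent href; fall back to the img src when href is missing
     print(a_link or image_link)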
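If the extension check is wanted as well, it can be done in plain Python after extraction. The following is a minimal sketch under the question's stated rules (test the img src, prefer the parent href); the function names `has_image_extension` and `pick_link` are hypothetical and introduced only for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Extensions listed in the question.
IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".gif", ".png"}


def has_image_extension(url):
    """Return True if the URL path ends with one of the image extensions."""
    # urlparse drops any query string or fragment from the path.
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower() in IMAGE_EXTENSIONS


def pick_link(a_href, img_src):
    """Return the link to capture, or None if img_src is not an image."""
    if not img_src or not has_image_extension(img_src):
        return None
    # Prefer the parent href; fall back to the img src when it is missing.
    return a_href or img_src
```

Inside the loop, `pick_link(a_link, image_link)` could replace `a_link or image_link`, skipping the result when it is None.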
