I ran into a somewhat complicated XPath problem. Consider this HTML code for part of a web page (I used Imgur and replaced the text):
<a href="//i.imgur.com/ahreflink.jpg" class="zoom"> <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg"> </img> </a>
First I want to find all the img tags in the document and get their src attributes. Then I want to check whether each img src contains an image file extension (.jpeg, .jpg, .gif, .png). If it does not contain an image extension, it should not be captured. In this case, it does have an image extension. Next I need to decide which link to capture: since the parent a has an href, I want to grab that href rather than the img src.
Desired result: //i.imgur.com/ahreflink.jpg
But now suppose the href does not exist:
<a name="missing! oh no!"> <img class="post-image-placeholder" src="//i.imgur.com/imgsrclink.jpg"> </img> </a>
Desired result: //i.imgur.com/imgsrclink.jpg
How do I build this XPath? If it helps, I am using Python (Scrapy) with XPath, so if the problem needs to be split into separate steps, part of it can be done in Python.
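To pin down the behavior I am after, here is a minimal sketch of the selection rule in plain Python. It uses only the standard library's html.parser, not XPath or Scrapy, so it is a statement of the expected results rather than the solution I am looking for. The names ImageLinkCollector and extract_image_links are made up for illustration, and the sketch treats the nearest enclosing a as the "parent", which is a simplification.

```python
from html.parser import HTMLParser

# Extensions from the question; a src counts if it contains any of them.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".gif", ".png")


class ImageLinkCollector(HTMLParser):
    """Collect one link per <img>: the enclosing <a> href if present, else the img src."""

    def __init__(self):
        super().__init__()
        self._anchor_hrefs = []  # href (or None) for each currently open <a>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._anchor_hrefs.append(attrs.get("href"))
        elif tag == "img":
            src = attrs.get("src") or ""
            # Skip images whose src has no recognised image extension.
            if not any(ext in src.lower() for ext in IMAGE_EXTENSIONS):
                return
            # Assumption: the nearest enclosing <a> stands in for the parent.
            parent_href = self._anchor_hrefs[-1] if self._anchor_hrefs else None
            self.links.append(parent_href or src)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_hrefs:
            self._anchor_hrefs.pop()


def extract_image_links(html):
    """Return the list of captured links for an HTML string."""
    collector = ImageLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Calling extract_image_links on the first snippet above should give the a href, and on the second snippet it should fall back to the img src.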