Using Beautiful Soup to Get the Full URL in the Source Code

I was looking at the source of a web page and came across this bit of HTML:

<img src="/gallery/2012-winners-finalists/HM_Watching%20birds2_Shane%20Conklin_MA_2012.jpg" 

The link is shown in blue in the browser's source view, and clicking it takes you to the full URL where the image is located. I know how to extract what is shown in the source code in Python using Beautiful Soup. How can I get the full URL that you reach when you click the link in the source view?

EDIT: If I am given <a href = "/folder/big/a.jpg" , how can I find the full URL it points to using Python or Beautiful Soup?

+8
python
2 answers
 <a href="/folder/big/a.jpg"> 

This is an absolute path on the current host (a root-relative URL). So if the HTML file is located at http://example.com/foo/bar.html, then the URL /folder/big/a.jpg resolves to:

 http://example.com/folder/big/a.jpg 

That is, take the scheme and host name of the page and apply the new path to them.
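
To make that rule concrete, here is a minimal sketch that does the step by hand with urlsplit and urlunsplit from the standard library (Python 3 assumed). The helper name resolve_root_relative is made up for illustration, and it only covers hrefs that start with a single slash:

```python
from urllib.parse import urlsplit, urlunsplit


def resolve_root_relative(page_url, path):
    """Hypothetical helper: keep the page's scheme and host, swap in the new path."""
    parts = urlsplit(page_url)
    # Query string and fragment of the page URL are dropped on purpose.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


resolved = resolve_root_relative("http://example.com/foo/bar.html", "/folder/big/a.jpg")
print(resolved)  # http://example.com/folder/big/a.jpg
```

This is only meant to show what happens to the host and the path; it does not handle relative paths such as img/a.jpg or hrefs that are already full URLs.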

Python's standard library has a urljoin function that performs this operation for you:

    >>> from urllib.parse import urljoin
    >>> base = 'http://example.com/foo/bar.html'
    >>> href = '/folder/big/a.jpg'
    >>> urljoin(base, href)
    'http://example.com/folder/big/a.jpg'

For Python 2, the function is in the urlparse module.
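
Since the question is about Beautiful Soup, here is a rough sketch of how the two can be combined. It parses an inline HTML string instead of fetching a page, and the page address http://example.com/gallery/index.html is an assumption made for the example; in practice you would use the URL the HTML actually came from:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address; in practice this is the URL the HTML was fetched from.
page_url = "http://example.com/gallery/index.html"

html = """
<img src="/gallery/2012-winners-finalists/HM_Watching%20birds2_Shane%20Conklin_MA_2012.jpg">
<a href="/folder/big/a.jpg">big</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Resolve every src and href attribute against the page URL.
image_urls = [urljoin(page_url, img["src"]) for img in soup.find_all("img", src=True)]
link_urls = [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]

print(image_urls)
print(link_urls)
```

Passing src=True or href=True to find_all skips tags that lack the attribute, so the lookups inside the list comprehensions cannot raise a KeyError.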

+20
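    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("http://example.com")
    base_url = r.url   # URL of the page (after any redirects); this is the base URL
    data = r.content   # content of the page

    soup = BeautifulSoup(data, "lxml")  # the "lxml" parser requires the lxml package
    href = soup.find("a")["href"]       # you need to adjust this selector

    # urljoin returns href unchanged if it is already a full http:// or https:// URL;
    # otherwise it resolves href against the page URL. Plain concatenation
    # (base_url + href) would give wrong results, e.g. a doubled slash.
    full_url = urljoin(base_url, href)
    print(full_url)  # this is your full URL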
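
For reference, here is a small sketch of the kinds of href values a page can contain and what urljoin makes of each one. The base URL and the hrefs are made-up examples:

```python
from urllib.parse import urljoin

# Hypothetical page URL used as the base for every lookup below.
base = "http://example.com/foo/bar.html"

hrefs = [
    "http://other.example.org/x.png",  # already a full URL
    "/folder/big/a.jpg",               # root-relative path
    "img/a.jpg",                       # relative to the page's directory
    "../a.jpg",                        # one directory up
    "//cdn.example.net/a.jpg",         # protocol-relative, reuses the base scheme
]

resolved = {href: urljoin(base, href) for href in hrefs}

for href, full in resolved.items():
    print(href, "->", full)
```

Checking only for an http:// or https:// prefix and concatenating otherwise would handle the first case but not the last three, which is why leaving the resolution to urljoin tends to be the safer choice.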
0