Using Beautiful Soup to Get the Full URL in the Source Code

Question


I was looking at a page's source code and came across this bit of code:

<img src="/gallery/2012-winners-finalists/HM_Watching%20birds2_Shane%20Conklin_MA_2012.jpg"

The link is shown in blue in the source view, and clicking it takes you to the full URL where the image is located. I know how to get what is shown in the source code from Python using Beautiful Soup. How do I get the full URL that you reach when you click the link in the source code?

EDIT: If I am given <a href = "/folder/big/a.jpg" , how can I find the full URL it points to using Python or Beautiful Soup?


python

user2476540 Jul 31 '13 at 13:59


2 answers

poke · Answer 1 · 2013-08-01T16:24:04+0000

 <a href="/folder/big/a.jpg">

This is an absolute path on the current host. So if the HTML file is located at http://example.com/foo/bar.html, then resolving the URL /folder/big/a.jpg results in the following:

 http://example.com/folder/big/a.jpg

That is, the host name is kept and the new path is applied to it.

Python's standard library has a urljoin function that performs this operation for you:

 >>> from urllib.parse import urljoin
 >>> base = 'http://example.com/foo/bar.html'
 >>> href = '/folder/big/a.jpg'
 >>> urljoin(base, href)
 'http://example.com/folder/big/a.jpg'

For Python 2, the function is in the urlparse module.
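
As a rough sketch of how urljoin treats different kinds of references, the snippet below resolves a few href values against the same base. Apart from /folder/big/a.jpg, the sample hrefs and the other host names are made up for illustration, and the try/except import is just one common way to cover both Python 3 and Python 2.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

base = 'http://example.com/foo/bar.html'  # assumed URL of the page

# Hypothetical href values, one per kind of reference
samples = [
    '/folder/big/a.jpg',           # absolute path: replaces the whole path
    'a.jpg',                       # relative path: resolved against /foo/
    '../a.jpg',                    # parent directory of /foo/
    'http://other.example/a.jpg',  # already a full URL: returned unchanged
    '//cdn.example.com/a.jpg',     # scheme-relative: takes the base scheme
]

resolved = [urljoin(base, href) for href in samples]
for href, full in zip(samples, resolved):
    print('%s -> %s' % (href, full))
```

Because urljoin already returns full URLs unchanged, there is usually no need to check for http:// or https:// yourself before calling it.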

Biplob das · Answer 2 · 2019-10-11T05:43:41+0000

 from urllib.parse import urljoin

 import requests
 from bs4 import BeautifulSoup

 r = requests.get("http://example.com")
 base_url = r.url  # this is the base url (after any redirects)
 data = r.content  # this is the content of the page
 soup = BeautifulSoup(data, 'lxml')  # the lxml package must be installed
 temp_url = soup.find('a')['href']  # you need to modify this selector
 if temp_url.startswith(("http://", "https://")):  # already a full url
     url = temp_url
 else:
     url = urljoin(base_url, temp_url)  # resolve the relative url against the base
 print(url)  # this is your full url
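
To tie this back to the question, here is a minimal sketch that resolves every img src and a href on a page with urljoin. It parses an inline HTML string instead of fetching a page, so page_url and the markup are assumptions for illustration (in a real script the base would come from r.url); the first two paths are the ones from the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed URL that the HTML below was fetched from
page_url = 'http://example.com/foo/bar.html'

# Hypothetical markup reusing the two paths from the question
html = '''
<img src="/gallery/2012-winners-finalists/HM_Watching%20birds2_Shane%20Conklin_MA_2012.jpg">
<a href="/folder/big/a.jpg">big</a>
<a href="http://other.example/b.jpg">elsewhere</a>
'''

soup = BeautifulSoup(html, 'html.parser')  # stdlib parser, so lxml is not needed

full_urls = []
for tag in soup.find_all(['img', 'a']):
    # img tags carry the link in src, a tags carry it in href
    link = tag.get('src') or tag.get('href')
    if link:
        full_urls.append(urljoin(page_url, link))

for url in full_urls:
    print(url)
```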
