Getting attribute value using BeautifulSoup

Question

I am writing a Python script that parses a web page and retrieves the locations of the scripts it references. There are two cases:

<script type="text/javascript" src="http://example.com/something.js"></script>

and

 <script>some JS</script>

I can get the JS in the second case, that is, when the JS is written inside the tags.

But is there any way to get the src value in the first case, i.e. retrieve the src attribute value of every script tag (e.g. http://example.com/something.js)?

Here is my code:

    #!/usr/bin/python
    import requests
    from bs4 import BeautifulSoup

    r = requests.get("http://rediff.com/")
    data = r.text
    soup = BeautifulSoup(data)
    for n in soup.find_all('script'):
        print n

Output : some JS

Required output : http://example.com/something.js

+8

python python-2.7 beautifulsoup

aditya.gupta Sep 11 '13 at 5:03

3 answers

Get 'src' from each script node with n.get('src'). It returns None for a script tag that has no src attribute.

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("http://rediff.com/")
    data = r.text
    soup = BeautifulSoup(data)
    for n in soup.find_all('script'):
        print "src:", n.get('src')  # <==== the changed line
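As a small offline sketch of the same idea in Python 3 syntax: the sample markup below is only the two cases from the question joined together, and the explicit "html.parser" argument is an assumption added here so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Sample markup: the two cases from the question, joined into one string.
html = (
    '<script type="text/javascript" src="http://example.com/something.js"></script>'
    '<script>some JS</script>'
)

# Naming the parser explicitly (an assumption, not in the original answer)
# keeps the result independent of which parsers happen to be installed.
soup = BeautifulSoup(html, "html.parser")

# Tag.get() returns None instead of raising when the attribute is absent.
srcs = [tag.get("src") for tag in soup.find_all("script")]
print(srcs)  # ['http://example.com/something.js', None]
```

Because the inline script yields None, a caller that only wants URLs would still need to drop the None entries.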

+5

rajpy Sep 11 '13 at 5:16

This should work: find all script tags, then check whether each one has a 'src' attribute. If it does, the JavaScript URL is in the src attribute; otherwise we assume the JavaScript is inside the tag.

    #!/usr/bin/python
    import requests
    from bs4 import BeautifulSoup

    # Test HTML which has both cases
    html = '<script type="text/javascript" src="http://example.com/something.js">'
    html += '</script> <script>some JS</script>'
    soup = BeautifulSoup(html)

    # Find all script tags
    for n in soup.find_all('script'):
        # Check if the src attribute exists, and if it does grab the source URL
        if 'src' in n.attrs:
            javascript = n['src']
        # Otherwise assume that the javascript is contained within the tags
        else:
            javascript = n.text
        print javascript

The output of this is:

    http://example.com/something.js
    some JS
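The same branching can be wrapped in a function, sketched here in Python 3 syntax. The function name describe_scripts and the "external"/"inline" labels are hypothetical, chosen only for illustration. The sketch also reads the inline code through .string rather than .text, since .string returns a tag's single text child directly.

```python
from bs4 import BeautifulSoup


def describe_scripts(html):
    """Return ("external", url) or ("inline", code) for each <script> tag.

    The function name and the two labels are illustrative, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all("script"):
        if tag.has_attr("src"):
            # External script: the URL lives in the src attribute.
            results.append(("external", tag["src"]))
        else:
            # Inline script: .string is None for an empty tag, so fall back to "".
            results.append(("inline", str(tag.string or "")))
    return results


html = (
    '<script type="text/javascript" src="http://example.com/something.js"></script>'
    ' <script>some JS</script>'
)
for kind, value in describe_scripts(html):
    print(kind, value)
```

Returning pairs instead of printing directly lets the caller decide what to do with URLs and inline code separately.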

+1

Ashok fernandez Sep 11 '13 at 9:40

Venkateshwaran selvaraj · Accepted Answer · 2013-09-11T09:42:26+0000

This retrieves the src values only from script tags that have one. Any <script> without a src attribute is skipped.

    from bs4 import BeautifulSoup
    import urllib2

    url = "http://rediff.com/"
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    sources = soup.findAll('script', {"src": True})
    for source in sources:
        print source['src']

I get the following two src values as a result:

    http://imworld.rediff.com/worldrediff/js_2_5/ws-global_hm_1.js
    http://im.rediff.com/uim/common/realmedia_banner_1_5.js
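For readers on Python 3, where urllib2 is not available, the filtering step can be tried offline on a fixed string. This is a minimal sketch, not the answer's original code: the markup and the a.js / b.js URLs are made up for illustration, and the script[src] CSS selector is shown only as an equivalent way to state the same condition.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: two external scripts and one inline script.
html = """
<html><head>
<script src="http://example.com/a.js"></script>
<script>var inline = 1;</script>
<script type="text/javascript" src="http://example.com/b.js"></script>
</head></html>
"""
soup = BeautifulSoup(html, "html.parser")

# src=True matches only tags that carry a src attribute, whatever its value.
by_filter = [tag["src"] for tag in soup.find_all("script", src=True)]

# The CSS attribute selector script[src] expresses the same condition.
by_selector = [tag["src"] for tag in soup.select("script[src]")]

print(by_filter)  # ['http://example.com/a.js', 'http://example.com/b.js']
```

Since the filter guarantees the attribute exists, indexing with tag["src"] cannot raise KeyError here.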

I think this is what you want. Hope this is helpful.
