Python web scraping - how to get resources with beautiful soup when a page loads content via JS?

Question

So, I'm trying to scrape a table from a specific site using BeautifulSoup and urllib. My goal is to create a single list of all the data in this table. I tried the same code on tables from other sites, and it works great. However, on this site, the table lookup returns a NoneType object. Can someone help me with this? I looked for other answers on the Internet, but had no luck.

Here is the code:

    import requests
    import urllib
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib.request.urlopen("http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct").read())
    table = soup.find("table", attrs={'class':'sortable'})
    data = []
    rows = table.findAll("tr")
    for tr in rows:
        cols = tr.findAll("td")
        for td in cols:
            text = ''.join(td.find(text=True))
            data.append(text)
    print(data)

+5

python screen-scraping urllib beautifulsoup

QwErTy99 Apr 20 '15 at 16:47

3 answers

The table on this website is created through JavaScript, so it does not exist in the raw HTML source that you pass to BeautifulSoup.

Either you need to dig around with the web inspector of your choice and find out where the JavaScript gets its data from, or you should use something like Selenium to launch a full browser instance.
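
The effect can be reproduced offline. The sketch below parses a made-up HTML string (the markup and the `stats-container` id are assumptions for illustration, not the real page) in which the table would only be created by a script. BeautifulSoup does not execute scripts, so `find` returns `None`, which matches the NoneType result described in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML, as a server might send it before any script runs.
STATIC_HTML = """
<html><body>
  <div id="stats-container"></div>
  <script>/* a script would build the sortable table here */</script>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")
table = soup.find("table", attrs={"class": "sortable"})

if table is None:
    # The parser only sees the downloaded markup; the script never ran.
    print("No table in the static HTML; it is probably built by JavaScript.")
else:
    print(len(table.find_all("tr")), "rows found")
```

Checking for `None` before calling `find_all` on the result turns the confusing attribute error into a clear message.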

+4

Eric Apr 20 '15 at 16:55

Because the table data is loaded dynamically, there is some lag before it appears, for reasons such as network latency. You can therefore wait for a while and then read the data. Check whether the table data is empty (that is, whether its length is zero); if so, read the table data again after some delay. This will help.
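
A minimal sketch of that check-and-retry idea, using only the standard library. The helper name `read_with_retry` and the stand-in fetcher are hypothetical. Note that retrying can only help if each attempt reads fresh content from something that actually runs the page's scripts (for example a browser driver); parsing the same downloaded HTML again will not make the table appear.

```python
import time


def read_with_retry(fetch_rows, attempts=5, delay=1.0):
    """Call fetch_rows() until it returns a non-empty list or attempts run out."""
    rows = []
    for attempt in range(attempts):
        rows = fetch_rows()
        if rows:  # non-zero length: the data has arrived
            return rows
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next read
    return rows


# Stand-in fetcher (made up for this example): empty twice, then data.
_responses = iter([[], [], ["cell-1", "cell-2"]])
result = read_with_retry(lambda: next(_responses), attempts=5, delay=0.01)
print(result)
```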

I looked at the URL you used. Since you are selecting the table by class, check whether that class also appears in other places in the HTML.

0

shri Apr 20 '15 at 17:03

Farmer joe · Accepted Answer · 2015-04-20T16:56:54+0000

It looks like this data is being loaded through an AJAX call.

Instead, you should target this URL: http://www.teamrankings.com/ajax/league/v3/stats_controller.php

    import urllib.parse
    import urllib.request
    from bs4 import BeautifulSoup

    # Form fields sent with the AJAX request (copied from the web inspector).
    params = {
        "type": "team-detail",
        "league": "ncb",
        "stat_id": "3083",
        "season_id": "312",
        "cat_type": "2",
        "view": "stats_v1",
        "is_previous": "0",
        "date": "04/06/2015"
    }

    # Passing data= makes urlopen send a POST request.
    url = "http://www.teamrankings.com/ajax/league/v3/stats_controller.php"
    body = urllib.parse.urlencode(params).encode('utf8')
    content = urllib.request.urlopen(url, data=body).read()

    soup = BeautifulSoup(content, "html.parser")
    table = soup.find("table", attrs={'class': 'sortable'})

    data = []
    for tr in table.find_all("tr"):
        for td in tr.find_all("td"):
            text = td.find(string=True)  # first text node in the cell
            if text is not None:
                data.append(str(text))
    print(data)

Using your web inspector, you can also view the parameters that are passed along with the POST request.
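
As a rough way to compare your request with the one shown in the inspector, the sketch below builds the POST request with the standard library but does not send it. It only prints the method and the form-encoded body. The URL and parameter values are the ones from this answer; whether the server still accepts them is not something this example checks.

```python
import urllib.parse
import urllib.request

URL = "http://www.teamrankings.com/ajax/league/v3/stats_controller.php"

params = {
    "type": "team-detail",
    "league": "ncb",
    "stat_id": "3083",
    "season_id": "312",
    "cat_type": "2",
    "view": "stats_v1",
    "is_previous": "0",
    "date": "04/06/2015",
}

# urlencode produces key=value pairs joined by "&", with special
# characters percent-encoded; the body must be bytes.
body = urllib.parse.urlencode(params).encode("utf8")

# A Request that carries data is sent as POST by urllib.
request = urllib.request.Request(URL, data=body)

print(request.get_method())
print(body.decode("utf8"))
```

If the printed body differs from the form data listed in the inspector, a missing or misspelled field is a likely reason for a rejected request.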

Typically, the server at the other end checks these values and rejects your request if some or all of them are missing. The above code snippet worked fine for me. I switched to urllib2 because I usually prefer that library (in Python 3, the same functionality lives in urllib.request and urllib.parse).

If the data loads in your browser, you can scrape it. You just need to emulate the request sent by your browser.
