I am trying to scrape nature.com to do some analysis of magazine articles. When I do the following:

    import requests
    from bs4 import BeautifulSoup
    import re

    query = "http://www.nature.com/search?journal=nature&order=date_desc"
    for page in range(1, 10):
        req = requests.get(query + "&page=" + str(page))
        soup = BeautifulSoup(req.text)
        cards = soup.findAll("li", "mb20 card cleared")
        matches = re.findall('mb20 card cleared', req.text)
        print(len(cards), len(matches))

I expect the BeautifulSoup count to be 25 (the number of search results) on every line, once for each of the 9 pages requested, but it is not. Instead, the script prints:

    14, 25
    12, 25
    25, 25
    15, 25
    15, 25
    17, 25
    17, 25
    15, 25
    14, 25

Looking at the HTML source, you can see that there are 25 results per page, but BeautifulSoup seems to get confused here, and I cannot understand why.
Update 1: In case it matters, I am on Mac OS X Mavericks, using Anaconda Python 2.7.10 and bs4 version 4.3.1.
Update 2: I added a regex to show that req.text really contains what I'm looking for, but BeautifulSoup doesn't find it.
Update 3: When I run this simple script several times, I sometimes get "Segmentation Error: 11". I don't know why.
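
One further check that might narrow this down is sketched below. It assumes the discrepancy comes from whichever parser BeautifulSoup selects when none is named, which is not confirmed here. It parses the same markup once per explicitly named parser and counts the cards each time. The helper `count_cards` and the `SAMPLE_HTML` string are hypothetical names introduced only for illustration. In practice `req.text` would be passed in place of the sample.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for one page of req.text (not real nature.com markup).
SAMPLE_HTML = """
<ul>
  <li class="mb20 card cleared"><h2>First result</h2></li>
  <li class="mb20 card cleared"><h2>Second result</h2></li>
  <li class="mb20 card cleared"><h2>Third result</h2></li>
</ul>
"""


def count_cards(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of matching <li> cards} for each installed parser."""
    counts = {}
    for name in parsers:
        try:
            # Name the parser explicitly instead of letting bs4 choose one.
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser is not installed; skip it.
            continue
        # Same search as the original findAll("li", "mb20 card cleared"):
        # match <li> tags whose class attribute is exactly this string.
        counts[name] = len(soup.find_all("li", class_="mb20 card cleared"))
    return counts


if __name__ == "__main__":
    for name, n in sorted(count_cards(SAMPLE_HTML).items()):
        print(name, n)
```

If the counts differ between parsers for the same `req.text`, that would point at the parser rather than at the page or the search call. If they all agree with the regex count, the cause lies elsewhere.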