BeautifulSoup cannot find all matching elements

I am trying to scrape nature.com to do some analysis of magazine articles. When I do the following:

    import requests
    from bs4 import BeautifulSoup
    import re

    query = "http://www.nature.com/search?journal=nature&order=date_desc"
    for page in range(1, 10):
        req = requests.get(query + "&page=" + str(page))
        soup = BeautifulSoup(req.text)
        cards = soup.findAll("li", "mb20 card cleared")
        matches = re.findall('mb20 card cleared', req.text)
        print(len(cards), len(matches))

I expect BeautifulSoup to print "25" (the number of search results) nine times, once for each page, but it does not. Instead, it prints:

    14, 25
    12, 25
    25, 25
    15, 25
    15, 25
    17, 25
    17, 25
    15, 25
    14, 25

Looking at the HTML source, you can see that there should be 25 results per page, but BeautifulSoup seems to get confused here, and I cannot understand why.

Update 1: In case it matters, I am working on Mac OS X Mavericks with Anaconda Python 2.7.10 and bs4 version 4.3.1.

Update 2: I added a regex to show that req.text really does contain what I am looking for, even though BeautifulSoup does not find it.

Update 3: When I run this simple script several times, I sometimes get "Segmentation fault: 11". I do not know why.

1 answer

The counts differ because of differences between the parsers BeautifulSoup uses under the hood.

Unless you explicitly specify a parser, BeautifulSoup picks one based on its internal ranking:

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.

Specify the parser explicitly:

    soup = BeautifulSoup(data, 'html5lib')
    soup = BeautifulSoup(data, 'html.parser')
    soup = BeautifulSoup(data, 'lxml')
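To see why this matters, here is a minimal sketch (not part of the original answer) showing that the same malformed fragment can produce different trees depending on which parser BeautifulSoup delegates to; lxml and html5lib are optional third-party installs, so the loop skips any that are missing:

```python
from bs4 import BeautifulSoup

# A malformed fragment: an <a> tag "closed" by a stray </p>.
snippet = "<a></p>"

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        # Each parser repairs the broken markup in its own way, so the
        # resulting tree (and hence what find_all returns) can differ.
        print(parser, "->", BeautifulSoup(snippet, parser))
    except Exception:
        # The third-party parsers may simply not be installed.
        print(parser, "-> not installed")
```

Pinning one parser (for example 'html5lib', the most lenient of the three) makes the per-page counts consistent instead of depending on whichever library happens to be installed on a given machine.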
