BeautifulSoup returns only what is inside the head tag

Question


I am working with BeautifulSoup and have either run into a bug or made a mistake of my own. In my example, I am scraping one of the NY Times section pages:

    import urllib2
    from bs4 import BeautifulSoup

    website = "http://www.nytimes.com/pages/politics/index.html"
    data = BeautifulSoup(urllib2.urlopen(website).read())
    print data

When I run the code, I get back only the title tag and its contents. Nothing inside the body tag is captured. If I change the website URL to http://www.nytimes.com, BS returns the full page source. What is going on here, and why don't I get a body tag when parsing http://www.nytimes.com/pages/politics/index.html?


python url web-crawler beautifulsoup

jason328 Jan 14 '13 at 2:10


1 answer

Abhijit · Accepted Answer · 2013-01-14T07:29:53+0000

This is not a bug in BeautifulSoup. The problem is that bs4 is using the built-in HTMLParser, which is not very lenient with malformed HTML. As the W3C validation service shows, the page's HTML really is malformed: it contains unclosed, stray, and otherwise invalid tags, which cause HTMLParser, and in turn BeautifulSoup, to stop parsing abruptly.

This issue is described in the following bug report filed against BeautifulSoup:

BS4 stops parsing after invalid tag
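One possible workaround, not stated in the answer above, is to name a more lenient parser explicitly when building the soup. The sketch below assumes Python 3 with bs4 installed, and that html5lib or lxml may or may not be available. The helper `make_soup` and the sample markup are hypothetical illustrations, not the NY Times page; the helper falls back to the built-in `html.parser` when nothing else is installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed sample markup (hypothetical, not the NY Times page):
# it contains an unclosed <div> and a stray </b>.
SAMPLE_HTML = """<html><head><title>Politics</title></head>
<body><p>First paragraph<div>Unclosed div</p></b>
<p>Second paragraph</body></html>"""


def make_soup(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    Returns the soup together with the name of the parser that was used.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not installed.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup(SAMPLE_HTML)
print(parser_used)            # which parser was actually picked
print(soup.title.string)      # contents of the title tag
print(soup.body is not None)  # whether a body tag was parsed
```

If the body is still missing with one parser, changing the order of `parsers` makes it easy to compare how each one copes with the same markup.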
