Problems with the BeautifulSoup parser

I am trying to parse an HTML page using BeautifulSoup, but it seems that BeautifulSoup doesn't like this page at all. When I run the code below, the prettify() method returns only the script block of the page (see below). Does anyone have an idea why this is happening?

    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
    html = "".join(urllib2.urlopen(url).readlines())

    print "-- HTML ------------------------------------------"
    print html
    print "-- BeautifulSoup ---------------------------------"
    print BeautifulSoup(html).prettify()

This is the output produced by BeautifulSoup:

    -- BeautifulSoup ---------------------------------
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <script language="JavaScript">
    <!--
    function highlight(img) {
        document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
    }
    function unhighlight(img) {
        document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
    }
    //-->
    </script>

Thanks!

UPDATE: I am using the following version, which is apparently the latest.

    __author__ = "Leonard Richardson (leonardr@segfault.org)"
    __version__ = "3.1.0.1"
    __copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
    __license__ = "New-style BSD"
7 answers

Try using version 3.0.7a, as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0, so the parser had to be changed from SGMLParser to HTMLParser, which is more fragile when faced with bad HTML.

From the changelog for BeautifulSoup 3.1:

"Beautiful Soup is now based on HTMLParser, not on SGMLParser, which went into Python 3. There's some bad HTML processed by SGMLParser, but HTMLParser is not."


Try lxml. Despite its name, it is also designed for parsing and cleaning HTML. It is much, much faster than BeautifulSoup, and it even handles broken HTML better than BeautifulSoup does, so it may work better for you. It also has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

Ian Bicking agrees:

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python is forbidden.
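As an illustration, here is a minimal sketch of parsing the same page with lxml.html (lxml's HTML parser recovers from broken markup on its own; the XPath query is just an example):

    import urllib2
    import lxml.html

    url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
    html = urllib2.urlopen(url).read()

    # fromstring() builds a tree even from badly broken HTML
    doc = lxml.html.fromstring(html)

    # Example query: print every link target on the page
    for href in doc.xpath('//a/@href'):
        print href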


BeautifulSoup is not magic: if the incoming HTML is too terrible, it will not work.

In this case, the incoming HTML is exactly that: too broken for BeautifulSoup to figure out what to do. For instance, it contains markup like:

    SCRIPT type = "javascript" "

(Note the stray double quote at the end.)

The BeautifulSoup documentation has a section on what you can do if BeautifulSoup cannot parse your markup. You will need to look into those alternatives.
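One such alternative is to pre-clean the markup yourself before handing it to BeautifulSoup. A minimal sketch, assuming the only problem is the stray quote shown above (the regular expression is an illustration, not a general-purpose HTML fixer):

    import re
    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
    html = urllib2.urlopen(url).read()

    # Collapse a doubled quote after an attribute value,
    # e.g. <SCRIPT type = "javascript" "> -> <SCRIPT type = "javascript">
    html = re.sub(r'"\s+"(\s*>)', r'"\1', html)

    soup = BeautifulSoup(html)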


Sam: when I get errors like HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>", I just remove the culprit from the markup before feeding it to BeautifulSoup, and everything is dandy:

    html = urllib2.urlopen(url).read()
    html = html.replace("</scr' + 'ipt>", "")
    soup = BeautifulSoup(html)

I also had trouble parsing the following code:

    <script>
    function show_ads() {
        document.write("<div><sc"+"ript type='text/javascript' src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></scr"+"ipt></div>");
    }
    </script>

HTMLParseError: bad end tag: u"</scr' + 'ipt>", at line 26, column 127

Sam


I tested this script with BeautifulSoup version 3.0.7a and it returns what looks like correct output. I don't know what changed between 3.0.7a and 3.1.0.1, but give it a try:

    >>> import urllib
    >>> from BeautifulSoup import BeautifulSoup
    >>> page = urllib.urlopen('http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1')
    >>> soup = BeautifulSoup(page)
    >>> soup.prettify()

In my case, running the statements above returns the entire HTML page.

