I am trying to parse an HTML page using BeautifulSoup, but BeautifulSoup doesn't seem to like the HTML of this page. When I run the code below, the prettify() method returns only the start of the page, ending with the first script block (see the output below). Does anyone have an idea why this is happening?
    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
    html = "".join(urllib2.urlopen(url).readlines())
    print "-- HTML ------------------------------------------"
    print html
    print "-- BeautifulSoup ---------------------------------"
    print BeautifulSoup(html).prettify()
This is the output produced by BeautifulSoup:
    -- BeautifulSoup ---------------------------------
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <script language="JavaScript">
     <!--
     function highlight(img) {
       document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
     }
     function unhighlight(img) {
       document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
     }
    </script>
Thanks!
UPDATE: I am using the following version, which is apparently the latest:
    __author__ = "Leonard Richardson (leonardr@segfault.org)"
    __version__ = "3.1.0.1"
    __copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
    __license__ = "New-style BSD"