BeautifulSoup 3.1 parser breaks too easily

I'm having trouble parsing some dodgy HTML using BeautifulSoup. It turns out that the HTMLParser used in newer versions is less tolerant than the previously used SGMLParser.


Does BeautifulSoup have some sort of debugging mode? I am trying to figure out how to stop it from choking on some nasty HTML that I am downloading from a crabby website:

<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>

BeautifulSoup gives up after the <HTTP-EQUIV...> tag:

In [1]: print BeautifulSoup(c).prettify()
<html>
 <head>
  <title>
   Title
  </title>
 </head>
</html>

The problem is clearly the HTTP-EQUIV tag, which is really a badly malformed <META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"> tag. Evidently I need to declare it as self-closing, but no matter what I specify, I cannot fix it:

In [2]: print BeautifulSoup(c,selfClosingTags=['http-equiv',
                            'http-equiv="pragma"']).prettify()
<html>
 <head>
  <title>
   Title
  </title>
 </head>
</html>

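Is there a verbose debugging mode in which BeautifulSoup tells me what it is doing, so that I can see what it treats as the tag name in this case?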

+3
3

I cannot reproduce your problem; it works fine for me:

In [1]: import BeautifulSoup

In [2]: c = """<HTML>
   ...:     <HEAD>
   ...:         <TITLE>Title</TITLE>
   ...:         <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
   ...:     </HEAD>
   ...:     <BODY>
   ...:         ...
   ...:         ...
   ...:     </BODY>
   ...: </HTML>
   ...: """

In [3]: print BeautifulSoup.BeautifulSoup(c).prettify()
<html>
 <head>
  <title>
   Title
  </title>
  <http-equiv>
  </http-equiv>
 </head>
 <body>
  ...
        ...
 </body>
</html>


In [4]: 

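That is Python 2.5.2 with BeautifulSoup 3.0.7a. Perhaps it behaves differently in older or newer versions? This is exactly the kind of soup BeautifulSoup is meant to handle, so I doubt that changed at some point... Is there something else about the problem that you have not mentioned?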

+2

Having problems with Beautiful Soup 3.1.0? One recommended workaround is to use html5lib's parser.

#!/usr/bin/env python
from html5lib import HTMLParser, treebuilders

parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

c = """<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>"""

soup = parser.parse(c)
print soup.prettify()

Output:

<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  <http-equiv="pragma" content="NO-CACHE">
   ...
        ...
  </http-equiv="pragma">
 </body>
</html>

As the output shows, html5lib keeps the body but does not repair the malformed tag in this case.
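
The output above renders the malformed tag as `<http-equiv="pragma" ...>`, which suggests the tokenizer folds `="PRAGMA"` into the tag name. One way to check what a parser takes as the tag name is to log its tag events. The following is a minimal sketch using Python 3's standard-library `html.parser.HTMLParser`, which is an assumption on my part: it is not the parser that BeautifulSoup 3.x or html5lib use here, and `TagTracer` is a hypothetical name.

```python
from html.parser import HTMLParser

# The snippet from the question, unchanged.
HTML = """<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>"""


class TagTracer(HTMLParser):
    """Hypothetical helper: record every start and end tag event."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # tag is the (lowercased) name the tokenizer decided on
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, []))


tracer = TagTracer()
tracer.feed(HTML)
tracer.close()

# Print one line per event so the name given to the broken tag is visible.
for kind, tag, attrs in tracer.events:
    print(kind, repr(tag), attrs)
```

The printed list makes it easy to spot which name the broken tag received and whether parsing continued into the body afterwards.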

+6

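Try lxml (and its html module). Despite the name, it is also meant for parsing and scraping HTML. It is much, much faster than BeautifulSoup, and it handles "broken" HTML even better than BeautifulSoup does. It also has a compatibility API for BeautifulSoup, in case you do not want to learn the lxml API.

Ian Bicking agrees.

There is no reason to use BeautifulSoup anymore, unless you are on Google App Engine or somewhere else that does not allow anything that is not pure Python.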
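
Where only pure Python is available, another option is to repair the markup before handing it to any parser. The sketch below uses only the standard library and assumes the sole damage is the missing META element name in front of HTTP-EQUIV, as in the question; `repair_meta` and `BROKEN_META` are hypothetical names, not part of any library.

```python
import re

# Match the "<" of a tag that starts directly with HTTP-EQUIV=.
# Assumption: the only damage is the missing META element name.
BROKEN_META = re.compile(r"<\s*(?=HTTP-EQUIV\s*=)", re.IGNORECASE)


def repair_meta(html):
    """Hypothetical helper: turn <HTTP-EQUIV=...> into <META HTTP-EQUIV=...>."""
    return BROKEN_META.sub("<META ", html)


broken = '<HEAD><TITLE>Title</TITLE><HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"></HEAD>'
fixed = repair_meta(broken)
print(fixed)
```

The repaired string can then be passed to whichever parser is in use. Tags that already start with META are left alone, so applying the function twice does no harm.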

+3
