BeautifulSoup 3.1 parser breaks too easily

Question

I'm having trouble parsing some dodgy HTML using BeautifulSoup. It turns out that the HTMLParser used in newer versions is less tolerant than the previously used SGMLParser.

Does BeautifulSoup have some sort of debugging mode? I am trying to figure out how to stop it choking on some nasty HTML that I am downloading from a crabby website:

<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>

BeautifulSoup gives up after the <HTTP-EQUIV...> tag:

In [1]: print BeautifulSoup(c).prettify()
<html>
 <head>
  <title>
   Title
  </title>
 </head>
</html>

The problem is clearly the HTTP-EQUIV tag, which is really a badly malformed <META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"> tag. Evidently I need to mark it as self-closing, but no matter what I specify, I cannot fix it:

In [2]: print BeautifulSoup(c,selfClosingTags=['http-equiv',
                            'http-equiv="pragma"']).prettify()
<html>
 <head>
  <title>
   Title
  </title>
 </head>
</html>

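Is there a verbose mode in which BeautifulSoup reports what it is doing, so that I can work out what it treats as the tag name here?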
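
One way to see what a parser takes as the tag name is to log the events it reports. The sketch below uses Python 3's standard-library html.parser.HTMLParser rather than BeautifulSoup itself, so it only approximates what BeautifulSoup 3.1 sees. TagLogger and log_tags are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start and end tag the parser reports (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # tag is the name as the parser understood it; attrs is a list of pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, []))


def log_tags(markup):
    """Feed markup to the parser and return the list of recorded tag events."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    sample = '<HEAD><TITLE>Title</TITLE><HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"></HEAD>'
    for kind, tag, attrs in log_tags(sample):
        print(kind, repr(tag), attrs)
```

If the name reported for the malformed tag still carries the ="pragma" part, the parser is treating the whole of HTTP-EQUIV="PRAGMA" as the tag name.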

+3

python html parsing beautifulsoup

Mat 19 . '09 23:07

3

Having problems with Beautiful Soup 3.1.0? One suggested workaround is to use the html5lib parser.

#!/usr/bin/env python
from html5lib import HTMLParser, treebuilders

parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

c = """<HTML>
    <HEAD>
        <TITLE>Title</TITLE>
        <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
    </HEAD>
    <BODY>
        ...
        ...
    </BODY>
</HTML>"""

soup = parser.parse(c)
print soup.prettify()

Output:

<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  <http-equiv="pragma" content="NO-CACHE">
   ...
        ...
  </http-equiv="pragma">
 </body>
</html>

As the output shows, html5lib does not fix the problem in this case.
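
Given the output above, another option is to repair the markup before any parser sees it. This is a minimal sketch, assuming the only damage is a META tag that lost its name. repair_meta and BROKEN_META are hypothetical names, and the pattern is tailored to this one page rather than being a general fix.

```python
import re

# Matches "<" followed directly by HTTP-EQUIV=, i.e. a META tag missing its name.
# A well-formed <META HTTP-EQUIV=...> does not match, so it is left alone.
BROKEN_META = re.compile(r"<\s*(HTTP-EQUIV\s*=)", re.IGNORECASE)


def repair_meta(markup):
    """Insert the missing META tag name in front of a bare HTTP-EQUIV attribute."""
    return BROKEN_META.sub(r"<META \1", markup)


if __name__ == "__main__":
    print(repair_meta('<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">'))
```

The repaired string can then be handed to the parser in place of the original c.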

+6

jfs 12 . '09 13:20

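Try lxml (and its html module). Despite the name, it also parses HTML. It is considerably faster than BeautifulSoup, and it copes with "broken" HTML better than BeautifulSoup does. It also has a compatibility API for BeautifulSoup if you would rather not learn the lxml API.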

Ian Bicking agrees.

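There is little reason to use BeautifulSoup any more, unless you are on Google App Engine or somewhere else where anything that is not pure Python is not allowed.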

+3

aehlke 03 . '09 15:40

ShreevatsaR · Accepted Answer · 2009-01-19T23:40:08+0000

Your code works for me; this is what I get:

In [1]: import BeautifulSoup

In [2]: c = """<HTML>
   ...:     <HEAD>
   ...:         <TITLE>Title</TITLE>
   ...:         <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
   ...:     </HEAD>
   ...:     <BODY>
   ...:         ...
   ...:         ...
   ...:     </BODY>
   ...: </HTML>
   ...: """

In [3]: print BeautifulSoup.BeautifulSoup(c).prettify()
<html>
 <head>
  <title>
   Title
  </title>
  <http-equiv>
  </http-equiv>
 </head>
 <body>
  ...
        ...
 </body>
</html>


In [4]:

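This is with Python 2.5.2 and BeautifulSoup 3.0.7a. Perhaps the behaviour differs in older or newer versions? This is exactly the kind of markup BeautifulSoup is meant to handle, so I doubt it was changed deliberately at some point... Is there anything else about the problem that you have not mentioned?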
