Massage with BeautifulSoup or clean with Regex

Can someone tell me the best way to clean up bad HTML so that BeautifulSoup can handle it? Should I use BeautifulSoup's massage methods, or clean it up with regular expressions?

Thanks.

+5
2 answers

I think I should rewrite my answer.

The built-in massages are good for light damage (extra whitespace, missing slashes, etc.). I would certainly try to get by with them before doing anything more involved.

You can pass in your own massages, and I suggest you extend the default set:

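import copy, re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

# Example input: a comment opened with a single dash, plus a <br/> with no space
badString = "Foo<!-This comment is malformed.-->Bar<br/>Baz"

# Each massage is a (compiled regex, replacement function) pair
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
# Copy the defaults so the class-level list is not modified, then add our own
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)

BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz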

, , , , BeautifulSoups... , , .

+3

A massage is just a list of (regular expression, replacement function) pairs.

For example:

(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))
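To see what the pair above does on its own, here is a minimal sketch that uses only the standard `re` module. The two input strings are made up for illustration. It suggests that the pattern repairs a comment opened with a single dash and leaves an already well-formed comment alone.

```python
import re

# The pair from the answer: a compiled pattern and a replacement function.
pattern, repl = (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))

malformed = "Foo<!-This comment is malformed.-->Bar"   # illustrative input
well_formed = "Foo<!--This comment is fine.-->Bar"     # illustrative input

# sub() calls repl for every match and splices in the string it returns.
fixed = pattern.sub(repl, malformed)
untouched = pattern.sub(repl, well_formed)

print(fixed)      # the comment opener gains its second dash
print(untouched)  # no match, so the string comes back unchanged
```

The `[^-]` group is what keeps the pattern from matching `<!--`: the character after `<!-` has to be something other than a dash.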

In the _feed method in BeautifulSoup.py, the massages are applied like this:

for fix, m in self.markupMassage:
  markup = fix.sub(m, markup)
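The loop above can be reproduced without BeautifulSoup at all. The following is a rough sketch, not the library's code: `apply_massage` and `example_massage` are hypothetical names, and the list is an illustrative stand-in rather than the library's actual default set.

```python
import re

def apply_massage(markup, massage):
    """Apply each (compiled regex, replacement function) pair in order."""
    for fix, m in massage:
        markup = fix.sub(m, markup)
    return markup

# Hypothetical massage list for illustration; not BeautifulSoup's own defaults.
example_massage = [
    # Put a space before the slash of a self-closing tag: <br/> -> <br />
    (re.compile(r'(<[^<>]*)/>'), lambda match: match.group(1) + ' />'),
    # Repair a comment opened with one dash: <!-text -> <!--text
    (re.compile(r'<!-([^-])'), lambda match: '<!--' + match.group(1)),
]

cleaned = apply_massage("Foo<!-This comment is malformed.-->Bar<br/>Baz", example_massage)
print(cleaned)
```

Because each substitution runs on the output of the previous one, the order of the pairs can matter when two patterns touch the same text.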

So BeautifulSoup's massaging is regular-expression substitution under the hood, and MARKUP_MASSAGE is simply the default list of such pairs.

+2
