Beautifulsoup 4: Remove the comment tag and its contents

Question


The page I'm scraping contains the following HTML. How can I remove the comment tag along with its contents using bs4?

 <div class="foo">
 cat dog sheep goat
 <!--
 <p>NewPP limit report
 Preprocessor node count: 478/300000
 Post‐expand include size: 4852/2097152 bytes
 Template argument size: 870/2097152 bytes
 Expensive parser function count: 2/100
 ExtLoops count: 6/100
 </p>
 -->
 </div>

+12

python html html-parsing web-scraping beautifulsoup

Flint Apr 25 '14 at 17:34


3 answers

Usually, modifying the bs4 parse tree is not necessary. You can just read the div's text if that is all you need:

 soup.body.div.text
 Out[18]: '\ncat dog sheep goat\n\n'

bs4 parses the comment into a separate Comment node, so it does not appear in the div's text. However, if you really need to change the parse tree:

 from bs4 import Comment

 # Iterate over a copy: extracting nodes while walking .children directly can skip siblings
 for child in list(soup.body.div.children):
     if isinstance(child, Comment):
         child.extract()
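To make the point about separate nodes concrete, here is a minimal sketch (not part of the original answer) that inspects the children of the div and reads its text. The sample HTML, the `html.parser` choice, and the variable names `kinds` and `visible_text` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Shortened stand-in for the markup in the question (assumed sample data)
html = '<div class="foo"> cat dog sheep goat <!-- <p>hidden</p> --> </div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="foo")

# The comment is its own node, sitting between two plain text nodes
kinds = [type(child).__name__ for child in div.children]
print(kinds)

# get_text() collects ordinary strings only, so the comment body is left out
visible_text = div.get_text()
print(repr(visible_text))
```

If only the visible text is needed, this avoids touching the tree at all.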

+5

roippi Apr 25 '14 at 17:42


If you are looking for a solution for BeautifulSoup version 3, see the BS3 docs on Comment (based on this answer):

 from BeautifulSoup import BeautifulSoup, Comment  # BeautifulSoup 3 (Python 2 only)

 soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
 # Import Comment directly: the regex lookup for "if" matches nothing in this text
 for element in soup(text=lambda text: isinstance(text, Comment)):
     element.extract()
 print soup.prettify()

0

Vanjith Apr 16 '19 at 9:59


alecxe · Accepted Answer · 2014-04-25T17:43:07+0000

You can use extract() (solution based on this answer):

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.

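 from bs4 import BeautifulSoup, Comment

 data = """<div class="foo"> cat dog sheep goat <!-- <p>test</p> --> </div>"""
 soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a warning
 div = soup.find('div', class_='foo')
 # Calling a tag is shorthand for find_all(); the lambda matches only Comment nodes
 for element in div(text=lambda text: isinstance(text, Comment)):
     element.extract()
 print(soup.prettify())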

As a result, you will get your div without the comment:

 <div class="foo">
  cat dog sheep goat
 </div>
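The same extract() approach can be wrapped in a small helper that strips every comment in a document, not just those inside one div. This is a sketch under assumptions: `strip_comments` is a hypothetical name, `html.parser` is an arbitrary parser choice, and the `string=` argument assumes bs4 4.4 or later (older releases spell it `text=`).

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return the markup with every comment node removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so extracting nodes inside the loop is safe
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)


cleaned = strip_comments('<div class="foo"> cat dog <!-- <p>test</p> --> </div>')
print(cleaned)
```

Returning `str(soup)` instead of `soup.prettify()` keeps the original whitespace, which may matter if the cleaned markup is processed further.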
