BeautifulSoup 4: Remove a comment tag and its contents

The page I'm working with contains the HTML below. How can I remove the comment tag <!-- --> along with its contents using bs4?

    <div class="foo">
    cat dog sheep goat
    <!--
    <p>NewPP limit report
    Preprocessor node count: 478/300000
    Post‐expand include size: 4852/2097152 bytes
    Template argument size: 870/2097152 bytes
    Expensive parser function count: 2/100
    ExtLoops count: 6/100
    </p>
    -->
    </div>
3 answers

You can use extract() (solution based on this answer):

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.

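    from bs4 import BeautifulSoup, Comment

    data = """<div class="foo"> cat dog sheep goat <!-- <p>test</p> --> </div>"""
    soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly

    div = soup.find('div', class_='foo')
    # Calling a tag is shorthand for find_all(); match only Comment nodes.
    # On bs4 releases older than 4.4.0, pass text= instead of string=.
    for element in div(string=lambda text: isinstance(text, Comment)):
        element.extract()

    print(soup.prettify())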

As a result, you get your div without the comment:

    <div class="foo">
     cat dog sheep goat
    </div>
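The same idea extends from a single div to a whole document. The following is only a minimal sketch: `strip_comments` is a hypothetical helper name (it is not part of bs4), and it assumes the built-in `html.parser` backend and bs4 4.4.0 or later for the `string=` argument.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return the markup with every HTML comment removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so extracting nodes inside the loop is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)


sample = '<div class="foo"> cat dog sheep goat <!-- <p>test</p> --> </div>'
cleaned = strip_comments(sample)
print(cleaned)
```

Because `find_all()` collects the matches before the loop starts, the tree can be modified while iterating over the result.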

Usually there is no need to modify the bs4 parse tree. If all you need is the div's text, you can read it directly:

    soup.body.div.text
    Out[18]: '\ncat dog sheep goat\n\n'

bs4 stores the comment as a separate Comment node, and .text skips it. However, if you really need to change the parse tree:

    from bs4 import Comment

    # Loop over a copy: extracting nodes while iterating .children can skip siblings.
    for child in list(soup.body.div.children):
        if isinstance(child, Comment):
            child.extract()
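Both options from this answer can be tried side by side. This is a sketch under assumptions: it reuses the shortened sample markup from the first answer, names `html.parser` as the parser (which adds no `<body>`, so the div is looked up with `find()` instead of `soup.body.div`), and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, Comment

html = '<div class="foo"> cat dog sheep goat <!-- <p>test</p> --> </div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="foo")

# Option 1: read the text only. The tree is left untouched and
# Comment nodes are not included in the result.
text = div.get_text(strip=True)

# The comment is still part of the tree at this point.
comments_before = [c for c in div.children if isinstance(c, Comment)]

# Option 2: remove the comment nodes, iterating over a snapshot of the children.
for child in list(div.children):
    if isinstance(child, Comment):
        child.extract()

comments_after = [c for c in div.children if isinstance(c, Comment)]

print(text)
print(len(comments_before), len(comments_after))
```

Reading the text is enough when only the visible content matters; removing the nodes is needed only when the markup itself is written back out.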

If you are looking for a solution for BeautifulSoup version 3 (taken from this answer; see also the BS3 docs on Comment):

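    import re
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2 only

    soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")

    # Locate the comment by its content and take its class;
    # the pattern has to match text inside the comment.
    comment = soup.find(text=re.compile("nice"))
    Comment = comment.__class__

    for element in soup(text=lambda text: isinstance(text, Comment)):
        element.extract()

    print soup.prettify()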