Is it possible to remove script tags using BeautifulSoup?

Is it possible to remove script tags and all their contents from HTML using BeautifulSoup, or do I need to use regular expressions or something else?

+55
python html beautifulsoup
Apr 08
3 answers
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup('<script>a</script>baba<script>b</script>', 'lxml')
    >>> [s.extract() for s in soup('script')]
    >>> soup
    baba
+107
Apr 08

As stated in the official documentation, you can use the extract method to remove every subtree that matches the search.

    # BeautifulSoup 3 API (Python 2); under bs4 the import is `from bs4 import BeautifulSoup`
    import BeautifulSoup
    a = BeautifulSoup.BeautifulSoup("<html><body><script>aaa</script></body></html>")
    [x.extract() for x in a.findAll('script')]
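
For readers on the current bs4 package, the same idea can be sketched roughly as follows. This assumes Python 3 and the built-in 'html.parser' backend; the sample markup and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<html><body><script>aaa</script><p>kept</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches each matching tag from the tree and returns it,
# so the removed elements stay available if you want to inspect them.
removed = [tag.extract() for tag in soup.find_all("script")]

print(soup)     # the document without the <script> subtree
print(removed)  # the detached <script> tags
```

Because extract returns the detached tag, it is the better fit when the removed markup is still needed afterwards.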
+12
Apr 08 '11 at 17:33

An updated answer for those who may need it for future reference: the correct answer is decompose(). Other methods work too, but decompose removes the tag and its contents in place.

Usage example:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p>This is a slimy text and <i> I am slimer</i></p>', 'html.parser')
    soup.i.decompose()  # destroy the <i> tag and everything inside it
    print(str(soup))  # prints '<p>This is a slimy text and </p>'

It’s pretty useful to get rid of detritus, like 'script', 'img', etc.
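
Applied to the original question, decompose can be run over every match at once. The following is a minimal sketch, assuming bs4 with the built-in 'html.parser' backend; strip_tags is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def strip_tags(html, names=("script", "img")):
    """Return html with every tag listed in `names` removed, contents included."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names and returns a list-like result,
    # so removing tags while looping over it is safe.
    for tag in soup.find_all(list(names)):
        tag.decompose()
    return str(soup)


cleaned = strip_tags('<div><script>alert(1)</script><img src="a.png"/>Hello</div>')
print(cleaned)  # prints <div>Hello</div>
```

Unlike extract, decompose returns nothing, so it suits the case where the removed tags are simply discarded.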

+10
Oct 09 '16 at 15:11


