Using Beautiful Soup to remove html tags from a string

Question

Does anyone have a code example that shows how to use Python's Beautiful Soup to remove all HTML tags, except for a few, from a line of text?

I want to remove all JavaScript and HTML tags except:

<a></a>
<b></b>
<i></i>

And also things like:

<a onclick=""></a>

Thanks for the help. I could not find much about this on the Internet.

python beautifulsoup

ensnare Dec 12 '10 at 20:48

1 answer

unutbu · Accepted Answer · 2010-12-12T21:27:36+0000

# Beautiful Soup 3 module; with bs4 the import is `from bs4 import BeautifulSoup, Tag`
import BeautifulSoup

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)

# Walk every node in the parse tree and print only the whitelisted tags
for tag in soup.recursiveChildGenerator():
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        print(tag)

gives

<i>paragraph</i>
<a onclick="">one</a>
<i>paragraph</i>
<b>two</b>

If you just need the text content, you can change print(tag) to print(tag.string).
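
One caveat worth illustrating: .string only returns text when a tag has a single child, and is None otherwise. The sketch below is not part of the original answer; it assumes the newer bs4 package (not the Beautiful Soup 3 module used above) with the built-in "html.parser" backend, and the sample markup is made up for illustration. It compares .string with get_text(), which joins the text of all descendants.

```python
from bs4 import BeautifulSoup  # assumption: bs4, not the Beautiful Soup 3 module

# Made-up sample: the <i> tag has two children (a string and a nested <b>)
html = '<b>two</b> <i>para <b>graph</b></i>'
soup = BeautifulSoup(html, "html.parser")

# For each whitelisted tag, collect its name, .string, and get_text()
rows = [(tag.name, tag.string, tag.get_text()) for tag in soup.find_all(["b", "i"])]

for name, string, text in rows:
    # .string is None for the <i> tag because it has more than one child
    print(name, repr(string), repr(text))
```

So if the kept tags may contain nested markup, get_text() is probably the safer choice.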

If you want to remove the onclick="" attribute from the a tag, you can do this:

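for tag in soup.recursiveChildGenerator():
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        if tag.name == 'a':
            # Drop the onclick attribute from links before printing
            del tag['onclick']
        print(tag)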
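
The loops above print the kept tags one by one. If the goal is instead a single cleaned string with every other tag removed, something like the following might work. This is only a sketch, not the original answer's method: it assumes the newer bs4 package with the "html.parser" backend, and the names strip_tags, ALLOWED_TAGS, DROP_WITH_CONTENT and drop_attrs are hypothetical helpers made up for this example. It also assumes script and style blocks should disappear together with their contents, while other disallowed tags are unwrapped so their text survives.

```python
from bs4 import BeautifulSoup  # assumption: bs4, not the Beautiful Soup 3 module

ALLOWED_TAGS = {"a", "b", "i"}            # tags to keep, as listed in the question
DROP_WITH_CONTENT = ["script", "style"]   # assumption: remove these including their text


def strip_tags(html, allowed=ALLOWED_TAGS, drop_attrs=("onclick",)):
    """Return html with only the allowed tags left in place (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # First pass: delete script/style elements together with everything inside them
    for tag in soup.find_all(DROP_WITH_CONTENT):
        tag.decompose()

    # Second pass: find_all(True) returns a list of all remaining tags,
    # so it is safe to modify the tree while looping over it
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            tag.unwrap()                    # remove the tag but keep its children
        else:
            for attr in drop_attrs:
                tag.attrs.pop(attr, None)   # drop unwanted attributes if present

    return str(soup)
```

For example, strip_tags('<p>Hi <b>bold</b><script>alert(1)</script> <a href="x" onclick="y()">link</a></p>') should return 'Hi <b>bold</b> <a href="x">link</a>'. Pass drop_attrs=() to leave attributes such as onclick untouched. Note that this is a formatting cleanup, not a security-grade HTML sanitizer.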
