Remove all inline styles with BeautifulSoup

Question


I am doing HTML cleanup with BeautifulSoup, and I'm new to both Python and BeautifulSoup. I can already delete whole tags correctly, as shown below, based on an answer I found elsewhere on Stack Overflow:

[s.extract() for s in soup('script')]

But how do I remove inline styles? For example, the following:

 <p class="author" id="author_id" name="author_name" style="color:red;">Text</p> <img class="some_image" href="somewhere.com">

It should become:

 <p>Text</p> <img href="somewhere.com">

How can I remove the class, id, name and style attributes from all elements?

The answers to similar questions that I could find all suggest using a CSS parser for this rather than BeautifulSoup. But since the task is simply to remove attributes, not manipulate them, and the rule applies to all tags, I was hoping to find a way to do it all in BeautifulSoup.

+9

python css inline beautifulsoup

La Oct 18 '12 at 16:27


5 answers

I would not do this in BeautifulSoup: you will spend a lot of time trying, testing and working around edge cases.

Bleach does just that for you. http://pypi.python.org/pypi/bleach

If you do this in BeautifulSoup, I would advise you to take a whitelist approach, as Bleach does. Decide which tags may have which attributes, and strip every tag or attribute that does not match.
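As a rough sketch of that whitelist idea in BeautifulSoup 4 (assuming the built-in `html.parser`; the `ALLOWED` table, the `DROP` set and the `sanitize` name are made up for illustration and are not Bleach's API), each permitted tag can be mapped to the attributes it may keep:

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist: tag name -> attributes that tag may keep.
ALLOWED = {
    "p": set(),
    "a": {"href", "title"},
    "img": {"src", "alt"},
}

# Hypothetical list of tags removed together with their contents.
DROP = ["script", "style"]


def sanitize(html, allowed=ALLOWED, drop=DROP):
    soup = BeautifulSoup(html, "html.parser")

    # Remove unwanted tags and everything inside them first.
    for tag in soup.find_all(drop):
        tag.decompose()

    # find_all() returns a list built up front, so the tree can be edited
    # safely while looping over it.
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            tag.unwrap()  # drop the tag itself but keep its children
            continue
        keep = allowed[tag.name]
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}

    return str(soup)


print(sanitize('<div><p style="color:red;">Hi <a href="/u" onclick="x()">link</a></p></div>'))
```

This only illustrates the shape of the approach. It does not check attribute values (for example `javascript:` URLs), which is one of the edge cases a dedicated sanitizer handles.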

+10

Jonathan vanasco Oct 18 '12 at 16:47


Based on jmk's answer, I use this function to remove every attribute that is not in a whitelist.

It works in Python 2 with BeautifulSoup 3:

    def clean(tag, whitelist=()):
        tag.attrs = None
        for e in tag.findAll(True):
            # Iterate over a copy, because deleting mutates e.attrs
            for attribute in list(e.attrs):
                if attribute[0] not in whitelist:
                    del e[attribute[0]]
            # e.attrs = None  # delete all attributes
        return tag

    # Example: keep only title and href
    clean(soup, ["title", "href"])

+1

Laputaprince Jul 26 '13 at 21:33


Here is my solution for Python3 and BeautifulSoup4:

    def remove_attrs(soup, whitelist=tuple()):
        for tag in soup.findAll(True):
            for attr in [attr for attr in tag.attrs if attr not in whitelist]:
                del tag[attr]
        return soup

It accepts a whitelist of attributes that should be kept. :) If no whitelist is supplied, all attributes are deleted.

+1

techouse Apr 1 '16 at 13:19


Not perfect, but short:

 ' '.join([el.text for tag in soup for el in tag.findAllNext(whitelist)]);

0

Radio controlled Jun 12 '19 at 10:13


jmk · Accepted Answer · 2012-10-18T16:41:09+0000

You do not need to parse any CSS if you just want to remove it. BeautifulSoup lets you delete attributes from a tag directly:

    for tag in soup():
        for attribute in ["class", "id", "name", "style"]:
            del tag[attribute]
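For illustration, here is a self-contained sketch that applies the same loop to the sample from the question, assuming BeautifulSoup 4 with the built-in `html.parser`. The `strip_attributes` name and `SAMPLE` constant are hypothetical. It uses `tag.attrs.pop(name, None)` on the tag's attribute dictionary instead of `del tag[name]`, so it does not rely on how a missing attribute is handled:

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
SAMPLE = (
    '<p class="author" id="author_id" name="author_name" '
    'style="color:red;">Text</p> '
    '<img class="some_image" href="somewhere.com">'
)


def strip_attributes(html, names=("class", "id", "name", "style")):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):      # True matches every tag
        for name in names:
            tag.attrs.pop(name, None)    # no error if the attribute is absent
    return str(soup)


print(strip_attributes(SAMPLE))
```

The `href` on the `img` tag survives because it is not in the list of names to remove.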

Also, if you just want to remove whole tags (and their contents), you do not need extract(), which returns the tag. You just need decompose():

 [tag.decompose() for tag in soup("script")]
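A small sketch of the difference, assuming BeautifulSoup 4 and the built-in `html.parser` (the markup is invented for the example): `extract()` detaches the tag and hands it back, while `decompose()` removes it and destroys it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><script>a()</script><p>Hi</p><style>p{}</style></div>",
    "html.parser",
)

# extract() removes the tag from the tree and returns it for further use.
removed = soup.script.extract()
script_text = removed.string

# decompose() removes the tag and destroys it; it returns None.
result = soup.style.decompose()

remaining = str(soup)
print(script_text, result, remaining)
```

If the removed tag is never used again, `decompose()` states that intent more directly.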

Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, which has many examples.
