Remove all inline styles with BeautifulSoup

I am cleaning up HTML with BeautifulSoup, and I am new to both Python and BeautifulSoup. I already have tags being removed correctly, as shown below, based on an answer I found elsewhere on Stack Overflow:

[s.extract() for s in soup('script')] 

But how do I remove inline styles? For example, the following:

 <p class="author" id="author_id" name="author_name" style="color:red;">Text</p> <img class="some_image" href="somewhere.com"> 

It should become:

 <p>Text</p> <img href="somewhere.com"> 

How do I remove the inline class, id, name and style attributes from all elements?

The answers to similar questions that I could find all suggested using a CSS parser for this rather than BeautifulSoup. But since the task is simply to remove the attributes, not to manipulate them, and the rule applies to all tags, I was hoping to find a way to do it all within BeautifulSoup.
+9
python css inline beautifulsoup
5 answers

You do not need to parse any CSS if you just want to remove it. BeautifulSoup lets you delete attributes from a tag directly:

    for tag in soup():
        for attribute in ["class", "id", "name", "style"]:
            del tag[attribute]

Also, if you just want to remove whole tags (and their contents), you do not need extract(), which returns the removed tag. decompose(), which simply destroys it, is enough:

 [tag.decompose() for tag in soup("script")] 

Not a big difference, but just something else I found while looking through the docs. You can find more details about the API in the BeautifulSoup documentation, which has many examples.
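To try both ideas on the markup from the question, here is a minimal sketch assuming BeautifulSoup 4 with the built-in "html.parser" backend. The sample string (including the added script tag) and the variable names are only for illustration. It uses tag.attrs.pop(..., None), a plain dict operation on the tag's attribute dictionary, so a tag that lacks one of the attributes is simply skipped.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus a script tag for illustration.
html = (
    '<p class="author" id="author_id" name="author_name" '
    'style="color:red;">Text</p>'
    '<img class="some_image" href="somewhere.com">'
    '<script>alert(1)</script>'
)
soup = BeautifulSoup(html, "html.parser")

# Remove whole <script> tags together with their contents.
for tag in soup("script"):
    tag.decompose()

# Drop the unwanted attributes from every remaining tag.
for tag in soup.find_all(True):
    for attribute in ("class", "id", "name", "style"):
        # tag.attrs is a dict; pop() with a default ignores missing keys.
        tag.attrs.pop(attribute, None)

result = str(soup)
print(result)
```

Attributes that are not on the list, such as href, are left alone.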

+26

I would not do this in BeautifulSoup: you will spend a lot of time trying, testing, and working around edge cases.

Bleach does just that for you. http://pypi.python.org/pypi/bleach

If you do go with BeautifulSoup, I would advise taking a whitelist approach, as Bleach does. Decide which tags may have which attributes, and strip every tag/attribute that does not match.
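If you stay with BeautifulSoup, a per-tag whitelist could look roughly like the sketch below. It assumes BeautifulSoup 4. The ALLOWED table and the strip_attributes name are hypothetical, made up for this example, and are not taken from Bleach. It only filters attributes; removing disallowed tags would need an extra step.

```python
from bs4 import BeautifulSoup

# Hypothetical policy: tag name -> set of attributes allowed to survive.
ALLOWED = {
    "a": {"href", "title"},
    "img": {"src", "alt"},
}

def strip_attributes(soup, allowed=ALLOWED):
    """Keep only whitelisted attributes; tags not in the table keep none."""
    for tag in soup.find_all(True):
        keep = allowed.get(tag.name, set())
        # Rebuild the attribute dict with just the permitted keys.
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}
    return soup

html = (
    '<p class="x" style="color:red;">Hi '
    '<a href="/a" onclick="evil()">link</a></p>'
)
cleaned = str(strip_attributes(BeautifulSoup(html, "html.parser")))
print(cleaned)
```

Listing what is allowed, rather than what is forbidden, means an attribute you did not think of (onclick here) gets dropped by default.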

+10

Based on jmk's code, I use this function to remove attributes according to a whitelist:

Works in Python 2 with BeautifulSoup 3.

    def clean(tag, whitelist=()):
        # BeautifulSoup 3 stores attributes as a list of (name, value) pairs
        tag.attrs = []  # drop every attribute on the root tag itself
        for e in tag.findAll(True):
            # iterate over a copy, because del e[...] mutates e.attrs
            for attribute in list(e.attrs):
                if attribute[0] not in whitelist:
                    del e[attribute[0]]
        return tag

    # example: keep only title and href
    clean(soup, ["title", "href"])
+1

Here is my solution for Python3 and BeautifulSoup4:

    def remove_attrs(soup, whitelist=tuple()):
        for tag in soup.findAll(True):
            for attr in [attr for attr in tag.attrs if attr not in whitelist]:
                del tag[attr]
        return soup

It supports a whitelist of attributes that should be kept. :) If no whitelist is supplied, all attributes are removed.

+1

Not perfect, but short:

    ' '.join([el.text for tag in soup for el in tag.findAllNext(whitelist)])
0