Is there an open source Python library to sanitize HTML and remove all Javascript?

I want to write a web application that allows users to enter any HTML that may occur inside a <div> element. Then this HTML code will be displayed to other users, so I want to make sure that the site does not open to people before XSS attacks.

Is there a good library in Python that will clear all event handler attributes, <script> elements and other Javascript related to HTML or DOM tree?

I intend to use Beautiful Soup to organize my HTML to make sure that it doesn't contain any private tags or the like. But, as far as I can tell, it has no pre-packaged way to remove all Javascript.

If there is a good library in another language, this might also work, but I would prefer Python.

I did a bunch of google searches and hunted for pypi but couldn't find anything obvious.

Related

+4
source share
5 answers

A white approach to permitted tags, attributes and their values ​​is the only reliable way. Take a look at Recipe 496942: Cross-Site Scripting Protection (XSS)

What is wrong with existing markup languages, such as those used on this very site?

+4
source

As Klaus notes, a clear consensus in the community is to use BeautifulSoup to accomplish these tasks:

 soup = BeautifulSoup.BeautifulSoup(html) for script_elt in soup.findAll('script'): script_elt.extract() html = str(soup) 
+5
source

You can use BeautifulSoup . This allows you to easily move the layout structure, even if it was not well formed. I don’t know that something has been done for the order, which only works with script tags.

0
source

I would honestly look at using something like bbcode or other alternative markup.

0
source

Eric

Have you ever thought of using a SAX type parser for HTML? I'm really not sure though, nonetheless, he would have ignored the events. It would also be harder to build than using something like Beautiful Soup. A syntax error issue may also be related to SAX.

What I like to do in situations like building python objects (subclasses from the XML_Element class) from HTML parsing. Then remove any unwanted objects from the tree and finally re-serialize the objects back to html. This is not all that complicated in python.

Yours faithfully,

0
source

All Articles