How safe is it to accept a predefined set of harmless HTML tags from a request?

One of the first things I learned as a web developer was to never accept HTML from a client. (Except perhaps if I HTML-encode it.)
I use a WYSIWYG editor (TinyMCE) that outputs HTML. So far I have only used it on the admin page, but now I would like to use it on the forum as well. It has a BBCode module, but that seems incomplete. (Perhaps BBCode itself does not support everything I want.)

So here is my idea:

I allow the client to POST HTML directly. Then I check the markup for sanity (well-formedness) and remove every tag, attribute and CSS rule that is not in a predefined set of allowed tags and styles.
Obviously, I would allow whatever can be output by the subset of TinyMCE features I use.

I would allow the following tags:
span, sub, sup, a, p, ul, ol, li, img, strong, em, br

With the following attributes:
style (for everything), href and title (for a), alt and src (for img)

And the following CSS rules:
color, font, font-size, font-weight, font-style, text-decoration
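
A minimal sketch of how the tag and attribute lists above could be written down as a lookup table. The class name TagPolicy and its method names are hypothetical, chosen only for illustration; in practice the check would be applied to the nodes produced by a real HTML parser, not to raw text.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: encodes the whitelist from the question as data.
final class TagPolicy {
    // Attributes allowed on every whitelisted tag.
    private static final Set<String> GLOBAL_ATTRS = Set.of("style");

    // Tag name -> extra attributes allowed on that tag only.
    private static final Map<String, Set<String>> TAG_ATTRS = new HashMap<>();
    static {
        for (String tag : List.of("span", "sub", "sup", "p", "ul", "ol",
                                  "li", "strong", "em", "br")) {
            TAG_ATTRS.put(tag, Set.of());
        }
        TAG_ATTRS.put("a", Set.of("href", "title"));
        TAG_ATTRS.put("img", Set.of("alt", "src"));
    }

    static boolean isTagAllowed(String tag) {
        return TAG_ATTRS.containsKey(tag.toLowerCase(Locale.ROOT));
    }

    static boolean isAttrAllowed(String tag, String attr) {
        Set<String> extra = TAG_ATTRS.get(tag.toLowerCase(Locale.ROOT));
        if (extra == null) {
            return false; // unknown tag: nothing is allowed on it
        }
        String name = attr.toLowerCase(Locale.ROOT);
        return GLOBAL_ATTRS.contains(name) || extra.contains(name);
    }
}
```

Anything not listed is rejected by default, so event-handler attributes such as onclick never need to be enumerated.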

They cover everything I need for formatting, and (as far as I know) do not pose a security risk. Basically, enforcing well-formedness and leaving out all layout styles prevents someone from breaking the site layout. Disallowing script tags and the like prevents XSS.
(One exception: maybe I should allow width/height in a predefined range for images.)
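
As a rough sketch of the CSS part of this idea, the value of a style attribute could be split into declarations and filtered against the property list above. The class name StyleFilter is hypothetical, and the value check is an assumption that is deliberately stricter than real CSS: it accepts only letters, digits, spaces and a few punctuation characters, so anything containing parentheses, quotes or backslashes (url(...), expression(...), escapes) is dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: keeps only whitelisted CSS declarations.
final class StyleFilter {
    // Property list taken from the question.
    private static final Set<String> ALLOWED = Set.of(
        "color", "font", "font-size", "font-weight", "font-style", "text-decoration");

    static String filter(String style) {
        List<String> kept = new ArrayList<>();
        for (String decl : style.split(";")) {
            int colon = decl.indexOf(':');
            if (colon < 0) {
                continue; // not a "property: value" pair
            }
            String prop = decl.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = decl.substring(colon + 1).trim();
            // Conservative value check (an assumption of this sketch):
            // no parentheses, quotes, slashes or backslashes get through.
            if (ALLOWED.contains(prop) && value.matches("[A-Za-z0-9#%.,\\- ]+")) {
                kept.add(prop + ": " + value);
            }
        }
        return String.join("; ", kept);
    }
}
```

For example, filter("color: red; position: absolute; font-size: 12px") keeps only the color and font-size declarations. The strict value pattern also rejects quoted font names, which may or may not be acceptable for a given forum.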

Another advantage: this approach would save me from having to write or search for a BBCode-to-HTML converter.

What do you think? Is this safe?

(As far as I can see, Stack Overflow also allows some basic HTML in the "About me" field, so I think I'm not the first to implement this.)

EDIT:

I found this answer, which explains how to do this quite easily.
And, of course, no one should think about using regex for this.

The question itself is not related to any language or technology, but if you're interested, I am writing this application in ASP.NET.

html security tags tinymce
3 answers

It’s not clear which programming language you use or prefer, but in Java there is Jsoup, a fairly slick HTML parser API that includes, among other things, an HTML cleaner based on a customizable whitelist of HTML tags and attributes (unfortunately not CSS rules, since those are entirely outside the scope of an HTML parser). Here is an excerpt from its site.

Sanitize untrusted HTML

Problem

You want to allow untrusted users to supply HTML for output on your site (for example, as a comment submission). You need to clean this HTML to avoid cross-site scripting (XSS).

Solution

Use the jsoup HTML Cleaner with a configuration specified by a Whitelist.

    import org.jsoup.Jsoup;
    import org.jsoup.safety.Whitelist;

    String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
    String safe = Jsoup.clean(unsafe, Whitelist.basic());
    // now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>

The Whitelist class itself contains several predefined whitelists that can be useful, for example Whitelist#basic() and Whitelist#relaxed().

For .NET, by the way, there is a Jsoup port named NSoup.


For PHP, check out HTML Purifier. It filters using very advanced, customizable settings (such as allowed/forbidden tags, attributes, styles, etc.), and it covers XSS as well as tricky styles (e.g. display: none).

TinyMCE also does a little filtering, but since it runs on the client side you still should not trust it.


Of the tags you plan to allow, <a> definitely needs extra attention because of javascript: URLs. And, of course, you need to strip JavaScript event handlers from all tags.
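
A minimal sketch of the kind of check this implies for href values, assuming the value has already been extracted and entity-decoded by an HTML parser. The class name LinkFilter and the set of accepted schemes are assumptions made for illustration. Whitespace and control characters are removed before the scheme is read, because browsers ignore some of them inside URLs.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: accepts only absolute URLs with a known-safe scheme.
final class LinkFilter {
    // Scheme list is an assumption of this sketch, adjust as needed.
    private static final Set<String> ALLOWED_SCHEMES = Set.of("http", "https", "mailto");

    static boolean isSafeHref(String href) {
        // Drop spaces and C0 control characters so that values such as
        // " java\tscript:..." cannot hide their scheme.
        String cleaned = href.replaceAll("[\\x00-\\x20]", "");
        int colon = cleaned.indexOf(':');
        if (colon <= 0) {
            return false; // no scheme: relative URLs are rejected in this sketch
        }
        String scheme = cleaned.substring(0, colon).toLowerCase(Locale.ROOT);
        return ALLOWED_SCHEMES.contains(scheme);
    }
}
```

Checking against a list of accepted schemes, instead of searching for "javascript:", means other risky schemes are rejected without having to be named. The same check could be applied to src on img.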
