One of the first things I learned as a web developer was to never accept HTML from a client. (Except perhaps if I HTML-encode it.)
I use a WYSIWYG editor (TinyMCE), which outputs HTML. So far I have only used it on the admin page, but now I would like to use it on the forum as well. It has a BBCode module, but this seems incomplete. (Perhaps BBCode itself does not support everything I want.)
So here is my idea:
I let the client POST HTML directly. I then check the markup for sanity (well-formedness) and strip every tag, attribute and CSS rule that is not in a predefined set of allowed tags and styles.
Obviously, I would allow only what can be produced by the subset of TinyMCE features I use.
I would allow the following tags:
span, sub, sup, a, p, ul, ol, li, img, strong, em, br
With the following attributes:
style (for everything), href and title (for a), alt and src (for img)
And the following CSS rules:
color, font, font-size, font-weight, font-style, text-decoration
These cover everything I need for formatting and (as far as I know) pose no security risk. In principle, enforcing well-formedness and disallowing any layout styles prevents anyone from breaking the site layout. Disallowing script tags and the like prevents XSS.
(One exception: maybe I should allow width / height in a predefined range for images.)
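To make the idea concrete, here is a minimal sketch of such a whitelist filter. It is written in Python purely for illustration (my application is ASP.NET) and uses only the standard library's HTML parser. The names `sanitize`, `clean_style` and `is_safe_url` are hypothetical. It also makes a few assumptions that go beyond the lists above: href and src values are kept only if they are absolute http/https URLs (otherwise a javascript: link would slip through), script and style elements are dropped together with their content, and CSS values are restricted to a conservative character set. Treat it as a starting point under those assumptions, not a vetted sanitizer.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

# Whitelists taken from the lists above.
ALLOWED_TAGS = {"span", "sub", "sup", "a", "p", "ul", "ol", "li",
                "img", "strong", "em", "br"}
VOID_TAGS = {"img", "br"}  # tags that never get a closing tag
TAG_ATTRS = {"a": {"href", "title"}, "img": {"alt", "src"}}
ALLOWED_CSS = {"color", "font", "font-size", "font-weight",
               "font-style", "text-decoration"}
# Assumptions that are not part of the original lists:
DROP_CONTENT = {"script", "style"}  # discard these tags and their text
SAFE_SCHEMES = {"http", "https"}    # only absolute http(s) URLs survive
SAFE_CSS_CHARS = set(" #%.,-")      # allowed besides letters and digits


def clean_style(style):
    """Keep only whitelisted CSS declarations with harmless-looking values."""
    kept = []
    for decl in style.split(";"):
        prop, sep, value = decl.partition(":")
        prop, value = prop.strip().lower(), value.strip()
        if not sep or not value or prop not in ALLOWED_CSS:
            continue
        # Rejects url(...), expression(...), backslash escapes, quotes, etc.
        if all(ch.isalnum() or ch in SAFE_CSS_CHARS for ch in value):
            kept.append(f"{prop}: {value}")
    return "; ".join(kept)


def is_safe_url(url):
    try:
        return urlparse(url.strip()).scheme.lower() in SAFE_SCHEMES
    except ValueError:
        return False


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # sanitized output fragments
        self.open_tags = []  # allowed tags that are currently open
        self.skip = 0        # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drop the tag; its text is still escaped and kept
        parts, seen = [tag], set()
        for name, value in attrs:
            value = value or ""
            if name in seen:
                continue
            if name == "style":
                value = clean_style(value)
                if not value:
                    continue
            elif name not in TAG_ATTRS.get(tag, ()):
                continue
            elif name in ("href", "src") and not is_safe_url(value):
                continue
            seen.add(name)
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif tag in self.open_tags:
            # Close any tags left open inside this one, then the tag itself.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    while parser.open_tags:  # close whatever the client left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

For example, `sanitize('<span style="color: red; position: absolute">x</span>')` keeps the span and the color but drops the position rule. Because the output is rebuilt from parsed tokens rather than patched in place, unclosed or mismatched tags are closed automatically, which covers the well-formedness check as well. A width/height exception for images would be one more entry in `TAG_ATTRS` plus a numeric range check.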
Another advantage: this approach would save me from having to write or find a BBCode-to-HTML converter.
What do you think? Is this safe?
(As far as I can see, Stack Overflow also allows some basic HTML in the "About me" field, so I think I'm not the first to implement this.)
EDIT:
I found this answer that explains how to do this quite easily.
And, of course, no one should even think about using regex for this.
The question itself is not related to any language or technology, but if you're interested, I am writing this application in ASP.NET.