The only way to guarantee that some HTML markup does not contain JavaScript is to filter it of all unsafe HTML tags and attributes, in order to prevent Cross-Site Scripting (XSS).
However, there is generally no reliable way of explicitly removing all unsafe elements and attributes by name, since certain browsers may interpret ones you were not even aware of at design time, and so open a security hole for attackers. This is why you are much better off taking a whitelist approach rather than a blacklist one. That is, allow only the HTML tags you are sure are safe, and strip all others by default. Even a single carelessly permitted tag can make your site vulnerable to XSS.
Whitelist (good approach)
See the HTML sanitisation article for some specific examples of why you should use the whitelist rather than the blacklist. Quote from this page:
Here is a partial list of potentially dangerous HTML tags and attributes:
- script, which can contain malicious script
- applet, embed and object, which can automatically download and execute malicious code
- meta, which can contain malicious redirects
- onload, onunload and all other on* attributes, which can contain malicious script
- style, link and the style attribute, which can contain malicious script
Here is another useful page that suggests a set of HTML tags and attributes, along with CSS attributes, that are usually safe to allow, together with recommended practices.
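To make the whitelist idea concrete, here is a minimal sketch in Python using only the standard library's html.parser module. The particular tags, attributes and URL prefixes in the whitelist are assumptions picked for illustration, not a vetted policy, and the names (WhitelistSanitizer, sanitize) are hypothetical. It only shows the "allow what you know, drop everything else" principle; for production use, prefer a maintained sanitisation library.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist, chosen for illustration only.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_URL_PREFIXES = ("http://", "https://", "mailto:")
VOID_TAGS = {"br"}                      # tags that take no closing tag
DROP_CONTENT_TAGS = {"script", "style"}  # drop the text inside these as well


class WhitelistSanitizer(HTMLParser):
    """Re-emits only whitelisted tags and attributes; everything else is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self._skip_depth += 1
        if tag not in ALLOWED_TAGS:
            return  # unknown or unsafe tag: removed by default
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # this also drops every on* handler and style
            value = (value or "").strip()
            if name == "href" and not value.lower().startswith(ALLOWED_URL_PREFIXES):
                continue  # e.g. javascript: URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS and self._skip_depth:
            self._skip_depth -= 1
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth:
            return
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

Because the default is to drop, a tag or attribute the author never heard of is removed without anyone having to list it.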
Blacklist (usually a bad approach)
Although many sites in the past (and currently) use the blacklist approach, there is almost never a real need for it. (The security risks invariably outweigh the limitations that whitelisting places on the formatting capabilities granted to the user.) You need to be well aware of its shortcomings.
For example, this page provides a list of what are supposedly "all" the HTML tags you might want to strip out. Even from a brief look, you should notice that it contains a very limited number of element names; a browser could easily include a proprietary tag that unwittingly allows scripts to run on your page, which is essentially the main problem with a blacklist.
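The weakness described above can be illustrated with a short Python sketch. The blacklist below is a hypothetical one assembled for the example (it is not taken from the linked page), and BlacklistFilter and blacklist_filter are made-up names. The filter removes exactly the tags it knows about and passes everything else through, including an element and event handler its author did not anticipate.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical blacklist: only the tag names its author happened to think of.
BLACKLISTED_TAGS = {"script", "applet", "embed", "object", "meta"}


class BlacklistFilter(HTMLParser):
    """Naive filter: re-emits everything except tags named in the blacklist."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in BLACKLISTED_TAGS:
            return
        rendered = []
        for name, value in attrs:
            rendered.append(f' {name}="{escape(value or "", quote=True)}"')
        self.out.append(f"<{tag}{''.join(rendered)}>")

    def handle_endtag(self, tag):
        if tag not in BLACKLISTED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def blacklist_filter(markup):
    parser = BlacklistFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


# The listed tag is removed as intended...
blocked = blacklist_filter('<script src="x.js"></script>ok')
# ...but an unlisted element keeps its event-handler attribute.
leaked = blacklist_filter('<details open ontoggle="alert(1)">x</details>')
```

Here `leaked` still carries the ontoggle attribute, because neither the element nor the attribute was on the list.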
Finally, I would strongly recommend that you use an HTML DOM library (such as the well-known HTML Agility Pack for .NET), as opposed to regular expressions, to perform the cleaning/whitelisting, since it will be significantly more reliable. (It is quite possible to create some pretty crazy obfuscated HTML that can fool regular expressions! A proper HTML reader/writer also makes coding the system much easier.)
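As a small illustration of how obfuscated markup can fool a regular expression, consider the Python sketch below. The pattern and the function name strip_scripts_with_regex are assumptions made up for the example. A single pass that deletes script elements ends up assembling a new script element from the fragments left behind.

```python
import re

# Illustrative pattern only: remove <script ...>...</script> in one pass.
SCRIPT_RE = re.compile(r"<script.*?>.*?</script>", re.IGNORECASE | re.DOTALL)


def strip_scripts_with_regex(markup):
    return SCRIPT_RE.sub("", markup)


# The two inner script elements are deleted, and the surrounding pieces
# "<scr" + "ipt>" join up into a fresh script tag.
payload = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"
result = strip_scripts_with_regex(payload)
```

A filter that parses the markup and rebuilds its output from whitelisted tokens does not depend on pattern matching over raw text in this way.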
I hope this gives you a decent overview of what you need to design in order to fully (or at least as far as possible) prevent XSS, and of how critical it is that HTML sanitisation is performed with the unknown factor in mind.
Noldorin