JavaScript filtering

I have a rich text editor that sends HTML to the server. This HTML is then displayed to other users. I want to make sure there is no JavaScript in this HTML. Is there any way to do this?

Also, I use ASP.NET if that helps.

+6
javascript html xss sanitization filtering
6 answers

The simplest approach would be to strip out the tags with a regular expression. The problem is that you can do plenty of unpleasant things without script tags (for example, embedding dodgy images, or linking to other sites that contain nasty JavaScript). Disabling HTML completely, by converting the less-than and greater-than characters to their HTML entity forms (like &lt;), can also be an option.
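For the "disable HTML completely" route, here is a minimal sketch using .NET's built-in encoder, since the question mentions ASP.NET (HttpUtility lives in System.Web; on newer frameworks, System.Net.WebUtility.HtmlEncode is the equivalent):

    using System;
    using System.Web; // classic ASP.NET; use System.Net.WebUtility on .NET Core

    class EncodeDemo
    {
        static void Main()
        {
            string userHtml = "<script>alert('xss')</script><b>hello</b>";

            // Every markup-significant character becomes an entity, so the
            // browser renders the input as literal text instead of parsing it.
            string safeText = HttpUtility.HtmlEncode(userHtml);

            Console.WriteLine(safeText);
            // e.g. &lt;script&gt;alert(&#39;xss&#39;)&lt;/script&gt;&lt;b&gt;hello&lt;/b&gt;
        }
    }

The obvious trade-off: users lose all formatting, which rather defeats the point of a rich text editor.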

If you want a more powerful solution, in the past I have used AntiSamy to sanitize incoming text so that it is safe to display.

-2

The only way to guarantee that a piece of HTML markup contains no JavaScript is to filter out all unsafe HTML tags and attributes, in order to prevent Cross-Site Scripting (XSS).

However, as a rule, there is no reliable way to explicitly remove all unsafe elements and attributes by name, since browsers may interpret ones that you did not even know about at design time, thus opening a hole for attackers. That is why you are much better off using a whitelist rather than a blacklist. In other words, allow only those HTML tags you are sure are safe, and delete all others by default. A single wrongly permitted tag is enough to make your site vulnerable to XSS.


Whitelist (good approach)

See the HTML sanitisation article for some specific examples of why you should use a whitelist rather than a blacklist. Quoting from that page:

Here is a partial list of potentially dangerous HTML tags and attributes:

  • script, which may contain malicious script
  • applet, embed and object, which can automatically download and execute malicious code
  • meta, which may contain malicious redirects
  • onload, onunload and all other on* attributes, which may contain malicious script
  • style, link and the style attribute, which may contain malicious script

Here is another useful page that offers a set of HTML tags and attributes, as well as CSS attributes, that are usually safe to allow, along with recommended practices.

Blacklist (usually a bad approach)

Even though many sites have used (and still use) the blacklist approach, there is almost never a real need for it. (The security risks invariably outweigh the slight loss in formatting capabilities that a whitelist's restrictions impose on the user.) You need to be aware of its shortcomings.

For example, this page provides a list of what are supposedly "all" HTML tags you might want to remove. Just glancing over it briefly, you should notice that it contains a very limited number of element names; a browser could easily support a proprietary tag not on the list that inadvertently allows scripts to run on your page, which is the fundamental problem with the blacklist approach.


Finally, I highly recommend that you use an HTML DOM library (such as the well-known HTML Agility Pack for .NET), as opposed to regular expressions, to do the cleanup/whitelisting, since it will be much more reliable. (It is quite possible to craft some pretty crazy, messy HTML that can fool regular expressions! A proper HTML reader/writer makes coding the system much easier.)
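As a rough sketch of what whitelist cleanup with the HTML Agility Pack can look like (the tag and attribute sets here are illustrative, not a vetted safe list, and the class and method names are mine):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using HtmlAgilityPack; // HtmlAgilityPack NuGet package

    static class HtmlWhitelist
    {
        // Illustrative whitelist only: vet your own lists before relying on them.
        static readonly HashSet<string> AllowedTags = new HashSet<string>(
            new[] { "b", "i", "em", "strong", "p", "ul", "ol", "li", "br", "a" },
            StringComparer.OrdinalIgnoreCase);

        static readonly HashSet<string> AllowedAttributes = new HashSet<string>(
            new[] { "href" }, StringComparer.OrdinalIgnoreCase);

        public static string Sanitize(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Materialize the list first, since the loop mutates the tree.
            foreach (var node in doc.DocumentNode.Descendants().ToList())
            {
                if (node.NodeType != HtmlNodeType.Element)
                    continue;

                if (!AllowedTags.Contains(node.Name))
                {
                    // Remove the disallowed element; 'true' keeps its children,
                    // which were already captured above and get vetted in turn.
                    // (A production filter would drop script/style content
                    // entirely rather than leave it behind as plain text.)
                    node.ParentNode.RemoveChild(node, true);
                    continue;
                }

                // Strip every attribute not on the whitelist (onclick, style, ...).
                foreach (var attr in node.Attributes.ToList())
                {
                    if (!AllowedAttributes.Contains(attr.Name))
                        attr.Remove();
                }
            }
            return doc.DocumentNode.InnerHtml;
        }
    }

Note that even an allowed href attribute still needs its URL scheme checked (javascript: and data: URLs), as another answer below describes.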

Hopefully this gives you a decent overview of what you need to implement in order to prevent XSS completely (or at least as much as possible), and of how important it is that HTML sanitization is performed with the unknown in mind.

+10

As Lee Theobald noted, this is a very dangerous plan. You cannot, by definition, ever create "safe" HTML by filtering/blacklisting, since the user can put things into the HTML that you did not think of (or that do not even exist in your browser version, but do in others).

The only safe way is a whitelist approach, i.e. strip everything except plain text and certain specific HTML constructs. This, by the way, is what stackoverflow.com does :-).

+4

Here is how I do it, using a whitelisting approach (JavaScript and Python code):

https://github.com/dcollien/FilterHTML

You define a specification of a subset of valid HTML, and only markup matching that specification gets through the filter. There are also options for sanitizing URL attributes: allowing only specific schemes (e.g. http:, ftp:, etc.) and disallowing those that can cause XSS/JavaScript problems (e.g. javascript: or even data:).

edit: This will not give you 100% security out of the box for all situations, but used wisely, and in combination with a few other techniques (e.g. checking that URLs point to the same domain and have the correct content type, etc.), it may be what you need.
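To make the URL-scheme idea concrete: this is not FilterHTML's actual API, just a generic C# sketch (names are mine) of the same whitelist-of-schemes check:

    using System;
    using System.Collections.Generic;

    static class UrlSchemeFilter
    {
        // Illustrative scheme whitelist; extend with mailto: etc. as needed.
        static readonly HashSet<string> AllowedSchemes =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "http", "https", "ftp" };

        // Returns true when an href/src value is safe to keep.
        public static bool IsAllowedUrl(string rawUrl)
        {
            if (!Uri.TryCreate(rawUrl, UriKind.Absolute, out Uri uri))
                return false; // relative URLs could be allowed here if wanted

            // Uri lowercases the scheme, so tricks like "JaVaScRiPt:alert(1)"
            // do not slip past; javascript: and data: simply are not listed.
            return AllowedSchemes.Contains(uri.Scheme);
        }
    }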

+3

If you want the HTML to be displayed so that users can see the HTML code itself, replace all '<', '>', '&' and quote characters in the string with their HTML entity forms. For example, '<' becomes '&lt;'.

If you want the HTML to work, the easiest way is to strip out all HTML and JavaScript and then re-allow only specific HTML. Unfortunately, there is almost no surefire way to remove all JavaScript while still allowing arbitrary HTML.

For example, you might allow images. However, you may not realize that in some older browsers you could write

 <img src="javascript:evilScript()"> 

and the browser would run that script. It gets very dangerous very fast. That is why most sites, such as Wikipedia and this one, use a special markup language instead: it makes simple formatting easy while keeping malicious JavaScript out.

+2

You can check how some browser-based WYSIWYG editors, such as TinyMCE, handle this. They usually strip JavaScript and seem to do a decent job of it.

-1
