How to allow specific characters using OWASP HTML Sanitizer?

Question


I use OWASP HTML Sanitizer to prevent XSS attacks in my web application. For many fields that should be plain text, the sanitizer does more than I expect.

For instance:

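    import org.owasp.html.HtmlPolicyBuilder;
    import org.owasp.html.PolicyFactory;

    // A policy with no allowed elements strips every tag.
    HtmlPolicyBuilder htmlPolicyBuilder = new HtmlPolicyBuilder();
    PolicyFactory stripAllTagsPolicy = htmlPolicyBuilder.toFactory();
    stripAllTagsPolicy.sanitize("a+b");              // returns a&#43;b
    stripAllTagsPolicy.sanitize("foo@example.com");  // returns foo&#64;example.com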

When a field such as an email address contains a +, for example foo+bar@gmail.com, the wrong data ends up in the database. So, two questions:

1. Are the characters + - @ dangerous in themselves? Do they really need to be encoded?
2. How do I configure OWASP HTML Sanitizer to allow certain characters, such as + - @?

An answer to question 2 is more important to me.

+6

java security xss sanitization owasp

ams 24 sept '12 at 3:26


3 answers

You can use the ESAPI API to filter specific characters. If you want to allow specific HTML elements or attributes instead, you can use allowElements and allowAttributes, as in the following policy.

    // Define the policy.
    Function<HtmlStreamEventReceiver, HtmlSanitizer.Policy> policy = new HtmlPolicyBuilder()
        .allowElements("a", "p")
        .allowAttributes("href").onElements("a")
        .toFactory();

    // Sanitize your output.
    HtmlSanitizer.sanitize(myHtml, policy.apply(myHtmlStreamRenderer));

+2

Mahendra Nov 17 '14 at 3:02


Honestly, you really should validate all user input against a whitelist. If it's an email address, use OWASP ESAPI or a similar library to check the input against its Validator and an email regular expression.

If the input passes the whitelist, go ahead and store it in the database. When displaying the text back to the user, you should always HTML-encode it.

Your blacklist approach is not recommended by OWASP and may be bypassed by someone who seeks to attack your users.
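The whitelist check described above could look roughly like the following sketch. It uses only java.util.regex rather than ESAPI; the class name EmailWhitelist, the method isValidEmail, and the pattern itself are assumptions made for illustration, and the pattern is deliberately strict rather than a complete RFC 5322 validator.

```java
import java.util.regex.Pattern;

public class EmailWhitelist {
    // Hypothetical whitelist: letters, digits and . _ % + - in the local part,
    // followed by @ and a dotted domain. Anything else is rejected.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    /** Returns true only if the whole input matches the whitelist pattern. */
    public static boolean isValidEmail(String input) {
        return input != null
                && input.length() <= 254
                && EMAIL.matcher(input).matches(); // matches() checks the entire string
    }

    public static void main(String[] args) {
        System.out.println(isValidEmail("foo+bar@gmail.com"));                      // prints true
        System.out.println(isValidEmail("<script>alert(1)</script>@example.com")); // prints false
    }
}
```

Because the value is validated rather than rewritten, an address such as foo+bar@gmail.com reaches the database unchanged.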

+1

bsimic Sep 27 '12 at 12:19


sstendal · Accepted Answer · 2012-09-26T21:31:16+0000

The danger in XSS is that one user can put HTML code into their input, which you later insert into a web page that is sent to another user.

Basically, you can follow two strategies if you want to protect against this. You can either remove all dangerous characters from user input when they enter your system, or you can html-encode dangerous characters when you later write them back to the browser.

An example of the first strategy:

User enters data (with HTML code)
Server removes all dangerous characters
Modified data is stored in the database
Some time later, the server reads the modified data from the database
The server inserts the modified data into a web page sent to another user

An example of the second strategy:

User enters data (with HTML code)
Unmodified data, including dangerous characters, is stored in the database
Some time later, the server reads the unmodified data from the database
The server HTML-encodes the dangerous data and inserts it into a web page sent to another user

The first strategy is simpler in one respect: data is usually written once but read many times, so you only have to clean it at a single point. However, it is also more problematic because it potentially destroys data. It is especially troublesome if you need the data for anything other than sending it back to the browser later (for example, using an email address to actually send email). It makes it harder, for example, to search the database, include the data in a PDF report, insert the data into an email, and so on.

The second strategy has the advantage of not destroying the input, so you have more freedom in how you use the data later. However, it can be harder to verify that you HTML-encode all user-submitted data that is sent to the browser. The solution to your specific problem would be to HTML-encode the email address when (or if) you ever put that email address on a web page.
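As a minimal sketch of encoding at output time, the helper below escapes the characters that are significant in HTML text and quoted attribute values. The class name HtmlOutputEncoder and the method escapeHtml are hypothetical and written only for this illustration; in real code a maintained encoder library would likely be a safer choice than a hand-rolled one.

```java
public class HtmlOutputEncoder {
    /** Escapes & < > " and ' so the value is rendered as text, not markup. */
    public static String escapeHtml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The address is stored unmodified and encoded only when written to the page.
        String stored = "foo+bar@gmail.com";
        System.out.println("<td>" + escapeHtml(stored) + "</td>");
        // prints <td>foo+bar@gmail.com</td>
        System.out.println(escapeHtml("<script>alert(1)</script>"));
        // prints &lt;script&gt;alert(1)&lt;/script&gt;
    }
}
```

Note that + and @ pass through untouched, so the stored address stays usable for sending mail, while markup in other fields is neutralised on the way out.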

The XSS problem is one instance of a more general problem that occurs when user-submitted data is mixed with control code. SQL injection is another instance of the same problem. In both cases, user-submitted data is interpreted as instructions rather than as data. A third, less well-known example is mixing user-submitted data into an email: the data may contain strings that the email server interprets as instructions. The “dangerous character” in this scenario is a line break followed by “From:”.
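For the email case, one possible guard is to refuse any header value that contains a line break, applied at the point where the header is built. This is only a sketch under that assumption; the class MailHeaderGuard and the method requireSingleLine are hypothetical names, not part of any mail library.

```java
public class MailHeaderGuard {
    /**
     * Returns the value unchanged if it is a single line; throws if it contains
     * CR or LF, which would let the value start a new header line.
     */
    public static String requireSingleLine(String headerValue) {
        if (headerValue.indexOf('\r') >= 0 || headerValue.indexOf('\n') >= 0) {
            throw new IllegalArgumentException("line break in header value");
        }
        return headerValue;
    }

    public static void main(String[] args) {
        System.out.println("Subject: " + requireSingleLine("Hello from foo+bar@gmail.com"));
        // prints Subject: Hello from foo+bar@gmail.com
        try {
            requireSingleLine("Hi\r\nFrom: admin@example.com");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
            // prints rejected: line break in header value
        }
    }
}
```

The same input that is harmless on a web page is dangerous here, which is why the check belongs next to the code that writes the header rather than at the point where the data first enters the system.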

It would be impossible to check all input for every control character or character sequence that might be interpreted as instructions by some application in the future. The only lasting solution is to sanitize potentially dangerous data at the point where you actually use it.
