HTML Sanitization - Bad Markup?

I was looking at some discussions on sanitizing HTML markup strings for redisplay on a page (e.g. blog comments). Previously, I just escaped all markup unconditionally before re-rendering it.

Does anyone know of any solutions that go beyond just stripping "unsafe" tags?

What do you do if the markup is malformed? For example, how do you prevent an unclosed <b> tag from bolding all the text that follows it on the page?

Stack Overflow seems to handle this.

<b>Unclosed 'b' tag example

Thanks.

+4
3 answers

Stack Overflow uses Markdown, which is very similar to Textile.

Textile is more or less guaranteed to emit valid (X)HTML, which mitigates many of the common problems with sanitizing user input.

+4

Check this code:

Sanitize HTML. I think Stack Overflow uses it somewhere...

A method to sanitize any potentially dangerous tags from the provided raw HTML input using a whitelist approach, leaving the "safe" HTML tags.
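
To make the whitelist idea concrete, here is a minimal sketch in Python using only the standard library. It is not the linked code; the names `ALLOWED_TAGS`, `WhitelistSanitizer` and `sanitize` are made up for illustration, and the whitelist itself is just an assumption. Whitelisted tags are kept with all attributes stripped, every other tag is dropped, and text is re-escaped.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration; a real one is chosen per site.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br", "code", "pre"}
VOID_TAGS = {"br"}  # elements that must not get a closing tag


class WhitelistSanitizer(HTMLParser):
    """Keeps whitelisted tags (without attributes) and drops all other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # Attributes are dropped on purpose (onclick, style, ...).
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again before output.
        self.parts.append(escape(data))


def sanitize(raw_html):
    parser = WhitelistSanitizer()
    parser.feed(raw_html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

Note that this only filters tags. It does not repair unclosed or badly nested tags, so it answers the "unsafe tags" half of the question, not the malformed markup half.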

0

The Html Agility Pack is probably a good starting point, as it claims to be very tolerant of malformed HTML. On top of that, you may want to define some rules for further sanitization. Finally, you serialize the resulting DOM back to regular HTML.
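
The Html Agility Pack is a .NET library, so the following is not its API. It is a rough Python sketch of the same parse-then-reserialize idea using the standard library's tolerant `html.parser`, with hypothetical names (`TagBalancer`, `balance`). It tracks open tags on a stack, drops stray closing tags, and closes whatever is still open at the end, which is what stops an unclosed <b> from leaking into the rest of the page.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img"}  # elements that never take a closing tag


class TagBalancer(HTMLParser):
    """Re-serializes markup so that every opened tag is closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        # Attributes are omitted to keep the sketch short.
        self.out.append(f"<{tag}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag with no matching opener: drop it
        # Close everything opened after the matching tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def balance(raw_html):
    parser = TagBalancer()
    parser.feed(raw_html)
    parser.close()  # flush any buffered text
    # Close whatever the author forgot to close, innermost first.
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(balance("<b>Unclosed 'b' tag example"))
```

This sketch keeps every tag it sees, so on its own it is a repair step, not a sanitizer. It would still need a tag and attribute filter before the output is safe to display.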

I ran into the same issue as you and created a rule-based sanitizer built on the Html Agility Pack. It lets you flatten or remove tags, convert tags (for example, replace b with strong), and restrict which attributes are allowed. Take a look at the HtmlRuleSanitizer source code for ideas, or just grab the NuGet package if you want to get going quickly.
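
To give an idea of what such rules can look like, here is a small Python sketch. It is not HtmlRuleSanitizer's API (that library is C#); the rule tables `RENAME`, `REMOVE`, `ALLOWED` and the function `apply_rules` are hypothetical names chosen for illustration. Tags in `REMOVE` disappear together with their content, tags in `RENAME` are converted, unknown tags are flattened (tag dropped, text kept), and only listed attributes survive.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical rule tables, for illustration only.
RENAME = {"b": "strong", "i": "em"}  # convert one tag into another
REMOVE = {"script", "style"}         # drop the tag and its content
ALLOWED = {"strong": set(), "em": set(), "p": set(), "a": {"href"}}
# Any other tag is flattened: the tag is dropped, its text is kept.


def attribute_ok(tag, name, value):
    """Allow only listed attributes, and only http(s) URLs in href."""
    if name not in ALLOWED[tag]:
        return False
    if name == "href":
        url = (value or "").strip().lower()
        return url.startswith(("http://", "https://"))
    return True


class RuleSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.removing = 0  # > 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        if tag in REMOVE:
            self.removing += 1
            return
        tag = RENAME.get(tag, tag)
        if self.removing or tag not in ALLOWED:
            return
        kept = [(n, v or "") for n, v in attrs if attribute_ok(tag, n, v)]
        rendered = "".join(f' {n}="{escape(v)}"' for n, v in kept)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in REMOVE:
            self.removing = max(0, self.removing - 1)
            return
        tag = RENAME.get(tag, tag)
        if not self.removing and tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.removing:
            self.out.append(escape(data, quote=False))


def apply_rules(raw_html):
    parser = RuleSanitizer()
    parser.feed(raw_html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

For example, `apply_rules('<div><b class="x">hi</b></div>')` gives `<strong>hi</strong>`: the div is flattened, b becomes strong, and the class attribute is dropped. The sketch does not balance unclosed tags, so malformed input still needs a separate repair step.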

0