Is my anti-XSS method OK for allowing user-submitted HTML in PHP?

I am trying to find a good way to accept user-submitted data, in this case allowing HTML, while keeping it as safe and as fast as possible.

I know EVERY SINGLE PERSON on this site seems to recommend http://htmlpurifier.org . I partially agree. HTML Purifier has the best open source code for filtering user-submitted HTML, but it is very heavyweight and not well suited to high-traffic sites. I may use it someday, but for now my goal is to find a lighter method.

I have been using the 2 functions below for about two and a half years without any problems, but I think it's time to get some input from the professionals here, if they are willing to help.

The first function is called FilterHTML($string); it is run before the user data is saved to the MySQL database. The second function is called format_db_value($text, $nl2br = false), and I use it on the page where I display the data submitted by the user.

Below the 2 functions is a set of XSS payloads that I found at http://ha.ckers.org/xss.html and ran through these 2 functions to see how effective my code is. I'm fairly pleased with the results: they blocked every payload I tried, but I know it is still not 100% safe.

Can you guys look it over and give me some advice on the code itself, or even on the whole HTML filtering concept?

I would like to move to a whitelist someday, but HTML Purifier is the only useful solution I have found for that, and as I said, it is not as lightweight as I would like.

    function FilterHTML($string) {
        if (get_magic_quotes_gpc()) {
            $string = stripslashes($string);
        }
        $string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
        // convert decimal
        $string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
        // convert hex
        $string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
        //$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
        $string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
        $string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
        //$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
        $string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
        $string = preg_replace('#([az]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
        $string = preg_replace('#([az]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
        $string = preg_replace('#([az]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*@([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //@IMPORT
        $string = preg_replace('#([az]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
        $string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
        $string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
        $string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
        //$string = str_replace('left:0px; top: 0px;','',$string);
        do {
            $oldstring = $string;
            //bgsound|
            $string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
        } while ($oldstring != $string);
        return addslashes($string);
    }

Below is the function that displays the user-submitted content on a web page.

    function format_db_value($text, $nl2br = false) {
        if (is_array($text)) {
            $tmp_array = array();
            foreach ($text as $key => $value) {
                $tmp_array[$key] = format_db_value($value);
            }
            return $tmp_array;
        } else {
            $text = htmlspecialchars(stripslashes($text));
            if ($nl2br) {
                return nl2br($text);
            } else {
                return $text;
            }
        }
    }

Below are the payloads from ha.ckers.org; none of them seem to get past my functions above.

I have not tried every one on that site, since there are many more; these are just some of them. The original payload is on the top line of each pair, and the result after running it through my functions is on the line below.

    <IMG SRC="javascript:alert(\'XSS\');"><b>hello</b> hiii
    <IMG SRC=...alert('XSS');"><b>hello</b> hiii

    <IMG SRC=JaVaScRiPt:alert('XSS')>
    <IMG SRC=...alert('XSS')>

    <IMG SRC=javascript:alert(String.fromCharCode(88,83,83))>
    <IMG SRC=...alert(String.fromCharCode(88,83,83))>

    <IMG SRC=&#106;&#97;&#118;&#97;&#115;&#99;&#114;&#105;&#112;&#116;&#58;&#97;&#108;&#101;&#114;&#116;&#40;&#39;&#88;&#83;&#83;&#39;&#41;>
    <IMG SRC=...alert('XSS')>

    <IMG SRC=&#0000106&#0000097&#0000118&#0000097&#0000115&#0000099&#0000114&#0000105&#0000112&#0000116&#0000058&#0000097&#0000108&#0000101&#0000114&#0000116&#0000040&#0000039&#0000088&#0000083&#0000083&#0000039&#0000041>
    <IMG SRC=F MLEJNALN !>

    <IMG SRC=&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29>
    <IMG SRC=...alert('XSS')>

    <IMG SRC="jav&#x0A;ascript:alert('XSS');">
    <IMG SRC=...alert('XSS');">

    perl -e 'print "<IMG SRC=javascript:alert("XSS")>";' > out
    perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out

    <BODY onload!#$%&()*~+-_.,:; ?@ [/|\]^`=alert("XSS")>
    ...

    <iframe src=http://ha.ckers.org/scriptlet.html <
    ...

    <LAYER SRC="http://ha.ckers.org/scriptlet.html"></LAYER>
    ......

    <META HTTP-EQUIV="Link" Content="<http://ha.ckers.org/xss.css>; REL=stylesheet">
    ...; REL=stylesheet">

    <IMG STYLE="xss:...(alert('XSS'))">
    <IMG STYLE="xss:expr/*XSS*/ession(alert('XSS'))">

    <XSS STYLE="xss:...(alert('XSS'))">
    <XSS STYLE="xss:expression(alert('XSS'))">

    <EMBED SRC=" A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>
    <EMBED SRC=" A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>

    <IMG SRC = " j a v a s c r i p t : a l e r t ( ' X S S ' ) " >
    <IMG SRC =... a l e r t ( ' X S S ' ) " >
+3
4 answers
+2

The only way to be sure is to whitelist the tags and attributes that users may use and to write strict regular expressions that check for valid attribute values. If you want to allow attributes such as "style", you take on additional complexity.

A blacklist alone can only make an attack harder for some people; it won't stop someone who uses a technique that you have not heard about.
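As a small illustration of that point, here is a sketch with a hypothetical `naiveStrip()` helper (it is not the asker's `FilterHTML`, just an assumed single-pass blacklist) showing how a nested payload reassembles itself once the blacklisted string is removed:

```php
<?php
// Hypothetical single-pass blacklist filter, for illustration only.
function naiveStrip(string $html): string
{
    // Removes every literal "<script>" found in one pass over the input.
    return str_ireplace('<script>', '', $html);
}

// The inner "<script>" is removed, which joins "<scr" and "ipt>" together.
$payload = '<scr<script>ipt>alert(1)</script>';
$result  = naiveStrip($payload);

echo $result, PHP_EOL; // <script>alert(1)</script>
```

The do/while loop in `FilterHTML` guards against this particular trick, but each such guard only covers a technique that is already known.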

I would try using a regexp to add the missing closing tags to what users entered, replace <br> with <br />, and so on, then parse the result with SimpleXML. Then I would iterate over it and remove any tag that is not in the whitelist, any attribute that is not in the whitelist for that tag, and any attribute whose value does not match the exact regular expression for that attribute. Finally, I would use asXML() to get the text back. I would start with a minimal set of tags and attributes and add new ones only when necessary, taking extra care with anything that might contain a URL.
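The per-attribute check described above might look roughly like the following sketch. The `$whitelist` contents, the regular expressions, and the `isAllowedAttribute()` name are all assumptions made for illustration; a real list would be tuned to the tags the site actually needs.

```php
<?php
// Hypothetical whitelist: tag => [attribute => regex the whole value must match].
$whitelist = [
    'a'   => [
        'href'  => '#\Ahttps?://[^\s"\'<>]+\z#i',   // only http/https URLs
        'title' => '#\A[\w .,!?-]{0,100}\z#u',      // short plain text
    ],
    'img' => [
        'src' => '#\Ahttps?://[^\s"\'<>]+\z#i',
        'alt' => '#\A[\w .,!?-]{0,100}\z#u',
    ],
    'b'   => [],  // tag allowed, no attributes allowed
    'p'   => [],
];

// Returns true only if the tag is whitelisted, the attribute is whitelisted
// for that tag, and the value matches that attribute's strict pattern.
function isAllowedAttribute(array $whitelist, string $tag, string $attr, string $value): bool
{
    $tag  = strtolower($tag);
    $attr = strtolower($attr);
    if (!isset($whitelist[$tag][$attr])) {
        return false;
    }
    return preg_match($whitelist[$tag][$attr], $value) === 1;
}

var_dump(isAllowedAttribute($whitelist, 'a', 'href', 'http://example.com/page')); // bool(true)
var_dump(isAllowedAttribute($whitelist, 'a', 'href', 'javascript:alert(1)'));     // bool(false)
```

A traversal over the parsed tree could call this for every attribute and drop the ones that fail, so anything not explicitly described by the whitelist is rejected by default.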

+3

IMHO htmLawed is the best: fast, full HTML coverage, and the most flexible, with a black OR white list for tags and attributes. Safe? It defeats all of the ha.ckers XSS codes.

0

How about using PHP's own HTML parser?

I was curious, so I wrote some test code (requires PHP 5.3.6+):

    $badHtml = file_get_contents('badHtml.txt');
    $html = sprintf('<div id="input">%s</div>', $badHtml);

    // tidy is not required, but may fix invalid markup
    $tidy = new \tidy();
    $tidy->parseString($html, array(), 'utf8');
    $tidy->cleanRepair();

    $dom = new \DomDocument('1.0', 'UTF-8');
    libxml_use_internal_errors(true);
    $dom->loadHtml($tidy);
    $input = $dom->getElementById('input');

    // tag as key, attributes as values
    $allowed = array(
        'table'  => array('border'),
        'tbody'  => array(),
        'tr'     => array(),
        'td'     => array(),
        'th'     => array(),
        'img'    => array('src', 'alt'),
        'p'      => array(),
        'ul'     => array(),
        'ol'     => array(),
        'li'     => array(),
        'a'      => array('href', 'title'),
        'strong' => array(),
        'em'     => array(),
        'sub'    => array(),
        'sup'    => array(),
    );

    $walk = function(\DomNode $node) use($allowed, &$walk){
        // only check tags
        if($node->nodeType !== XML_ELEMENT_NODE) return;

        // remove any element that is not in the whitelist
        if(!isset($allowed[$node->nodeName]))
            return $node->parentNode->removeChild($node);

        // copy the attributes to an array first so that removing one
        // does not disturb the iteration
        foreach(iterator_to_array($node->attributes) as $key => $attr){
            if(!in_array($key, $allowed[$node->nodeName], true))
                $node->removeAttribute($key);

            // expect URLs here
            if(!in_array($key, array('href', 'src'), true)) continue;

            if(!filter_var($attr->value, FILTER_VALIDATE_URL))
                return $node->parentNode->removeChild($node);
        }

        array_map($walk, iterator_to_array($node->childNodes));
    };

    // convert DOMNodeList to array because this way the bad stuff
    // can be removed within the loop
    array_map($walk, iterator_to_array($input->childNodes));

    // export HTML
    $sanitized = $dom->saveHtml($input);

Output without running Tidy:

[image]

Everything seems to be in order. Or did it remove too much? :) It should be faster than HTMLPurifier, theoretically safer because it is less permissive, and probably faster than regular expressions.

0