I am trying to find a good way to handle user-submitted data, in this case allowing HTML while keeping it as safe and as fast as possible.
I know EVERY SINGLE PERSON on this site seems to think http://htmlpurifier.org is the answer. I partially agree. htmlpurifier has the best open source code for filtering user-submitted HTML, but the solution is very bulky and not suited to a high-traffic site. I may use it someday, but for now my goal is to find a lighter-weight method.
I have been using the two functions below for about two and a half years without any problems, but I think it's time to get some input from the professionals here, if they are willing to help me.
The first function is called FilterHTML($string); it is run before the user data is saved to the MySQL database. The second function is called format_db_value($text, $nl2br = false), and I use it on the page where I display the user-submitted data.
Below the two functions is a bunch of XSS code that I found at http://ha.ckers.org/xss.html. I ran it through these two functions to see how effective my code is. I'm fairly pleased with the results: they blocked every snippet I tried, but I know it is still not 100% safe.
Can you guys look it over and give me some advice on the code itself, or even on the whole HTML filtering concept?
I would like to do a whitelist someday, but htmlpurifier is the only solution I have found that is useful for this, and as I said, it is not as lightweight as I would like.
function FilterHTML($string) {
    if (get_magic_quotes_gpc()) {
        $string = stripslashes($string);
    }
    $string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
    // convert decimal
    $string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
    // convert hex
    $string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
    //$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
    $string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
    $string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
    //$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
    $string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
    $string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*@([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //@IMPORT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
    $string = preg_replace('#</*\w+:\w[^>]*>
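One thing worth noting about the two numeric-entity conversions near the top of FilterHTML: they rely on the /e modifier, which was deprecated in PHP 5.5 and removed in PHP 7.0. Below is a rough sketch of how the same two steps might look with preg_replace_callback instead. The function name decode_numeric_entities is hypothetical (it is not part of the code above), and the sketch assumes PHP 5.3 or later for anonymous functions. It keeps the same single-byte chr() behavior, so it is not a proper Unicode decoder.

```php
<?php
// Hypothetical helper, not part of the original FilterHTML: decodes numeric
// character references without using the /e modifier.
function decode_numeric_entities($string)
{
    // Decimal notation, e.g. "&#106" becomes "j". Like the original pattern,
    // this does not require (or consume) a trailing semicolon.
    $string = preg_replace_callback('/&#(\d+)/', function ($m) {
        // chr() works on single bytes, so the value is masked to 0-255.
        return chr(((int) $m[1]) & 0xFF);
    }, $string);

    // Hex notation, e.g. "&#x6A" becomes "j".
    $string = preg_replace_callback('/&#x([a-f0-9]+)/i', function ($m) {
        return chr(((int) hexdec($m[1])) & 0xFF);
    }, $string);

    return $string;
}

echo decode_numeric_entities('&#106&#97&#118&#97'), "\n"; // java
echo decode_numeric_entities('&#x6A&#x61'), "\n";          // ja
```

Because the callback receives the match as plain data, nothing from the user's input is ever evaluated as PHP code, which is the main reason /e was dropped.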
Below is the function that displays the user-submitted code on a web page.
function format_db_value($text, $nl2br = false) {
    if (is_array($text)) {
        $tmp_array = array();
        foreach ($text as $key => $value) {
            $tmp_array[$key] = format_db_value($value);
        }
        return $tmp_array;
    } else {
        $text = htmlspecialchars(stripslashes($text));
        if ($nl2br) {
            return nl2br($text);
        } else {
            return $text;
        }
    }
}
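To make the display step concrete, here is a small sketch of what the non-array branch of format_db_value does to a stored value. The sample string is made up for illustration; the calls are the same ones used in the function above.

```php
<?php
// Made-up stored value: some markup followed by a line break.
$stored = '<b>hello</b>' . "\n" . 'hiii';

// Same steps as the non-array branch of format_db_value.
$escaped = htmlspecialchars(stripslashes($stored));

// The markup is escaped: &lt;b&gt;hello&lt;/b&gt; followed by "hiii" on the next line.
echo $escaped, "\n";

// With $nl2br = true, the newline additionally gets a <br /> in front of it.
echo nl2br($escaped), "\n";
```

As far as I can tell, this means any tags that survive FilterHTML are shown as literal text at display time rather than rendered as HTML.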
Below are the snippets from ha.ckers.org; none of them seem to work against my functions above.
I have not tried every one on that site, as there are many more; these are just some of them. The original code is on the top line of each set, and the code after running it through my functions is on the line below.
<IMG SRC="javascript:alert(\'XSS\');"><b>hello</b> hiii
<IMG SRC=...alert('XSS');"><b>hello</b> hiii

<IMG SRC=JaVaScRiPt:alert('XSS')>
<IMG SRC=...alert('XSS')>

<IMG SRC=javascript:alert(String.fromCharCode(88,83,83))>
<IMG SRC=...alert(String.fromCharCode(88,83,83))>

<IMG SRC=& / svg + xml; base64, PHN2ZyB4bWxuczpzdmc9Imh0dH A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI + YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg ==" type = "image / svg + xml" AllowScriptAccess = "always"> </ EMBED>