I've been experimenting with PHP's DOMDocument and its related classes to write an HTML parser that can do this kind of thing. It is at a very early stage of development and nowhere near ready for real use, but my early experiments suggest the idea has some promise.
Basically, you load your markup into a DOMDocument and then traverse the tree. For each node, you check whether its node name (a tag name, or a special name such as #text or #comment) is in a whitelist of allowed names. If it is not in the list, the node is removed from the tree.
You can use a similar approach to find all SCRIPT tags in a piece of markup and remove them. Script-based XSS is toothless if you can pull any inline scripts out of the markup you're given.
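As a rough illustration of the SCRIPT-stripping idea, here is a minimal sketch. The function name removeScriptTags is made up for this example and is not part of the class further down; it also assumes the input is an HTML fragment that DOMDocument can parse.

```php
<?php
// Hypothetical helper (not part of the HtmlClean class): strips every
// <script> element from a piece of markup using DOMDocument.
function removeScriptTags(string $html): string
{
    $dom = new DOMDocument();

    // Real-world markup is rarely well formed, so collect libxml warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName returns a live list, so copy it into a plain
    // array first; removing nodes while walking the live list skips entries.
    $scripts = iterator_to_array($dom->getElementsByTagName('script'));
    foreach ($scripts as $script) {
        $script->parentNode->removeChild($script);
    }

    // saveHTML serializes the whole document, including the html/body
    // wrapper that loadHTML adds around a fragment.
    return $dom->saveHTML();
}

echo removeScriptTags('<p>Hello</p><script>alert("xss")</script><p>World</p>');
```

Note that this sketch only removes SCRIPT elements. Inline event-handler attributes such as onclick are not touched by it, which is why the class below also keeps a whitelist of attributes.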
This is the code I'm using, along with a test case that processes the StackOverflow homepage. As I said, this is far from production-quality code and nothing more than a proof of concept. Nevertheless, I hope you find it useful.
<?php

class HtmlClean
{
    // Node names that are allowed to stay in the tree.
    private $whiteList = array (
        '#cdata-section', '#comment', '#text',
        'a', 'abbr', 'acronym', 'address', 'b', 'big', 'blockquote', 'body',
        'br', 'caption', 'cite', 'code', 'col', 'colgroup', 'dd', 'del',
        'dfn', 'div', 'dl', 'dt', 'em', 'fieldset', 'h1', 'h2', 'h3', 'h4',
        'h5', 'h6', 'head', 'hr', 'html', 'i', 'img', 'ins', 'kbd', 'li',
        'link', 'meta', 'ol', 'p', 'pre', 'q', 'samp', 'small', 'span',
        'strike', 'strong', 'style', 'sub', 'sup', 'table', 'tbody', 'td',
        'tfoot', 'th', 'thead', 'title', 'tr', 'tt', 'ul', 'var'
    );

    // Attribute names that are allowed to stay on an element.
    private $attrWhiteList = array ('class', 'id', 'title');

    private $dom = NULL;

    public function getWhiteListTags ()
    {
        // Reindex in case earlier removals left gaps in the keys.
        $this -> whiteList = array_values ($this -> whiteList);
        return ($this -> whiteList);
    }

    public function addWhiteListTag ($tagName)
    {
        $tagName = strtolower (trim ($tagName));
        if (!in_array ($tagName, $this -> whiteList)) {
            $this -> whiteList [] = $tagName;
        }
    }

    public function removeWhiteListTag ($tagName)
    {
        // array_search returns 0 for the first entry, so compare against
        // false explicitly instead of relying on truthiness.
        $index = array_search ($tagName, $this -> whiteList);
        if ($index !== false) {
            unset ($this -> whiteList [$index]);
        }
    }

    public function loadHTML ($html)
    {
        if (!$this -> dom) {
            $this -> dom = new DOMDocument();
        }
        $this -> dom -> preserveWhiteSpace = false;
        $this -> dom -> formatOutput = true;
        return $this -> dom -> loadHTML ($html);
    }

    public function outputHtml ()
    {
        $ret = '';
        if ($this -> dom) {
            $ret = $this -> dom -> saveXML ();
        }
        return ($ret);
    }

    private function cleanAttrs (DOMNode $elem)
    {
        $attrs = $elem -> attributes;
        $index = $attrs -> length;
        // Walk backwards so that removing an attribute does not shift
        // the positions of the ones still to be checked.
        while (--$index >= 0) {
            $attrName = strtolower ($attrs -> item ($index) -> name);
            if (!in_array ($attrName, $this -> attrWhiteList)) {
                $elem -> removeAttribute ($attrName);
            }
        }
    }

    private function cleanNodes (DOMNode $elem)
    {
        $removed = array ();
        if (in_array (strtolower ($elem -> nodeName), $this -> whiteList)) {