EDIT
To use HTML Purifier's HTML.ForbiddenElements config directive, it looks like you would do something like:
require_once '/path/to/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);
http://htmlpurifier.org/docs
HTML.ForbiddenElements should be set to an array. What I do not know is what form the array members should take:
array('script','style','applet')
Or:
array('<script>','<style>','<applet>')
Or something else?
I think it is the first form, without delimiters; HTML.AllowedElements uses a configuration-string form that has a fair amount in common with TinyMCE's valid_elements syntax:
tinyMCE.init({ ... valid_elements : "a[href|target=_blank],strong/b,div[align],br", ... });
So my guess is that each entry is just the bare element name, and no attributes should be provided (since you are forbidding the whole element... although there is HTML.ForbiddenAttributes, too). But that is an assumption.
I will add this note from the HTML.ForbiddenAttributes documentation:
Warning: This directive complements %HTML.ForbiddenElements, so check out that directive for a discussion of why you should think twice before using this directive.
Blacklisting is simply not as "reliable" as whitelisting, but you may have your own reasons. Just beware and be careful.
Without testing, I'm not sure what to tell you. I will keep looking for an answer, but I will most likely go to bed first. It's getting late. :)
Although I think you really should use HTML Purifier and its HTML.ForbiddenElements directive, a reasonable alternative, if you really want to use strip_tags(), is to derive a whitelist from the blacklist. In other words, remove what you do not want from the full list of elements, and then allow what is left.
For instance:
function blacklistElements($blacklisted = '', &$errors = array()) {
    if ((string)$blacklisted == '') {
        $errors[] = 'Empty string.';
        return array();
    }

    // HTML5 elements, in the <tag> form that strip_tags() expects.
    $html5 = array(
        "<menu>","<command>","<summary>","<details>","<meter>","<progress>",
        "<output>","<keygen>","<textarea>","<option>","<optgroup>","<datalist>",
        "<select>","<button>","<input>","<label>","<legend>","<fieldset>","<form>",
        "<th>","<td>","<tr>","<tfoot>","<thead>","<tbody>","<col>","<colgroup>",
        "<caption>","<table>","<math>","<svg>","<area>","<map>","<canvas>","<track>",
        "<source>","<audio>","<video>","<param>","<object>","<embed>","<iframe>",
        "<img>","<del>","<ins>","<wbr>","<br>","<span>","<bdo>","<bdi>","<rp>","<rt>",
        "<ruby>","<mark>","<u>","<b>","<i>","<sup>","<sub>","<kbd>","<samp>","<var>",
        "<code>","<time>","<data>","<abbr>","<dfn>","<q>","<cite>","<s>","<small>",
        "<strong>","<em>","<a>","<div>","<figcaption>","<figure>","<dd>","<dt>",
        "<dl>","<li>","<ul>","<ol>","<blockquote>","<pre>","<hr>","<p>","<address>",
        "<footer>","<header>","<hgroup>","<aside>","<article>","<nav>","<section>",
        "<body>","<noscript>","<script>","<style>","<meta>","<link>","<base>",
        "<title>","<head>","<html>"
    );

    // Normalize the blacklist: lowercase it, keep only letters and spaces,
    // then wrap each name in angle brackets.
    $list = trim(strtolower($blacklisted));
    $list = preg_replace('/[^a-z ]/i', '', $list);
    $list = '<' . str_replace(' ', '> <', $list) . '>';
    $list = array_map('trim', explode(' ', $list));

    // Whatever remains after removing the blacklisted tags is the whitelist.
    return array_diff($html5, $list);
}
Then run it:
$blacklisted = '<html> <bogus> <EM> em li ol';
$errors = array();
$whitelist = blacklistElements($blacklisted, $errors);

if (count($errors)) {
    echo "There were errors.\n";
    print_r($errors);
    echo "\n";
} else {
    print_r($whitelist);
}
http://codepad.org/LV8ckRjd
So, if you pass in what you do not want to allow, it returns the remaining HTML5 elements as an array, which you can then pass to strip_tags() after joining it into a string:
$stripped = strip_tags($html, implode('', $whitelist));
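To make that last step concrete, here is a minimal sketch with a short, made-up whitelist (the three tags and the sample markup are assumptions for illustration, not the real output of the function above). It also shows something worth knowing about strip_tags(): it removes the tags themselves, not the text between them.

```php
<?php
// Hypothetical whitelist; in practice this would come from blacklistElements().
$whitelist = array('<p>', '<strong>', '<a>');

// strip_tags() accepts the allowed tags as one concatenated string.
$allowed = implode('', $whitelist);   // "<p><strong><a>"

$html = '<p>Hello <em>big</em> <strong>world</strong><script>alert(1)</script></p>';
$stripped = strip_tags($html, $allowed);

echo $stripped, "\n";
// <p>Hello big <strong>world</strong>alert(1)</p>
```

Note that the script tags are gone but alert(1) survives as plain text, and attributes on allowed tags are left untouched, which is one more reason to prefer HTML Purifier for untrusted input.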
Caveat emptor
Now, I have kind of hacked this together, and I know there are some problems I have not thought through yet. For example, from the strip_tags() manual page, for the $allowable_tags argument:
Note:
This parameter should not contain whitespace. strip_tags() sees a tag as a case-insensitive string between < and the first whitespace or >. It means that strip_tags("<br/>", "<br>") returns an empty string.
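A quick way to see what that note means in practice is to run a few inputs through strip_tags() yourself. The sketch below uses made-up sample strings. I have deliberately not written down the expected result for the self-closing case: the PHP manual also says that self-closing XHTML tags are ignored in the allowed list (so allowing <br> is supposed to cover <br/> as well), and the result may depend on your PHP version.

```php
<?php
// Sample inputs (assumptions for illustration only).
$samples = array(
    'plain'        => 'one<br>two',
    'uppercase'    => 'one<BR>two',
    'self-closing' => 'one<br/>two',
    'not allowed'  => 'one<hr>two',
);

$results = array();
foreach ($samples as $label => $input) {
    // Only <br> is allowed; any other tag is stripped.
    $results[$label] = strip_tags($input, '<br>');
    printf("%-12s %-14s => %s\n", $label, $input, $results[$label]);
}
```

Comparing the "self-closing" row with the others on your own PHP version tells you whether the whitelist needs anything beyond the plain <tagName> form.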
It's late, and for some reason I can't quite work out what this means for this approach, so I will have to think about it tomorrow. I also compiled the list of HTML elements in the function's $html5 array from this MDN page. A keen reader may notice that all the tags are in this form:
<tagName>
I'm not sure how this will affect the outcome, or whether it needs to take into account variations such as the short-tag form <tagName/> and some of the, ahem, odder variations. And, of course, there are more tags out there.
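One way to sidestep at least part of that question is to normalise the blacklist input before comparing it with the element list. The following is only a sketch: normalizeTagNames() is a hypothetical helper of mine, not part of PHP or HTML Purifier, and it assumes the blacklist arrives as a loose string of names and tags. It reduces spellings such as em, <EM>, <br/> and <br /> to the bare lowercase name, and unlike the letters-only filter in the function above it keeps digits, so names such as h1 survive.

```php
<?php
// Hypothetical helper: reduce a loose list of tag spellings to bare names.
function normalizeTagNames($input) {
    // First alternative: anything written as a tag (<em>, </em>, <br/>, <a href="x">).
    // Second alternative: a bare name such as "em" or "h1".
    $pattern = '/<\s*\/?\s*([a-z][a-z0-9]*)[^>]*>|([a-z][a-z0-9]*)/i';
    preg_match_all($pattern, (string)$input, $matches, PREG_SET_ORDER);

    $names = array();
    foreach ($matches as $m) {
        // Group 1 is set for tag spellings, group 2 for bare names.
        $name = !empty($m[1]) ? $m[1] : $m[2];
        $names[] = strtolower($name);
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($names));
}

$names = normalizeTagNames('<html> <bogus> <EM> em li <br/> <br /> h1');
print_r($names);

// Wrap the names in the <tag> form used for the whitelist comparison.
$tags = array_map(function ($n) { return '<' . $n . '>'; }, $names);
```

This only cleans up the blacklist side; how strip_tags() treats odd spellings in the HTML being filtered is a separate question.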
So this is probably not ready for production. But you get the idea.