PHP - regular expression to remove all occurrences of event attributes

After hours of trying, I'm here to ask. I want to remove all occurrences of JS event attributes and the style attribute from POSTed text. The text may or may not contain newlines.

Sample input text:

<a href="http://www.google.com" onclick="unwanted_code" style="unwanted_style" ondblclick="unwanted_code" onmouseover="unwanted_code">google</a> is a search engine. There are other engines too. <a href="http://www.yahoo.com" onclick="unwanted_code" ondblclick="unwanted_code" onmouseover="unwanted_code" style="unwanted_style">yahoo</a> is another engine.

First attempt:

$pattern[0] = '/(<[^>]+) on.*=".*?"/iU';
$replace[0] = '$1';
$pattern[1] = '/(<[^>]+) style=".*?"/iU';
$replace[1] = '$1';
$out = preg_replace($pattern, $replace, $in);

Output:

<a href="http://www.google.com">yahoo</a> is another engine.

Second attempt:

$out = preg_replace_callback('/(<[^>]+) on.*=".*?"/iU', function($m) {return $m[1];}, $in);

Output:

<a href="http://www.google.com">yahoo</a> is another engine.

The output I'm trying to get is:

<a href="http://www.google.com">google</a> is a search engine. There are other engines too. <a href="http://www.yahoo.com">yahoo</a> is another engine.

Can anyone help me?

3 answers

What about:

$content = '<a href="http://www.google.com" onclick="unwanted_code" style="unwanted_style" ondblclick="unwanted_code" onmouseover="unwanted_code">google</a> is a search engine. There are other engines too. <a href="http://www.yahoo.com" onclick="unwanted_code" ondblclick="unwanted_code" onmouseover="unwanted_code" style="unwanted_style">yahoo</a> is another engine.';

$result = preg_replace('%(<a href="[^"]+")[^>]+(>)%m', "$1$2", $content);
echo $result,"\n";

Output:

<a href="http://www.google.com">google</a> is a search engine. There are other engines too. <a href="http://www.yahoo.com">yahoo</a> is another engine.

An alternative is to parse the HTML with DOMDocument and keep only a whitelist of tags and attributes:

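// $html holds the POSTed text; wrap it so the fragment parses as a full document
$doc = new DOMDocument();
$doc->loadHTML('<html><body>' . $html . '</body></html>');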

$allowedTags = ['a' => ['href']];

$body = $doc->getElementsByTagName('body')->item(0);

$elements = $body->getElementsByTagName('*');
for ($k = 0; $element = $elements->item($k); ) {
    $name = strtolower($element->nodeName);
    if (isset($allowedTags[$name])) {
        $allowedAttributes = $allowedTags[$name];
        for ($i = 0; $attribute = $element->attributes->item($i); ) {
            if (!in_array($attribute->nodeName, $allowedAttributes)) {
                $element->removeAttribute($attribute->nodeName);
                continue;
            }
            ++$i;
        }
    } else {
        $element->parentNode->removeChild($element);
        continue;
    }
    ++$k;
}

$result = '';

foreach ($body->childNodes as $childNode) {
    $result .= $doc->saveXML($childNode);
}

echo $result;
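If the goal is only to drop event handlers and style while keeping every tag, the same DOM traversal can be turned into a blacklist. This is a minimal sketch under some assumptions: the function name stripEventAndStyleAttributes is hypothetical, any attribute whose name starts with "on" is treated as an event handler, and the input is assumed to be ASCII or otherwise safe for loadHTML's default encoding handling.

```php
<?php
// Hypothetical helper: removes on* and style attributes from an HTML fragment.
function stripEventAndStyleAttributes(string $html): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);

    foreach ($body->getElementsByTagName('*') as $element) {
        // Gather names first: removing attributes while iterating
        // the live attribute map would skip entries.
        $toRemove = [];
        foreach ($element->attributes as $attribute) {
            $name = strtolower($attribute->nodeName);
            if ($name === 'style' || strpos($name, 'on') === 0) {
                $toRemove[] = $attribute->nodeName;
            }
        }
        foreach ($toRemove as $name) {
            $element->removeAttribute($name);
        }
    }

    // Serialize only the children of <body>, not the wrapper we added.
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
    return $result;
}
```

Calling stripEventAndStyleAttributes($in) on the sample text from the question should leave only the href attributes on both links.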

Since you want to keep one attribute (href), you cannot simply delete all of them. The following does what you want, but it requires listing every unwanted attribute:

$out = preg_replace('#\s+(onclick|style|ondblclick|onmouseover)="[^"]*"#i', '', $in);

It might be simple, but it just works :)
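The explicit list can be avoided by matching any attribute whose name starts with "on", and by restricting the replacement to the inside of opening tags so ordinary text is left alone. The following is only a sketch: the function name stripAttributesWithRegex is hypothetical, and it assumes attribute values never contain a ">" character, which a regex-based approach cannot handle reliably.

```php
<?php
// Hypothetical helper: strips on* and style attributes using regexes only.
function stripAttributesWithRegex(string $in): string
{
    return preg_replace_callback(
        // Step 1: find each opening tag (assumes no ">" inside attribute values).
        '/<[a-z][^>]*>/i',
        function (array $m): string {
            // Step 2: inside that tag, remove on* and style attributes,
            // whether the value is double-quoted, single-quoted, or unquoted.
            return preg_replace(
                '/\s+(?:on[a-z]+|style)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i',
                '',
                $m[0]
            );
        },
        $in
    );
}

$in = '<a href="http://www.google.com" onclick="unwanted_code" style="unwanted_style">google</a> is a search engine.';
echo stripAttributesWithRegex($in), "\n";
```

Working tag by tag also sidesteps the problem in the question's attempts, where a single pattern ran across the whole text and swallowed everything between the first and the last link.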
