Remove on* JS event attributes from HTML tags

Please help me parse simple HTML in PHP with a regular expression. I need to remove JS event attributes from HTML code. I am not good at PHP regular expressions.

Code Examples:

<button onclick="..javascript instruction..">

Result: <button>

<button onclick="..javascript instruction.." value="..">

Result: <button value="..">

<button onclick=..javascript instruction..>

Result: <button>

<button onclick=..javascript instruction.. value>

Result: <button value>

I need this to work both with and without quotes, because all modern browsers accept attributes without quotes.

Note: I need to handle not only onclick but every attribute whose name starts with 'on'.

Note (2): Please do not suggest an HTML parser, because the DOM tree would be very large to parse.

UPDATE: Thank you for your replies! I now use the HTMLPurifier component in a small framework that I wrote.


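There is nothing wrong with tokenizing with a regex. But writing a full HTML tokenizer with regexes is a lot of work and hard to get right. I would recommend a proper parser, since you will probably need to remove script tags and the like as well.

Assuming a full tokenizer is not needed, the regex and code below can be used to remove on* attributes from HTML tags. Because no real tokenizer is used, it will also match strings that merely look like tags, for example inside scripts, comments, CDATA, etc.

There is no guarantee that every event attribute is removed for every input/browser combination! See the note below.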


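Note on error tolerance:

Browsers are usually forgiving of errors. Because of that, it is hard to tokenize tags and see the attributes as the browser would when "invalid" data is present. Since error tolerance and handling differ between browsers, no solution can work for all of them in all cases.

So: some browser (current, past or future version) may treat something that this code does not consider a tag as a tag, and execute the JS code.

The code tries to mimic the tag tokenizing (and error tolerance/handling) of recent Google Chrome versions. Firefox appears to behave in a similar way.

IE 7 differs; in some cases it is less tolerant (which is better than being more tolerant). (IE 6 - let's not go there. See the XSS Filter Evasion Cheat Sheet.)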


The code:


$redefs = '(?(DEFINE)
    (?<tagname> [a-z][^\s>/]*+    )
    (?<attname> [^\s>/][^\s=>/]*+    )  # first char can be pretty much anything, including =
    (?<attval>  (?>
                    "[^"]*+" |
                    \'[^\']*+\' |
                    [^\s>]*+            # unquoted values can contain quotes, = and /
                )
    ) 
    (?<attrib>  (?&attname)
                (?: \s*+
                    = \s*+
                    (?&attval)
                )?+
    )
    (?<crap>    [^\s>]    )             # most crap inside tag is ignored, will eat the last / in self closing tags
    (?<tag>     <(?&tagname)
                (?: \s*+                # spaces between attributes not required: <b/foo=">"style=color:red>bold red text</b>
                    (?>
                        (?&attrib) |    # order matters
                        (?&crap)        # if not an attribute, eat the crap
                    )
                )*+
                \s*+ /?+
                \s*+ >
    )
)';


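// removes onanything attributes from all matched HTML tags
function remove_event_attributes($html){
    global $redefs;
    $re = '(?&tag)' . $redefs;
    // a callback is used here because the /e modifier is not available in PHP 7+
    return preg_replace_callback("~$re~xi", function ($m) {
        return remove_event_attributes_from_tag($m[0]);
    }, $html);
}

// removes onanything attributes from a single opening tag
function remove_event_attributes_from_tag($tag){
    global $redefs;
    // group 1: tag start, group 2: an attribute, group 3: a stray character
    $re = '( ^ <(?&tagname) ) | \G \s*+ (?> ((?&attrib)) | ((?&crap)) )' . $redefs;
    return preg_replace_callback("~$re~xi", function ($m) {
        $attrib = isset($m[2]) ? $m[2] : '';
        // replace only attributes whose name starts with "on"; keep everything else
        return preg_match('/^on/i', $attrib) ? ' ' : $m[0];
    }, $tag);
}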


Usage:

$str = '
<button onclick="..javascript instruction..">
<button onclick="..javascript instruction.." value="..">
<button onclick=..javascript_instruction..>
<button onclick=..javascript_instruction.. value>
<hello word "" ontest = "hai"x="y"onfoo=bar/baz  />
';

echo $str . "\n----------------------\n";

echo remove_event_attributes($str);

Output:

<button onclick="..javascript instruction..">
<button onclick="..javascript instruction.." value="..">
<button onclick=..javascript_instruction..>
<button onclick=..javascript_instruction.. value>
<hello word "" ontest = "hai"x="y"onfoo=bar/baz  />

----------------------

<button >
<button  value="..">
<button >
<button  value>
<hello word "" x="y"   />

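I think you would be better off using DOMDocument.

Load the HTML into a DOM, go through the elements, remove every on* attribute, and save the result.

Also, DOMDocument tolerates sloppy HTML, so you get well-formed HTML back.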
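
A minimal sketch of what the DOMDocument approach could look like, under a few assumptions: the function name `strip_event_attributes_dom` and the wrapper `<div>` are hypothetical and only there for illustration, the input is an HTML fragment in ASCII (or with entities for other characters), and the fragment does not contain markup that would close the wrapper early. It is not a replacement for a real sanitizer such as HTMLPurifier.

```php
<?php
// Hypothetical helper: removes every attribute whose name starts with "on"
// from all elements of an HTML fragment, using PHP's DOM extension.
function strip_event_attributes_dom(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so its content can be read back without <html>/<body>.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Take a snapshot of all attribute nodes, then drop the on* ones.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//@*')) as $attr) {
        if (stripos($attr->nodeName, 'on') === 0) {
            $attr->ownerElement->removeAttribute($attr->nodeName);
        }
    }

    // Serialize only the children of the wrapper <div> (the first div in the document).
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: the onclick attribute is dropped, the value attribute is kept.
echo strip_event_attributes_dom('<button onclick="alert(1)" value="x">Go</button>'), "\n";
```

Unlike the regex version, this works on the parsed tree, so tag-like text inside comments is not touched. The cost is that the whole input has to be parsed into a DOM, which is the overhead the question wanted to avoid.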
