jQuery: Parse/manipulate HTML without executing scripts
I'm loading some HTML via Ajax in this format:
    <div id="div1"> ... some content ... </div>
    <div id="div2"> ... some content ... </div>
    ... etc.

I need to iterate over each div in the response and handle it separately. Having a separate string for the HTML content of each div, mapped to its id, would meet my requirements. However, the divs may contain script tags, which I need to preserve but not execute (they will be executed later when I attach the HTML to the document, so executing them during parsing would be bad). My first thought was to do something like this:
    // data being the result from $.get
    var clean = data.replace(/<script[\s\S]*?<\/script>/gi, function() {
        // insert some unique token, save the tag, put it back while I'm processing
    });
    $('<div/>').html(clean).children().each( /* ... process here ... */ );

But I'm worried that some stupid developer is going to come along and put something like this in one of the divs:
    <script>
    var foo = '</script>';
    // ...
    </script>

Then all of this would break. Not to mention, the whole thing feels like a hack to begin with. Does anyone know a better way?
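For illustration, the "swap each script tag for a token, then put it back" idea could be sketched in plain string code as below. The names stashScripts and restoreScripts and the @@SCRIPT_n@@ token format are made up for this sketch and are not part of jQuery. The non-greedy pattern stops at the first closing script tag it finds, so a closing tag hidden inside a string literal would still cut the match short.

```javascript
// Sketch only: stashScripts, restoreScripts and the token format are
// hypothetical names chosen for this example.
function stashScripts(html) {
  var saved = [];
  // [\s\S]*? also matches newlines and stops at the first closing tag.
  var clean = html.replace(/<script\b[\s\S]*?<\/script\s*>/gi, function (tag) {
    saved.push(tag);
    return '@@SCRIPT_' + (saved.length - 1) + '@@';
  });
  return { html: clean, scripts: saved };
}

function restoreScripts(html, saved) {
  // A function replacement inserts the saved tag literally, so "$" inside
  // the script source is not treated as a replacement pattern.
  return html.replace(/@@SCRIPT_(\d+)@@/g, function (match, index) {
    return saved[Number(index)];
  });
}

var data = '<div id="div1">a<script>var x = 1;</script>b</div>';
var stashed = stashScripts(data);
var restored = restoreScripts(stashed.html, stashed.scripts);
```

Here stashed.html holds the markup with the token in place of the script, and restored is identical to data again.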
EDIT: Here is the solution I came up with:
    var divSplitRegex = /(?:^|<\/div>)\s*<div\s+id="prefix-(.+?)">/g,
        idReplacement = preDelimeter + '$1' + postDelimeter;
    // strip the final closing tag, then turn each opening tag into delimiters
    var r = data.replace(/<\/div>\s*$/, '')
                .replace(divSplitRegex, idReplacement)
                .split(preDelimeter);
    $.each(r, function(i, chunk) {
        // test the value argument: `this` is a String object and always truthy
        if (chunk) {
            callback.apply(null, chunk.split(postDelimeter));
        }
    });

Here preDelimeter and postDelimeter are just unique strings, like "### I'd have to be an idiot to put this string in my content unescaped because it will break everything ###", and callback is a function that expects the div id and the div content. This only works because I know the divs will only ever have an id attribute, and that the id will have a special prefix. I suppose someone could put a div in their content with an id that has the same prefix, and that would break things too.
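To make the delimiter trick concrete, here is a self-contained run of the same idea on a made-up two-div response. The delimiter values, the sample data and the splitDivs wrapper name are assumptions for illustration only, and it uses plain Array.prototype.forEach instead of $.each so it runs without jQuery.

```javascript
// Hypothetical delimiter values; anything that cannot occur in the content works.
var preDelimeter = '@@PRE@@';
var postDelimeter = '@@POST@@';

// splitDivs is an illustrative wrapper name, not an existing API.
function splitDivs(data, callback) {
  var divSplitRegex = /(?:^|<\/div>)\s*<div\s+id="prefix-(.+?)">/g;
  var chunks = data
    .replace(/<\/div>\s*$/, '')                                   // drop the last closing tag
    .replace(divSplitRegex, preDelimeter + '$1' + postDelimeter)  // tag -> PRE id POST
    .split(preDelimeter);
  chunks.forEach(function (chunk) {
    // The first chunk is the empty string before the first delimiter.
    if (chunk) {
      callback.apply(null, chunk.split(postDelimeter));
    }
  });
}

var sample = '<div id="prefix-a">one</div> ' +
             '<div id="prefix-b"><script>alert(1)</script>two</div>';
var seen = [];
splitDivs(sample, function (id, content) {
  seen.push({ id: id, content: content });
});
// seen[0] is { id: 'a', content: 'one' }
// seen[1] is { id: 'b', content: '<script>alert(1)</script>two' }
```

Because the markup is only ever handled as a string, the script tag in the second div is carried along untouched and nothing is executed.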
So, I still don't like this solution. Does anyone have a better one?
FYI, having an unescaped </script> in any JavaScript code causes this problem in the browser anyway. Developers have to escape it regardless, so there is no excuse. That means you can "trust" that such code would already be broken.
    <body>
    <div>
    <script>
    alert('<script> tags </script> are not '+
          'valid in regular old HTML without being escaped.');
    </script>
    </div>
    </body>

Try it to see how it breaks. :)
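For reference, the usual way to escape it is to write the closing tag as <\/script> inside the string, which JavaScript reads as the same characters. A minimal sketch, assuming the sequence only appears inside string literals; escapeScriptClose is a hypothetical helper name, not an existing function:

```javascript
// Hypothetical helper: rewrites "</script" as "<\/script" in inline
// JavaScript source so the HTML parser no longer sees a closing tag.
// Assumes the sequence only occurs inside string literals.
function escapeScriptClose(js) {
  return js.replace(/<\/(script)/gi, '<\\/$1');
}

var source = "var foo = '</script>';";
var safe = escapeScriptClose(source);

// Evaluating the escaped source still yields the original string value.
var value = new Function(safe + ' return foo;')();
```

After this, safe no longer contains the literal closing tag, while value is still the unescaped string.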
You might find an alternative approach useful. You can use the following function to prevent JavaScript from executing:
    function preventJS(html) {
        // g flag so that every script tag is rewritten, not just the first
        return html.replace(/<script(?=(\s|>))/gi, '<script type="text/xml" ');
    }

This keeps the script tags inside the DOM, so the scripts can be used later.
I described this method on my blog here - JavaScript: how to prevent JavaScript from executing inside the html being added to the DOM.
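Assuming every script tag in the fragment should be disabled, the regex needs the g flag. A rough sketch of that variant, together with a way to turn the scripts back on later, might look like this. restoreJS is a hypothetical name for the inverse step and is not taken from the blog post.

```javascript
// Variant of preventJS with the g flag so every script tag is rewritten.
function preventJS(html) {
  return html.replace(/<script(?=(\s|>))/gi, '<script type="text/xml" ');
}

// Hypothetical inverse: removes exactly the attribute preventJS inserted.
function restoreJS(html) {
  return html.replace(/<script type="text\/xml" /gi, '<script');
}

var html = '<div><script>var a = 1;</script>' +
           '<script src="x.js"></script></div>';
var disabled = preventJS(html);
var enabled = restoreJS(disabled);
```

Here disabled has both tags marked as text/xml, and enabled is the original markup again.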