How to remove only HTML tags from a string using JavaScript

I want to remove HTML tags from a given string using JavaScript. I looked at the existing approaches, but each of them leaves an issue unresolved.

Current solutions

(1) Using JavaScript: create a temporary div element and read its text

    function remove_tags(html) {
        var tmp = document.createElement("DIV");
        tmp.innerHTML = html;
        return tmp.textContent || tmp.innerText;
    }

(2) Using regex

    function remove_tags(html) {
        return html.replace(/<(?:.|\n)*?>/gm, '');
    }

(3) Using jQuery

    function remove_tags(html) {
        return jQuery(html).text();
    }

These three solutions work correctly, but if the string looks like this

  <div> hello <hi all !> </div> 

the stripped result is just hello . But I only need to remove the HTML tags, so the result should be hello <hi all !>

Edit: The background is that I want to remove all HTML tags from user input in a specific text area, but still allow users to enter text such as <hi all>. The current approaches remove anything enclosed in <>.
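The behaviour described in the question can be reproduced with the regex from approach (2) alone. This is only an illustrative sketch: the function name `removeTagsNaive` is made up here, and the input is the example string from the question.

```javascript
// The regex from approach (2), unchanged; only the function name is new.
function removeTagsNaive(html) {
  return html.replace(/<(?:.|\n)*?>/gm, '');
}

var naiveInput = '<div> hello <hi all !> </div>';

// JSON.stringify makes the surrounding whitespace visible.
console.log(JSON.stringify(removeTagsNaive(naiveInput)));
// prints: " hello  "
```

The literal text `<hi all !>` disappears together with the real `<div>` tags, because the pattern matches anything between `<` and `>` regardless of the tag name.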

6 answers

Using a regex need not be a problem if you take a different approach. For example, look for all the tags, then check whether each tag name matches a list of specific, valid HTML tag names:

    var protos = document.body.constructor === window.HTMLBodyElement,
        validHTMLTags = /^(?:a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdi|bdo|bgsound|big|blink|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|data|datalist|dd|del|details|dfn|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|h1|h2|h3|h4|h5|h6|head|header|hgroup|hr|html|i|iframe|img|input|ins|isindex|kbd|keygen|label|legend|li|link|listing|main|map|mark|marquee|menu|menuitem|meta|meter|nav|nobr|noframes|noscript|object|ol|optgroup|option|output|p|param|plaintext|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|source|spacer|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video|wbr|xmp)$/i;

    function sanitize(txt) {
        var // This regex normalises anything between quotes.
            // (?!\1) is used because \1 is not a backreference inside [...].
            normaliseQuotes = /=(["'])(?:(?!\1)[\s\S])*\1/g,
            normaliseFn = function ($0) {
                return $0.replace(/</g, '&lt;').replace(/>/g, '&gt;');
            },
            replaceInvalid = function ($0, tag, off, txt) {
                var // Is it a valid tag?
                    invalidTag = protos &&
                        document.createElement(tag) instanceof HTMLUnknownElement
                        || !validHTMLTags.test(tag),
                    // Is the tag complete?
                    isComplete = txt.slice(off + 1).search(/^[^<]+>/) > -1;

                return invalidTag || !isComplete ? '&lt;' + tag : $0;
            };

        txt = txt.replace(normaliseQuotes, normaliseFn)
                 .replace(/<(\w+)/g, replaceInvalid);

        var tmp = document.createElement("DIV");
        tmp.innerHTML = txt;

        // Fall back to innerText in old browsers without textContent.
        return "textContent" in tmp ? tmp.textContent : tmp.innerText;
    }

Working demo: http://jsfiddle.net/m9vZg/3/

This works because browsers parse '>' as text if it is not part of a tag opened by a matching '<'. It does not suffer from the usual problems of parsing HTML with a regular expression, because you are looking only for the opening delimiter and the tag name; everything else does not matter.

This is also future-proof: the WebIDL specification tells vendors how to implement prototypes for HTML elements, so we try to create an HTML element from the currently matched tag name. If the element is an instance of HTMLUnknownElement, we know that it is not a valid HTML tag. The validHTMLTags regular expression provides a list of HTML tags for older browsers, such as IE 6 and 7, that do not implement these prototypes.
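The tag-name check at the heart of this approach can be sketched without the DOM. This is not the answer's code, only a minimal illustration under stated assumptions: `KNOWN_TAGS` is a deliberately short, hypothetical whitelist, and `escapeUnknownTags` is a made-up name. It only escapes the `<` of unrecognised tag names and leaves the actual stripping to a later HTML parse, as in `sanitize` above.

```javascript
// Hypothetical, deliberately short whitelist; the answer uses the full list.
var KNOWN_TAGS = /^(?:a|b|br|div|em|i|img|p|span|strong)$/i;

// Escape the "<" of anything that looks like a tag but has an unknown name,
// so that a later HTML parse treats it as plain text.
function escapeUnknownTags(txt) {
  return txt.replace(/<(\w+)/g, function (match, tag) {
    return KNOWN_TAGS.test(tag) ? match : '&lt;' + tag;
  });
}

console.log(escapeUnknownTags('<div> hello <hi all !> </div>'));
// prints: <div> hello &lt;hi all !> </div>
```

Only the opening delimiter and the tag name are inspected, which is why closing tags such as `</div>` pass through untouched.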


If you want to keep invalid markup intact, regular expressions are your best bet. Maybe something like this:

  text = html.replace(/<\/?(span|div|img|p...)\b[^<>]*>/g, "") 

Expand (span|div|img|p...) to the list of all tags (or just the ones you want to remove). NB: the list should be sorted by length, longer tags first!

This may give incorrect results in some cases (for example, attributes containing < or > characters), but the only real alternative may be writing a full HTML parser. That would not be extremely difficult, but it might be overkill here. Let us know.
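As a rough sketch of how the tag list might be expanded, the pattern can be built from an array. The names `TAGS_TO_STRIP`, `buildTagRegex` and `stripListedTags` are hypothetical, and the list is intentionally short. It assumes tag names are plain alphanumerics, so no regex escaping is done.

```javascript
// Hypothetical short list; extend it with the tags you want removed.
var TAGS_TO_STRIP = ['div', 'span', 'img', 'p', 'pre', 'a', 'b', 'i', 'br'];

function buildTagRegex(tagNames) {
  // Longer names first, following the advice in the answer.
  var sorted = tagNames.slice().sort(function (x, y) {
    return y.length - x.length;
  });
  return new RegExp('<\\/?(?:' + sorted.join('|') + ')\\b[^<>]*>', 'gi');
}

function stripListedTags(html) {
  return html.replace(buildTagRegex(TAGS_TO_STRIP), '');
}

console.log(JSON.stringify(stripListedTags('<div> hello <hi all !> </div>')));
// prints: " hello <hi all !> "
```

The sort follows the recommendation above. With the `\b` after the group, the order probably matters less, since a short name that matches only a prefix (such as `p` in `<pre>`) fails at `\b` and the engine tries the remaining alternatives.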

 var StrippedString = OriginalString.replace(/(<([^>]+)>)/ig,""); 

Here is my solution

    function removeTags() {
        var txt = document.getElementById('myString').value;
        var rex = /(<([^>]+)>)/ig;
        alert(txt.replace(rex, ""));
    }

I use a regex to reject HTML tags in my text area

Example

    <form>
        <textarea class="box"></textarea>
        <button>Submit</button>
    </form>
    <script>
    $(".box").focusout(function (e) {
        var reg = /<(.|\n)*?>/g;
        if (reg.test($('.box').val())) {
            alert('HTML tags are not allowed');
        }
        e.preventDefault();
    });
    </script>

    <script type="text/javascript">
    function removeHTMLTags() {
        var str = "<html><p>I want to remove HTML tags</p></html>";
        alert(str.replace(/<[^>]+>/g, ''));
    }
    </script>