Regexp to search / replace only text, not HTML attribute

Question

I use JavaScript to run a regular expression. Given that I am working with well-formed source, I want to remove any space before [,.] and keep exactly one space after [,.], except when the [,.] is part of a number. So I use:

text = text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');

The problem is that this also replaces text inside the attributes of HTML tags. For example, my text (always wrapped in tags):

 <p>Test,and test . Again <img src="xyz.jpg"> ...</p>

Now it adds a space, as in src="xyz. jpg", which is not what I want. How can I rewrite my regex? I want:

 <p>Test, and test. Again <img src="xyz.jpg"> ...</p>

Thanks!

+4

javascript html regex

jcisio Aug 11 '10 at 15:24

6 answers

Do not try to rewrite your regular expression to do this. You will not succeed, and you will almost certainly miss some corner cases. In the best case this leads to nasty bugs; in the worst case you will run into security problems.

Instead, since you are already using JavaScript and have well-formed code, use a genuine XML parser to iterate over the text nodes and apply your regular expression to them.

+1

scy Aug 11 '10 at 15:30

If you can access this text through the DOM, you can do this:

    function fixPunctuation(elem) {
        // check if parameter is an ELEMENT_NODE
        if (!(elem instanceof Node) || elem.nodeType !== Node.ELEMENT_NODE) return;
        var children = elem.childNodes, node;
        // iterate the child nodes of the element node
        for (var i = 0; children[i]; ++i) {
            node = children[i];
            // check the child's node type
            switch (node.nodeType) {
            case Node.ELEMENT_NODE:
                // call fixPunctuation if it's also an ELEMENT_NODE
                fixPunctuation(node);
                break;
            case Node.TEXT_NODE:
                // fix punctuation if it's a TEXT_NODE
                node.nodeValue = node.nodeValue.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
                break;
            }
        }
    }

Now just pass the DOM node to this function as follows:

    fixPunctuation(document.body);
    fixPunctuation(document.getElementById("foobar"));
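To try the same idea outside a browser, here is a minimal sketch that runs the traversal over a hand-built stand-in tree. The `el` and `txt` helpers and the `fixPunctuationTree` name are hypothetical and only mimic the `nodeType`, `childNodes` and `nodeValue` properties of real DOM nodes; the constants 1 and 3 are the standard numeric values of `Node.ELEMENT_NODE` and `Node.TEXT_NODE`.

```javascript
// Stand-in node type constants, using the same numeric values as the DOM.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helpers: build plain objects shaped like DOM nodes.
function el(tagName, attributes, childNodes) {
  return { nodeType: ELEMENT_NODE, tagName, attributes, childNodes };
}
function txt(nodeValue) {
  return { nodeType: TEXT_NODE, nodeValue };
}

// Recurse into elements and rewrite text nodes only; attributes are never read.
function fixPunctuationTree(node) {
  if (node.nodeType === TEXT_NODE) {
    node.nodeValue = node.nodeValue.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
  } else if (node.nodeType === ELEMENT_NODE) {
    node.childNodes.forEach((child) => fixPunctuationTree(child));
  }
}

// A tree resembling the sample paragraph from the question.
const tree = el('p', {}, [
  txt('Test,and test . Again '),
  el('img', { src: 'xyz.jpg' }, []),
  txt(' done'),
]);

fixPunctuationTree(tree);
console.log(tree.childNodes[0].nodeValue);      // text node after the fix
console.log(tree.childNodes[1].attributes.src); // attribute left untouched
```

Because the regex only ever sees the contents of text nodes, the `src` value cannot be affected. One limitation of this per-node approach is that punctuation at the very end of a text node has no following character for the pattern to match, so it is left as it is.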

+1

Gumbo Aug 11 '10 at 15:44

HTML is not a "regular language", so a regular expression is not the best tool for parsing it. You may be better off using an HTML parser, like this one, to get at the attribute, and then applying a regex to do something with the value.

Enjoy it!

0

Doug Aug 11 '10 at 15:29

Do not parse ~~regex~~ HTML with ~~HTML~~ regex. If you know your HTML is well-formed, use an HTML/XML parser. Otherwise, run it through Tidy first, and then use an XML parser.

0

Vivin Paliath Aug 11 '10 at 15:29

As stated above and many times before, HTML is not a regular language and therefore cannot be parsed using regular expressions.

You will have to do it recursively; I suggest traversing the DOM.

Try something like this ...

    function regexReplaceInnerText(curr_element) {
        if (curr_element.childNodes.length <= 0) {
            // termination case:
            // no children; this is a "leaf node"
            if (curr_element.nodeName == "#text" || curr_element.nodeType == 3) {
                // node is text; not an empty tag like <br />
                if (curr_element.data.replace(/^\s*|\s*$/g, '') != "") {
                    // node isn't just white space
                    // (you can skip this check if you want)
                    var text = curr_element.data;
                    text = text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
                    curr_element.data = text;
                }
            }
        } else {
            // recursive case:
            // this isn't a leaf node, so we iterate over all children and recurse
            for (var i = 0; curr_element.childNodes[i]; i++) {
                regexReplaceInnerText(curr_element.childNodes[i]);
            }
        }
    }

    // then get the element whose children text nodes you want to be regex'd
    regexReplaceInnerText(document.getElementsByTagName("body")[0]);
    // or if you don't want to do the whole document...
    regexReplaceInnerText(document.getElementById("ElementToRegEx"));

0

Richard JP Le Guen Aug 11 '10 at 15:33

Alan Moore · Accepted Answer · 2010-08-11T22:40:23+0000

You can use a negative lookahead to make sure that the match does not occur inside a tag:

 text = text.replace(/(?![^<>]*>) *([.,]) *([^ \d])/g, '$1 $2');

The usual warnings apply regarding CDATA sections, SGML comments, SCRIPT elements, and angle brackets in attribute values. But I suspect that your real problems will arise from the vagaries of "plain" text; HTML is not even in the same league. :D
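As a quick check, the sketch below runs the question's original pattern and the lookahead version side by side on a string close to the sample from the question. The variable names are made up for illustration, and the trailing "..." of the original sample is replaced with a plain word here, because both patterns also match a dot followed by another dot and would rewrite the ellipsis.

```javascript
// The pattern from the question: matches punctuation anywhere, tags included.
const naive = / *(,|\.) *([^ 0-9])/g;

// The lookahead version: (?![^<>]*>) rejects any position from which a '>'
// can be reached without crossing another '<' or '>', i.e. inside a tag.
const tagAware = /(?![^<>]*>) *([.,]) *([^ \d])/g;

// Sample based on the question, with the trailing "..." swapped for a word.
const sample = '<p>Test,and test . Again <img src="xyz.jpg"> done</p>';

const naiveResult = sample.replace(naive, '$1 $2');
const tagAwareResult = sample.replace(tagAware, '$1 $2');

console.log(naiveResult);    // <p>Test, and test. Again <img src="xyz. jpg"> done</p>
console.log(tagAwareResult); // <p>Test, and test. Again <img src="xyz.jpg"> done</p>
```

The lookahead relies on every `>` in the input closing a tag, so a literal `>` in the text or inside an attribute value would throw it off, which is one of the caveats listed above.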
