Regex to search and replace only in text, not in HTML attributes

I am using regular expressions in JavaScript. My source is well-formed, and I want to remove any space before [,.] and keep exactly one space after [,.], except when the [,.] is part of a number. So I use:

text = text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2'); 

The problem is that this also changes text inside the attributes of HTML tags. For example, given my text (which is always wrapped in tags):

 <p>Test,and test . Again <img src="xyz.jpg"> ...</p> 

It adds a space inside the attribute, giving src="xyz. jpg", which is not what I want. How can I rewrite my regex? I want

 <p>Test, and test. Again <img src="xyz.jpg"> ...</p> 

Thanks!

+4
6 answers

You can use a negative lookahead to make sure that the match does not start inside a tag:

 text = text.replace(/(?![^<>]*>) *([.,]) *([^ \d])/g, '$1 $2'); 

The usual caveats apply: CDATA sections, SGML comments, SCRIPT elements, and angle brackets in attribute values can all defeat this check. But I suspect that your real problems will come from the vagaries of "plain" text; HTML is not even in the same league. :D
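
To see the lookahead at work, here is a minimal sketch that wraps the expression above in a function and runs it on a sample close to the one in the question. The function name `fixPunctuationOutsideTags` and the `3.14` added to the sample are assumptions made for illustration; only the regex itself comes from this answer.

```javascript
// Hypothetical helper; the regex is the one shown in the answer.
function fixPunctuationOutsideTags(html) {
  // (?![^<>]*>) rejects a match start when the next angle bracket
  // ahead is ">", which means the position is inside a tag.
  return html.replace(/(?![^<>]*>) *([.,]) *([^ \d])/g, '$1 $2');
}

const input = '<p>Test,and test . Again <img src="xyz.jpg"> 3.14</p>';
console.log(fixPunctuationOutsideTags(input));
// → <p>Test, and test. Again <img src="xyz.jpg"> 3.14</p>
```

The `src` value and the number are left alone: the first because the lookahead sees a `>` before any `<`, the second because the character after the dot is a digit.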

+4

Do not try to rewrite your expression to do this. You will not succeed, and you will almost certainly miss some corner cases. At best this leads to annoying bugs; at worst you will run into security problems.

Instead, since you are already using JavaScript and have well-formed markup, use a real XML parser to iterate over the text nodes and apply your regular expression to them.
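
One way to follow this advice in a browser is sketched below: parse the markup, visit only the text nodes, and serialize the result. It assumes the browser's `DOMParser` in `text/html` mode rather than a strict XML parser, and the names `fixTextNodes` and `fixPunctuationInMarkup` are hypothetical. The punctuation regex is the one from the question.

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM

// Apply the punctuation fix to every text node below `root`.
// Only nodeType, nodeValue and childNodes are read, so attribute
// values are never touched.
function fixTextNodes(root) {
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node.nodeType === TEXT_NODE) {
      node.nodeValue = node.nodeValue.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
    }
    // childNodes may be missing on simplified node objects
    for (const child of Array.from(node.childNodes || [])) {
      stack.push(child);
    }
  }
}

// Browser-only wrapper (assumption: DOMParser is available).
function fixPunctuationInMarkup(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  fixTextNodes(doc.body);
  return doc.body.innerHTML;
}
```

Because attribute values are not text nodes, a call such as `fixPunctuationInMarkup('<p>Test,and test . Again <img src="xyz.jpg"></p>')` should leave the `src` attribute as it is. Text inside `<script>` or `<style>` elements is still visited, so skip those elements if the markup can contain them.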

+1

If you can access this text through the DOM, you can do this:

 function fixPunctuation(elem) {
     // check that the parameter is an ELEMENT_NODE
     if (!(elem instanceof Node) || elem.nodeType !== Node.ELEMENT_NODE) return;
     var children = elem.childNodes, node;
     // iterate over the child nodes of the element node
     for (var i = 0; children[i]; ++i) {
         node = children[i];
         // check the child's node type
         switch (node.nodeType) {
             case Node.ELEMENT_NODE:
                 // call fixPunctuation if it's also an ELEMENT_NODE
                 fixPunctuation(node);
                 break;
             case Node.TEXT_NODE:
                 // fix punctuation if it's a TEXT_NODE
                 node.nodeValue = node.nodeValue.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
                 break;
         }
     }
 }

Now just pass the DOM node to this function as follows:

 fixPunctuation(document.body);
 fixPunctuation(document.getElementById("foobar"));
+1

HTML is not a "regular language", so a regular expression is not the best tool for parsing it. You are probably better off using an HTML parser, like this one, to get the attribute, and then applying a regex to do something with the value.

Enjoy it!

0

Do not parse HTML with regex. If you know your HTML is well-formed, use an HTML/XML parser. Otherwise, run it through Tidy first, and then use an XML parser.

0

As stated above and many times before, HTML is not a regular language and therefore cannot be parsed using regular expressions.

You will have to do it recursively; I suggest walking the DOM.

Try something like this ...

 function regexReplaceInnerText(curr_element) {
     if (curr_element.childNodes.length <= 0) {
         // termination case:
         // no children; this is a "leaf node"
         if (curr_element.nodeName == "#text" || curr_element.nodeType == 3) {
             // node is text; not an empty tag like <br />
             if (curr_element.data.replace(/^\s*|\s*$/g, '') != "") {
                 // node isn't just white space
                 // (you can skip this check if you want)
                 var text = curr_element.data;
                 text = text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
                 curr_element.data = text;
             }
         }
     } else {
         // recursive case:
         // this isn't a leaf node, so we iterate over all children and recurse
         for (var i = 0; curr_element.childNodes[i]; i++) {
             regexReplaceInnerText(curr_element.childNodes[i]);
         }
     }
 }

 // then get the element whose children text nodes you want to be regex'd
 regexReplaceInnerText(document.getElementsByTagName("body")[0]);

 // or if you don't want to do the whole document...
 regexReplaceInnerText(document.getElementById("ElementToRegEx"));
0
