RegEx: don't match a specific character if it is enclosed in quotation marks

Disclosure: I have read this answer many times here on SO, and I know better than to use a regular expression to parse HTML. This question is only meant to expand my knowledge of regular expressions.

Let's say I have this line:

some text <tag link="fo>o"> other text 

I want to match the entire tag, but if I use <[^>]+> , it only matches <tag link="fo> .
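
To make the problem concrete, here is a minimal JavaScript sketch of the behaviour described above. The variable names are only illustrative; the pattern and the sample line are the ones from the question.

```javascript
// The naive pattern stops at the first ">", even when it sits inside quotes.
const line = 'some text <tag link="fo>o"> other text';
const naive = /<[^>]+>/;

const naiveMatch = line.match(naive)[0];
console.log(naiveMatch); // <tag link="fo>
```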

How can I make sure that a > inside quotes is ignored?

I can trivially write a parser with a while loop for this, but I want to know how to do this with a regular expression.

4 answers

Regular expression:

 <[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*> 

Online demo:

http://regex101.com/r/yX5xS8

Full explanation:

I know that this regular expression can be a headache to look at, so here is my explanation:

 <               # Open HTML tag
 [^>]*?          # Lazy negated character class for closing HTML tag
 (?:             # Open outer non-capture group
   (?:           # Open inner non-capture group
     ('|")       # Capture group for quotes, backreference group 1
     [^'"]*?     # Lazy negated character class for quotes
     \1          # Backreference 1
   )             # Close inner non-capture group
   [^>]*?        # Lazy negated character class for closing HTML tag
 )*              # Close outer non-capture group
 >               # Close HTML tag
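
For anyone who wants to try the pattern outside regex101, here is a small JavaScript sketch that runs it against the line from the question. The names tagPattern, sample and found are illustrative only, and the single-quote input is an extra case I made up to exercise the backreference.

```javascript
// The pattern from this answer, written as a JavaScript regex literal.
const tagPattern = /<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/;

const sample = 'some text <tag link="fo>o"> other text';
const found = sample.match(tagPattern);
console.log(found[0]); // <tag link="fo>o">

// The backreference \1 requires the closing quote to be the same kind as the opening one.
const singleQuoted = "<tag link='fo>o'>".match(tagPattern);
console.log(singleQuoted[0]); // <tag link='fo>o'>
```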

This is a slight improvement on Vasily Sirakis's answer. It handles "…" and '…' completely separately and does not use the lazy quantifier *? .

Regular expression

<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>

Demo

http://regex101.com/r/jO1oQ1

Explanation

 <               # start of HTML tag
 [^'">]*         # any non-single, non-double quote or greater than
 (               # outer group
   (             # inner group
     "[^"]*"     # "..."
     |           # or
     '[^']*'     # '...'
   )             # end of inner group
   [^'">]*       # any non-single, non-double quote or greater than
 )*              # zero or more of outer group
 >               # end of HTML tag

This version is slightly better than Vasily's in that single quotes are allowed inside "…" , double quotes are allowed inside '…' , and a malformed tag such as <a href='> will not be matched.

It is slightly worse than Vasily's answer in that the groups are capturing. If you don't want that, replace ( with (?: everywhere. (Using plain ( just makes the regex shorter and a little more readable.)
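
The claims above can be checked with a short JavaScript sketch. The variable names and the mixed-quote and malformed inputs are my own examples, and the non-capturing variant is simply the same pattern with ( replaced by (?: as suggested.

```javascript
// Capturing version from this answer, and the same pattern with non-capturing groups.
const capturing = /<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>/;
const nonCapturing = /<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>/;

const original = 'some text <tag link="fo>o"> other text';
const mixed = `<a title="it's > here">`; // single quote inside double quotes
const broken = "<a href='>";             // quote is never closed

const originalMatch = capturing.exec(original);
console.log(originalMatch[0]);            // <tag link="fo>o">
console.log(capturing.exec(mixed)[0]);    // <a title="it's > here">
console.log(capturing.test(broken));      // false

// The only difference between the two versions is the number of captured groups.
console.log(originalMatch.length);               // 3 (whole match + 2 groups)
console.log(nonCapturing.exec(original).length); // 1 (whole match only)
```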

 (<.+?>[^<]+>)|(<.+?>) 

You can write two regular expressions and then put them together using '|'. In this case:

 (<.+?>[^<]+>)   # will match: some text <tag link="fo>o"> other text
 (<.+?>)         # will match: some text <tag link="foo"> other text

If the first alternative matches, the second one is not tried, so make sure you put the special case first.
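
A rough JavaScript sketch of why the order matters. The reversed pattern generalFirst is a hypothetical variant written only for comparison; it is not part of this answer.

```javascript
// Special case first, as in this answer.
const specialFirst = /(<.+?>[^<]+>)|(<.+?>)/;
// Hypothetical reversed order, to show what goes wrong.
const generalFirst = /(<.+?>)|(<.+?>[^<]+>)/;

const quotedGt = 'some text <tag link="fo>o"> other text';
const plain = 'some text <tag link="foo"> other text';

console.log(specialFirst.exec(quotedGt)[0]); // <tag link="fo>o">
console.log(specialFirst.exec(plain)[0]);    // <tag link="foo">

// With the general case first, the match stops at the ">" inside the quotes.
console.log(generalFirst.exec(quotedGt)[0]); // <tag link="fo>
```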


If you want this to work with escaped double quotes, try:

/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g

For example:

 // xml is assumed to be the string being scanned
 const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
 const nextGtMatch = () => ((exec) => {
   return exec ? exec.index : -1;
 })(gtExp.exec(xml));

And if you are parsing a bunch of XML, you need to set .lastIndex .

 gtExp.lastIndex = xmlIndex;
 const attrEndIndex = nextGtMatch(); // the end of the tag attributes
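
Putting the pieces together, here is a self-contained sketch. The helper findTagEnd and the sample string are my own illustration, not part of the snippet above. The lookahead only accepts a > that is followed by an even number of unescaped double quotes up to the end of the string, so it handles double quotes only and rescans the rest of the input for every candidate.

```javascript
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;

// Hypothetical helper: index of the first ">" outside double quotes
// at or after `from`, or -1 if there is none.
function findTagEnd(xml, from = 0) {
  gtExp.lastIndex = from; // start the search at the given position
  const m = gtExp.exec(xml);
  return m ? m.index : -1;
}

// The attribute value contains both a ">" and an escaped double quote.
const xml = '<tag link="fo>o\\"x"> other text';
const end = findTagEnd(xml);

console.log(end);                           // 19
console.log(xml.slice(0, end + 1));         // <tag link="fo>o\"x">
console.log(findTagEnd(xml, end + 1));      // -1, no further ">" in the string
```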