
JavaScript regex: find all URLs outside <a> tags - nested tags

I built this regex:

((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/) 

The first group captures all the links in the HTML. The negative lookahead that follows excludes any URL that sits inside a tag as an attribute value, and any URL that sits inside an element as content.
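To see what the lookahead accepts and rejects, here is a minimal sketch that runs the regex above through String.prototype.matchAll. The sample markup and the example host names are made up for illustration and are not taken from the regex101 demo.

```javascript
// The regex from the question, with the g flag so matchAll can be used.
const urlOutsideTags = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)/g;

// Hypothetical sample markup, one case per line.
const html = [
  'Plain text: http://plain.example/page <br>',                // bare text
  '<img src="http://attr.example/pic.png">',                   // attribute value
  '<a href="http://href.example/">http://anchor.example/</a>', // href and link text
  '<p>http://paragraph.example/</p>',                          // content of a non-<a> element
].join('\n');

const matches = [...html.matchAll(urlOutsideTags)].map((m) => m[1]);
console.log(matches); // → [ 'http://plain.example/page' ]
```

Note that the URL inside <p> is rejected as well, because the second alternative fires on any closing tag that follows, not only on </a>.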

I would like only the <a> tags to be excluded, so it looks as if only the last alternative needs to change:

 [^<>]*?<\/a> 

But now there is a problem if I have nested tags, for example <b></b> inside <a>.
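The nested-tag problem can be reproduced with a short sketch. It assumes the changed alternative is simply dropped into the original regex; the sample markup is invented.

```javascript
// Original regex with the last alternative narrowed to </a> (assumed placement).
const narrowed = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/a>)/g;

const html =
  '<a href="#">http://plain.example/</a> ' +
  '<a href="#"><b>http://nested.example/</b></a>';

const found = [...html.matchAll(narrowed)].map((m) => m[1]);
console.log(found); // → [ 'http://nested.example/' ]
```

The first link text is excluded because </a> follows it directly. The second one leaks through: [^<>]*? cannot step over the < of </b>, so the lookahead never reaches </a>.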

Here is an example I'm working on: https://regex101.com/r/lM3hC5/6 (there should be 10 matches).

Negative lookaheads are still difficult for me. I thought the following should work, but it does not:

 (?!<a.+?<\/a>) 

https://regex101.com/r/hT1cG5/1
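A small sketch of why this lookahead has no effect, assuming it is placed right after the URL group (the question does not show the full pattern). A lookahead is tested at the position where the URL match ends, so it only asks whether the text immediately after the URL starts with <a.

```javascript
// Assumed placement: the lookahead directly follows the URL group.
const naive = /((https?|ftps?):\/\/[^"<\s]+)(?!<a.+?<\/a>)/g;

const html = '<a href="http://href.example/">http://anchor.example/</a>';

const found = [...html.matchAll(naive)].map((m) => m[1]);
console.log(found); // → [ 'http://href.example/', 'http://anchor.example/' ]
```

After the first URL comes a double quote and after the second comes </a>, neither of which starts with <a, so the negative lookahead succeeds both times and nothing is excluded.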

These are the latest discussions that have helped me:

1 answer

It turns out that perhaps the best solution is the following:

 ((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a) 

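It seems that a negative lookahead only works correctly here if it starts with a quantified pattern, not a literal string. The lookahead is tested at the position where the URL match ends, so a pattern that begins with <a only checks whether the text right after the URL starts with <a. In practice this means we can only scan forward from the URL and rely on backtracking.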

Again, the first alternative just makes sure that nothing inside an HTML tag (an attribute value) is matched. The second alternative then scans from the end of the URL up to </a, stopping at the first " character (a " is not a valid URL character, whereas < and > do occur when tags are nested).
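A minimal sketch of the regex above on invented markup, to show which URLs survive. The variable names and sample hosts are only for illustration.

```javascript
// The regex from this answer, with the g flag for matchAll.
const urlOutsideAnchors = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

const html = [
  '<p>http://paragraph.example/</p>',                                     // kept
  '<img src="http://attr.example/pic.png">',                              // attribute value
  '<a href="http://href.example/">see <b>http://nested.example/</b></a>', // href + nested text
].join('\n');

const matches = [...html.matchAll(urlOutsideAnchors)].map((m) => m[1]);
console.log(matches); // → [ 'http://paragraph.example/' ]
```

The URL in <p> is kept because a " (from src=") is reached before any </a. The nested URL is rejected because only </b> and no " lies between it and </a.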

URLs in nested tags inside <a> tags are now excluded as well. Of course, the regex is not perfect, but it should work with almost any simple HTML markup. You may just need to be a little careful with:

  • quotation marks inside <a> elements;
  • <a> tags without any attributes (placeholders), for which this approach should not be used;
  • multiple nested tags or lines, which you may need to avoid if the URL inside the <a> tag is not followed by a double quote.
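The first two caveats can be reproduced with a short sketch. The findUrls helper and the sample strings are hypothetical and exist only for this illustration.

```javascript
const urlOutsideAnchors = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical helper: returns every URL the regex accepts.
const findUrls = (html) => [...html.matchAll(urlOutsideAnchors)].map((m) => m[1]);

// An attribute-less <a> later in the text: no " is met before </a,
// so a URL that is outside the link gets rejected by mistake.
const missed = findUrls('http://before.example/ <a>placeholder</a>');
console.log(missed); // → []

// The same text with a quoted attribute behaves as intended.
const kept = findUrls('http://before.example/ <a href="#top">top</a>');
console.log(kept); // → [ 'http://before.example/' ]

// A quotation mark in the link text stops the scan before </a,
// so a URL inside the link is accepted by mistake.
const leaked = findUrls('<a href="#x">http://inside.example/ said "hi"</a>');
console.log(leaked); // → [ 'http://inside.example/' ]
```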


Here is a nice, dirty example (the last match should not be found, but it is):

https://regex101.com/r/pC0jR7/2

Too bad this lookahead doesn't work: (?!<a.*?<\/a>)
