Wrap links in an <a> tag with a regular expression
I need to wrap every link in a text in an "a" tag using a regular expression in PHP, except for links that are already wrapped.
So I have the text:
Some text with html herehttp://www.somelink.htmlhttp://www.somelink.com/view/?id=95<a href="http://anotherlink.html">http://anotherlink.html</a><a href="http://anotherlink.html">Title</a>
What I need to get:
Some text with html here<a href="http://www.somelink.html">http://www.somelink.html</a>
<a href="http://www.somelink.com/view/?id=95">http://www.somelink.com/view/?id=95</a><a href="http://anotherlink.html">http://anotherlink.html</a><a href="http://anotherlink.html">Title</a>
I can match the links with this expression:
(?:(?:https?|ftp):\/\/|www.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]
but it also matches links that are already inside tags.
For reliability, I would split the content on <a> elements (including their child content) and on all other tags (excluding their child content), then look for URLs only in the remaining text, for example:
$bits = preg_split('/(<a(?:\s+[^>]*)?>.*?<\/a>|<[a-z][^>]*>)/is', $content, -1, PREG_SPLIT_DELIM_CAPTURE);
$reconstructed = '';
foreach ($bits as $bit) {
    // Bits starting with "<" are whole <a> elements or other tags; only plain text is checked for URLs.
    if (strpos($bit, '<') !== 0) {
        $bit = link_urls($bit);
    }
    $reconstructed .= $bit;
}
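The snippet above calls link_urls() without defining it. A minimal sketch of such a helper, assuming it simply applies the URL pattern from the question through preg_replace_callback, could look like this. The function body and the rule of prefixing http:// to bare "www." links are assumptions, not part of the original answer.

```php
<?php
// Hypothetical implementation of link_urls(); the answer above only calls it.
function link_urls(string $text): string
{
    // URL pattern from the question, with the dot after "www" escaped.
    $pattern = '/(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i';

    return preg_replace_callback($pattern, function (array $m): string {
        $url = $m[0];
        // Assumption: a bare "www." link gets an http:// scheme so the href is absolute.
        $href = stripos($url, 'www.') === 0 ? 'http://' . $url : $url;
        return '<a href="' . $href . '">' . $url . '</a>';
    }, $text);
}

echo link_urls('See http://www.somelink.com/view/?id=95 and www.example.com.'), "\n";
```

Because the helper only ever receives text that lies outside tags, it does not need to know anything about existing anchors.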
You could use a negative lookbehind. Syntax:
(?<!text)
So in your case it will be:
(?<!\<a)
Or something close to the above.
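As a rough illustration of the lookbehind idea: a lookbehind in PCRE can only inspect a bounded number of characters, so it cannot skip back over a whole `<a ...>` tag. The sketch below instead rejects a URL whose preceding character is a quote, a `>` or a `/`. The character class and the helper name link_urls_lookbehind are assumptions for illustration, not the answer's exact pattern.

```php
<?php
// Hypothetical helper; the lookbehind class [">\/] is a heuristic, not a full HTML parser.
function link_urls_lookbehind(string $html): string
{
    // (?<![">\/]) skips a URL that directly follows:
    //   "  (an attribute value such as href="...")
    //   >  (text placed right after a tag, e.g. <a ...>http://...)
    //   /  (so "www." inside "http://www..." is not matched on its own)
    $pattern = '/(?<![">\/])(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i';

    // $0 is the whole match, i.e. the URL itself.
    return preg_replace($pattern, '<a href="$0">$0</a>', $html);
}

echo link_urls_lookbehind(
    'Go to http://www.somelink.com/view/?id=95 or <a href="http://anotherlink.html">http://anotherlink.html</a>'
), "\n";
```

This heuristic has gaps: a URL that follows whitespace inside an existing anchor is still wrapped, and a URL placed directly after any other tag (for example `<p>http://...`) is skipped.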
An example (in Perl):
use strict;
use warnings;
my $html = '
http://Top.html
Some text with more html here
<a href="http://www.somelink.html">
http://www.somelink.html
</a>
<a href="http://www.somelink.com/view/?id=2495">
http://www.somelink.com/view/?id=95
</a>
<a href="http://anotherlink.html">
http://anotherlink.html
</a>
http://andone.html
http://andtwo.html
<a href="http://anthisisotherlink.html"><mn>
Title
http://this <br>
<b href="http://erlink.html">
asdf
</a>
';
{
no warnings;
$html =~
# Regex (global replace) ..
s{(?is)
(< (?:DOCTYPE.*?|--.*?--)
| script\s[^>]*>.*?</script\s*
| style\s[^>]*>.*?</style\s*
| a\s[^>]*>.*?</a\s*
| (?:/?\w+\s*/?|(?:\w+\s+(?:".*?"|'.*?'|[^>]*?)+\s*/?))
>
)
| ( (?:
(?!(?:(?:https?|ftp)://|www.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|])
[^<]
)*?
)
| ( (?:(?:https?|ftp)://|www.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|] )
}
# Replacement (would be a callback function in php) ..
{
defined $3 ? "<a href=\"$3\">$3</a>" : "$1$2"
}xeg;
}
print $html,"\n";
Output:
<a href="http://Top.html">http://Top.html</a>
Some text with more html here
<a href="http://www.somelink.html">
<a href="http://www.somelink.html">http://www.somelink.html</a>
</a>
<a href="http://www.somelink.com/view/?id=2495">
http://www.somelink.com/view/?id=95
</a>
<a href="http://anotherlink.html">
http://anotherlink.html
</a>
<a href="http://andone.html">http://andone.html</a>
<a href="http://andtwo.html">http://andtwo.html</a>
<a href="http://anthisisotherlink.html"><mn>
Title
http://this <br>
<b href="http://erlink.html">
asdf
</a>
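The Perl comment notes that the replacement would be a callback function in PHP. A rough port along those lines might look like the sketch below. The name wrap_bare_urls is hypothetical, and the tag alternatives are simplified compared with the Perl pattern, so treat it as an illustration of the technique rather than an exact translation.

```php
<?php
// Hypothetical PHP port of the Perl approach: match tags and URLs in one pass,
// return tags unchanged and wrap only URLs found outside them.
function wrap_bare_urls(string $html): string
{
    $pattern = '~
        (                                     # group 1: markup to leave untouched
            <a\b[^>]*>.*?</a\s*>              #   an anchor element with its content
          | <script\b[^>]*>.*?</script\s*>    #   a script element with its content
          | <style\b[^>]*>.*?</style\s*>      #   a style element with its content
          | <!--.*?-->                        #   a comment
          | </?[a-z][^>]*>                    #   any other tag
        )
      | (                                     # group 2: a bare URL
            (?:(?:https?|ftp)://|www\.)
            [-a-z0-9+&@\#/%?=\~_|!:,.;]*
            [-a-z0-9+&@\#/%=\~_|]
        )
    ~isx';

    return preg_replace_callback($pattern, function (array $m): string {
        // Group 2 is only present when the URL alternative matched.
        if (isset($m[2]) && $m[2] !== '') {
            return '<a href="' . $m[2] . '">' . $m[2] . '</a>';
        }
        return $m[0]; // a tag or a whole anchor: keep as is
    }, $html);
}

$html = "http://Top.html\n<a href=\"http://www.somelink.html\">\nhttp://www.somelink.html\n</a>\nhttp://andone.html <br>";
echo wrap_bare_urls($html), "\n";
```

Listing the anchor alternative before the URL alternative is what protects existing links: the regex engine consumes the whole `<a>...</a>` element first, so the URL pattern never gets a chance to match inside it.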