Can sed regex simulate lookbehind and lookahead?

I am trying to write a sed script that finds every bare URL in a text file and replaces it with <a href=[URL]>[URL]</a>. By "bare" I mean a URL that is not already wrapped in an anchor tag.

My initial thought was to match URLs that do not have " or > in front of them and do not have < or " after them. However, I am having difficulty expressing "not preceded by" and "not followed by" because, as far as I know, sed does not support lookahead or lookbehind.

Input Example:

 [Beginning of File]http://foo.bar arbitrary text
 http://test.com other text
 <a href="http://foobar.com">http://foobar.com</a>
 Nearing end of file!!! http://yahoo.com[End of File]

An example of the desired result:

 [Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
 <a href="http://test.com">http://test.com</a> other text
 <a href="http://foobar.com">http://foobar.com</a>
 Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]

Note that the third line is unmodified, because its URL is already inside <a href>. The first and second lines, on the other hand, are both changed. Finally, note that text that is not part of a URL is left untouched.

Ultimately, I'm trying to do something like:

 sed s/[^>"](http:\/\/[^\s]\+)/<a href="\1">\1<\/a>/g 2-7-2013 

I started by verifying that the following would match and remove the URL:

 sed 's/http:\/\/[^\s]\+//g' 

Then I tried this, but it fails to match a URL that starts at the very beginning of the file / input:

 sed 's/[^\>"]http:\/\/[^\s]\+//g' 

Is there a way around this in sed, either by emulating lookbehind / lookahead or by explicitly matching the beginning and the end of the file?

+8
regex regex-negation awk sed regex-lookarounds
2 answers

sed is a great tool for simple substitutions on a single line; for any other text manipulation problem, just use awk.

Check out the definition I'm using in the BEGIN section below for a regular expression that matches URLs. It works for your sample, but I don't know whether it captures all possible URL formats. Even if it doesn't, it may be adequate for your needs.

 $ cat file
 [Beginning of File]http://foo.bar arbitrary text
 http://test.com other text
 <a href="http://foobar.com">http://foobar.com</a>
 Nearing end of file!!! http://yahoo.com[End of File]
 $
 $ awk -f tst.awk file
 [Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
 <a href="http://test.com">http://test.com</a> other text
 <a href="http://foobar.com">http://foobar.com</a>
 Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]
 $
 $ cat tst.awk
 BEGIN{ urlRe="http:[/][/][[:alnum:]._]+" }
 {
     head = ""
     tail = $0
     while ( match(tail,urlRe) ) {
         url = substr(tail,RSTART,RLENGTH)
         href = "href=\"" url "\""

         if (index(tail,href) == (RSTART - 6) ) {
             # this url is inside href="url" so skip processing it and the next url match.
             count = 2
         }

         if (! (count && count--)) {
             url = "<a " href ">" url "</a>"
         }

         head = head substr(tail,1,RSTART-1) url
         tail = substr(tail,RSTART+RLENGTH)
     }

     print head tail
 }
+4

The obvious problem with your command:

 You did not escape the parentheses "(" and ")"

This is a quirk of sed's regex syntax. Unlike in Perl regular expressions, many characters are literal by default in sed, and you must escape them to give them their special "function". Try:

 s/\([^>"]\?\)\(http:\/\/[^\s]\+\)/\1<a href="\2">\2<\/a>/g 
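
As for the start-of-input case, one way to stand in for the missing lookbehind is an alternation that accepts either the start of the line or a character other than > or ". The following is only a minimal sketch under some assumptions: it relies on a sed that supports -E (extended regular expressions, where parentheses need no escaping), the function name linkify is made up for illustration, and the URL pattern is borrowed from the awk answer above, so it only covers host names and not paths or query strings.

```shell
# Hypothetical helper (the name is illustrative): wrap bare http:// URLs in anchor tags.
# (^|[^>"]) stands in for a lookbehind: the URL must either sit at the start
# of the line or follow a character other than > or ".
# Group 1 keeps that leading character so it can be put back unchanged.
# The URL pattern is the narrow one from the awk answer (host names only).
linkify() {
    sed -E 's#(^|[^>"])(http://[[:alnum:]._]+)#\1<a href="\2">\2</a>#g'
}

# Usage: a URL at the very start of the input line is wrapped.
printf '%s\n' 'http://test.com other text' | linkify
# prints: <a href="http://test.com">http://test.com</a> other text
```

Because the leading character is consumed as part of the match, this is not a true lookbehind; it happens to be enough for the sample input, where already-linked URLs are always directly preceded by " or >.
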
+1