Regex for finding external links from html file using grep

Question

For the last few days, I have been trying to write a regular expression that extracts all external links from the web pages given to it, using grep.

Here is my grep command:

grep -h -o -e "\(\(mailto:\|\(\(ht\|f\)tp\(s\?\)\)\)\://\)\{1\}\(.*\?\)" "/mnt/websites_folder/folder_to_search" -r

However, grep returns everything that follows the first external link on the line.

Example

Suppose the HTML file contains something like this on a single line:

 <a href="http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>

Then the grep command returns the following result:

 http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>

The idea is that if the HTML file contains more than one link on the same line (regardless of whether it appears in an a, img, or other tag), the regular expression should extract only the links, not the rest of the line.

I managed to get this working on rubular.com with the following regex:

 ("|')(\b((ht|f)tps?:\/\/)(.*?)\b)("|')

This works with the above input, but I could not reproduce it in grep. Can someone help? I can't change the HTML file, so please don't ask me to do that. I also can't search for each specific tag and check its attributes to find external links, since processing time is at a premium and my application does not require it.

Thanks

linux regex grep

Amar Jun 09 '10 at 12:28

2 answers

By default, grep prints the entire line in which a match was found. The -o switch prints only the matched parts of the line. See the man page.
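As a rough illustration of the difference, the sketch below runs grep on a sample line modeled on the question's example. The variable names (line, greedy, links) and the `https?://` patterns are assumptions made for this demo, not part of the original answer. It also shows that -o alone is not enough when the pattern ends in a greedy `.*`, because the single match then runs to the end of the line.

```shell
# Sample line modeled on the question's example: two links on one line.
line="<a href=\"http://www.google.com\">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>"

# Without -o, grep prints the whole line that contains a match.
printf '%s\n' "$line" | grep -E "https?://"

# With -o but a greedy '.*', the one match extends to the end of the line.
greedy=$(printf '%s\n' "$line" | grep -o -E "https?://.*")
printf '%s\n' "$greedy"

# With -o and a negated character class, each match stops at the closing quote,
# so every link is printed on its own line.
links=$(printf '%s\n' "$line" | grep -o -E "https?://[^'\"]+")
printf '%s\n' "$links"
```

The last command prints the two URLs on separate lines, which is the behavior the question asks for.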

wds Jun 09 '10 at 12:38

hudolejev · Accepted Answer · 2010-06-09T12:34:34+0000

Try the following:

~~cat /path/to/file | egrep -o "(mailto|ftp|http(s)?://){1}[^'\"]+"~~

 egrep -o "(mailto|ftp|http(s)?://){1}[^'\"]+" /path/to/file

This prints one link per line and assumes that each link is enclosed in single or double quotes. To exclude links to a specific domain, use -v:

 egrep -o "(mailto|ftp|http(s)?://){1}[^'\"]+" /path/to/file | egrep -v "yahoo.com"
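One detail worth noting: in the pattern above, the alternation separates `mailto`, `ftp`, and `http(s)?://`, so only the http branch requires `://`, and a bare word such as "ftp" in running text can start a match. A possible tighter variant, offered only as a sketch, groups the schemes so that ftp, http, and https all require `://` and mailto requires a colon. The helper name extract_links, the temporary file, and the sample HTML are hypothetical and used only for illustration. `grep -E` is used as the equivalent of egrep.

```shell
# Hypothetical helper: print one link per line from the given file(s).
# -h suppresses file names, -o prints only the matched text.
extract_links() {
    grep -h -o -E "(mailto:|(ftp|https?)://)[^'\"]+" "$@"
}

# Sample input (assumed for this demo): two links on one line, plus a line
# where "ftp" appears as a plain word next to a mailto link.
tmp=$(mktemp)
printf '%s\n' \
    "<a href=\"http://www.google.com\">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>" \
    "<p>Download the software ftp client</p> <a href='mailto:someone@example.com'>Mail</a>" \
    > "$tmp"

# All links, one per line.
all_links=$(extract_links "$tmp")
printf '%s\n' "$all_links"

# Same list with one domain excluded; -F treats "yahoo.com" as a fixed string,
# so the dot is not interpreted as "any character".
filtered=$(extract_links "$tmp" | grep -v -F "yahoo.com")
printf '%s\n' "$filtered"

rm -f "$tmp"
```

Like the original command, this still assumes that every link is wrapped in single or double quotes.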
