Regex for finding external links in HTML files using grep

For the last few days, I have been trying to write a regular expression that, used with grep, extracts all external links from the web pages passed to it.

Here is my grep command:

grep -h -o -e "\(\(mailto:\|\(\(ht\|f\)tp\(s\?\)\)\)\://\)\{1\}\(.*\?\)" "/mnt/websites_folder/folder_to_search" -r 

Now grep seems to return everything that follows an external link on the same line.

Example

If the HTML file contains something like this on a single line:

 <a href="http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>

then this grep command returns the following result:

 http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p> 

The idea is that if the HTML file contains more than one link (whether in a, img, or any other tag) on the same line, the regular expression should extract only the links, not the entire rest of the line.

I managed to get this working on rubular.com; the regex is as follows:

 ("|')(\b((ht|f)tps?:\/\/)(.*?)\b)("|') 

It works with the input above, but I could not reproduce it in grep. Can someone help? I can't change the HTML file, so please don't suggest that. I also can't look for each specific tag and check its attributes to get the external links, because that adds processing time and my application does not require it.

thanks

2 answers

Try the following:

 cat /path/to/file | egrep -o "(mailto|ftp|http(s)?://){1}[^'\"]+"

 egrep -o "(mailto|ftp|http(s)?://){1}[^'\"]+" /path/to/file 

This prints one link per line. It assumes that each link is enclosed in single or double quotes. To exclude links to a specific domain, pipe the output through a second egrep with -v:

 egrep -o "(mailto|ftp|http(s)?://){1}[^'\"]+" /path/to/file | egrep -v "yahoo.com" 
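
To check this against the sample line from the question, here is a minimal sketch. The temporary file and the variable names are made up for illustration, and grep -E is used as the equivalent of egrep. It also regroups the pattern slightly, which is my own variation rather than part of the answer above: in the original, "://" belongs only to the http(s) branch of the alternation, so a bare "ftp" or "mailto" anywhere in the text can start a match.

```shell
# Sample line from the question, written to a temporary file (hypothetical path).
sample=$(mktemp)
printf '%s\n' "<a href=\"http://www.google.com\">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>" > "$sample"

# Match a scheme, then everything up to (not including) the next quote character.
links=$(grep -E -o "(mailto:|(ftp|https?)://)[^'\"]+" "$sample")
printf '%s\n' "$links"

# Drop links that mention a given domain; -F makes the dot match literally.
filtered=$(printf '%s\n' "$links" | grep -v -F "yahoo.com")
printf '%s\n' "$filtered"

rm -f "$sample"
```

The first printf shows http://www.google.com and https://yahoo.com on separate lines; after filtering, only http://www.google.com remains.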

By default, grep prints the entire line in which a match was found. The -o switch prints only the matched parts of the line. See the man page.
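
Note that the command in the question already passes -o, so the likely remaining cause is the ".*" part: grep's basic and extended regular expressions have no lazy quantifier, so ".*" keeps matching to the end of the line and -o prints that whole match. A rough sketch of the difference, using the sample line from the question (the variable names are only illustrative):

```shell
line="<a href=\"http://www.google.com\">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>"

# Greedy: .* runs to the end of the line, so -o prints one long match.
greedy=$(printf '%s\n' "$line" | grep -o -E "https?://.*")
printf '%s\n' "$greedy"

# Negated class: the match stops at the first quote, so each link is its own match.
bounded=$(printf '%s\n' "$line" | grep -o -E "https?://[^'\"]*")
printf '%s\n' "$bounded"
```

The first command reproduces the output shown in the question; the second prints the two links on separate lines.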


Source: https://habr.com/ru/post/1312322/

