For the last few days I have been trying to write a regular expression that extracts all external links from the web pages given to it, using grep.
Here is my grep command:
grep -h -o -e "\(\(mailto:\|\(\(ht\|f\)tp\(s\?\)\)\)\://\)\{1\}\(.*\?\)" "/mnt/websites_folder/folder_to_search" -r
However, grep returns everything that follows the first external link on the line.
Example:
If the HTML file contains something like this on a single line:
<a href="http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>
then the grep command returns the following result:
http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>
The idea is that if the HTML file contains more than one link on the same line (whether in a, img, or any other tag), the regular expression should extract only the links, not the rest of the line.
I managed to get this working on rubular.com; the regex is as follows:
("|')(\b((ht|f)tps?:\/\/)(.*?)\b)("|')
It works with the above input, but I couldn't reproduce it in grep. Can someone help? I can't change the HTML files, so please don't suggest that. I also can't search for each specific tag and check its attributes to find external links, since processing time is at a premium and my application doesn't require it.
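For reference, here is a minimal sketch of one direction that might work. POSIX grep has no lazy quantifier, so .*\? in a basic regex still matches greedily to the end of the line. A negated character class that stops at the first quote, angle bracket, or space can stand in for the lazy (.*?) from the rubular version. The helper name extract_links is hypothetical, and the sketch assumes URLs never contain those delimiter characters.

```shell
# Sample line from the example above: two links on one line.
line="<a href=\"http://www.google.com\">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>"

# Hypothetical helper: reads HTML on stdin and prints one URL per line.
# -o prints only the matched text; -E enables extended regex syntax.
# [^\"'<> ]+ consumes characters up to the first quote, angle bracket,
# or space, so each match ends where the URL ends.
extract_links() {
  grep -o -E "(ht|f)tps?://[^\"'<> ]+"
}

printf '%s\n' "$line" | extract_links
```

The same pattern should be usable with the recursive options from the original command (-r and -h) against the folder being searched, though that has not been checked against real pages here.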
Thanks.