How to find all links / pages on a website

Is it possible to find all the pages and links on ANY given website? I would like to enter a URL and produce a directory tree of all the links from that site.

I looked at HTTrack, but it downloads the whole site, and I just need the directory tree.

+67
directory web-crawler
Sep 17 '09 at 14:43
5 answers

Check out linkchecker - it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script a solution that builds the directory tree.
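As a rough sketch of that scripting step, the snippet below assumes the crawl report has already been reduced to a plain list of URLs (the report format itself is not covered here). The names buildTree, renderTree and sampleUrls are hypothetical, chosen only for illustration.

```javascript
// Build a nested directory tree from a flat list of URLs.
// Each host name and path segment becomes a key; leaves are empty objects.
function buildTree(urls) {
  const tree = Object.create(null);
  for (const raw of urls) {
    const { hostname, pathname } = new URL(raw);
    // Root the tree at the host name, then descend one path segment at a time.
    const segments = [hostname, ...pathname.split('/').filter(Boolean)];
    let node = tree;
    for (const segment of segments) {
      if (!(segment in node)) node[segment] = Object.create(null);
      node = node[segment];
    }
  }
  return tree;
}

// Render the tree as indented text, one entry per line, sorted by name.
function renderTree(node, depth = 0) {
  let out = '';
  for (const name of Object.keys(node).sort()) {
    out += '  '.repeat(depth) + name + '\n' + renderTree(node[name], depth + 1);
  }
  return out;
}

// Hypothetical sample input standing in for the crawler's output.
const sampleUrls = [
  'http://example.com/',
  'http://example.com/about/team.html',
  'http://example.com/about/contact.html',
  'http://example.com/blog/2009/post.html',
];
console.log(renderTree(buildTree(sampleUrls)));
```

Query strings and fragments are ignored in this sketch; only the host and path decide where a URL lands in the tree.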

+56
Sep 17 '09 at 14:51

Or you can use Google to display all the pages that it has indexed for this domain. For example: site:www.bbc.co.uk

+24
Mar 23 2018-12-12T00:

If your browser has a developer console (JavaScript), you can enter this code into it to list the links on the current page:

 document.querySelectorAll('a').forEach(function (a) { console.log(a.href); });

Shortened (using the console's $$ helper):

 $$('a').forEach(a=>console.log(a.href))
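The console output usually contains duplicates, off-site links and mailto: entries. A minimal sketch of a clean-up step follows; sameSiteLinks is a hypothetical helper name, and the filtering rules (same origin only, fragments dropped) are assumptions about what a site map needs.

```javascript
// Keep only links on the same origin as `pageUrl`, drop #fragments, and de-duplicate.
function sameSiteLinks(hrefs, pageUrl) {
  const origin = new URL(pageUrl).origin;
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, pageUrl); // resolves relative links against the page
    } catch (e) {
      continue; // skip anything that cannot be parsed as a URL
    }
    if (url.origin !== origin) continue; // off-site, mailto:, javascript:, ...
    url.hash = ''; // "#top" and similar point at the same page
    seen.add(url.href);
  }
  return [...seen].sort();
}

// Assumed usage in the browser console:
//   sameSiteLinks([...document.querySelectorAll('a')].map(a => a.href), location.href)
```

The function takes plain strings rather than DOM nodes, so the same filter can be reused outside the browser.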
+22
Jan 05 '15 at 22:03

If this is a programming question, I would suggest writing your own regular expression to parse the retrieved content. The target tags are IMG and A for standard HTML. For Java,

 final String openingTags = "(<a [^>]*href=['\"]?|<img [^>]*src=['\"]?)"; 

This, together with the Pattern and Matcher classes, should detect the start of those tags. Add the LINK tag if you also want the CSS files.
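For comparison, here is a rough JavaScript sketch of the same regular-expression idea, where matchAll plays the role that Pattern and Matcher play in Java. The names LINK_PATTERN and extractLinks are hypothetical, and the pattern is an illustration that covers A, LINK and IMG tags only; it is not a full HTML parser.

```javascript
// href of <a>/<link> tags, or src of <img> tags.
// The value may be double-quoted, single-quoted, or unquoted.
const LINK_PATTERN = new RegExp(
  '<(?:a|link)(?=\\s)[^>]*?\\shref\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\'|([^\\s>]+))' +
  '|<img(?=\\s)[^>]*?\\ssrc\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\'|([^\\s>]+))',
  'gi'
);

// Return the raw attribute values in document order.
function extractLinks(html) {
  const links = [];
  for (const match of html.matchAll(LINK_PATTERN)) {
    // Exactly one capture group is set for each match.
    links.push(match.slice(1).find(value => value !== undefined));
  }
  return links;
}

// Hypothetical sample markup.
const sampleHtml = '<a href="/one">1</a> <img alt="" src=pic.png>';
console.log(extractLinks(sampleHtml));
```

The values come back exactly as written in the markup, so relative links still need to be resolved against the page URL before they are followed.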

However, it is not as easy as you might think. Many web pages are not well formed. Programmatically retrieving every link that a human would "recognize" is really difficult if you have to account for all the irregular expressions.

Good luck

+1
Sep 17 '09 at 15:17
Try this code:

 function getalllinks($url) {
     $links = array();
     // Fetch the page; return an empty list if it cannot be read.
     $content = @file_get_contents($url);
     if ($content === false) {
         return $links;
     }
     $startPos = 0;
     // Walk through every '<a ' tag and take the double-quoted href value.
     while (($spos = strpos($content, '<a ', $startPos)) !== false) {
         $spos = strpos($content, 'href', $spos);
         if ($spos === false) break;
         $spos = strpos($content, '"', $spos);
         if ($spos === false) break;
         $spos++;
         $epos = strpos($content, '"', $spos);
         if ($epos === false) break;
         $link = substr($content, $spos, $epos - $spos);
         // Keep only links that contain http://, as in the original version.
         if (strpos($link, 'http://') !== false) {
             $links[] = $link;
         }
         $startPos = $epos;
     }
     return $links;
 }
-1
Dec 03 '14 at 7:42