How to find all links / pages on a website

Question

Is it possible to find all the pages and links on ANY given website? I would like to enter a URL and produce a directory tree of all the links from that site.

I looked at HTTrack, but it downloads the whole site, and I just need the directory tree.

+67

directory web-crawler

Jonathan Lyon Sep 17 '09 at 14:43

5 answers

Or you can use Google to display all the pages that it has indexed for this domain. For example: site:www.bbc.co.uk

+24

John Magnolia Mar 23 2018-12-12T00:

If your browser has a developer (JavaScript) console, you can enter this code:

 for (const a of document.querySelectorAll('a')) console.log(a.href);

Shortened:

 for(a of $$('a'))console.log(a.href)
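
The console snippets above print every href on the current page, including duplicates and links to other sites. As a rough sketch of how that list could be narrowed down to pages of the same site, something like the following may help. The function name `sameSiteLinks` and its parameters are made up for illustration; it relies only on the standard `URL` class.

```javascript
// Hypothetical helper: keep unique links that point at the same origin.
function sameSiteLinks(hrefs, origin) {
  const seen = new Set();
  for (const href of hrefs) {
    if (!href) continue; // anchors without an href
    let url;
    try {
      url = new URL(href, origin); // resolves relative links against origin
    } catch {
      continue; // skip anything that cannot be parsed as a URL
    }
    if (url.origin !== origin) continue; // external, mailto:, javascript: ...
    url.hash = ''; // "#section" variants point at the same page
    seen.add(url.href);
  }
  return [...seen].sort();
}
```

In the browser console it could be called as `sameSiteLinks([...document.querySelectorAll('a')].map(a => a.href), location.origin)`. This only covers the page currently open; to cover a whole site, each returned page would have to be visited in turn.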

+22

ElectroBit Jan 05 '15 at 22:03

If this is a programming question, I would suggest writing your own regular expression to parse the retrieved content. The target tags are IMG and A for standard HTML. In Java,

 final String openingTags = "(<a [^>]*href=['\"]?|<img [^>]*src=['\"]?)";

This, together with the Pattern and Matcher classes, should detect the start of the tags. Add the LINK tag if you also want stylesheets (CSS).

However, it is not as easy as you might think. Many web pages are not well formed. Programmatically extracting every link that a human can "recognize" is really difficult if you need to account for all the irregular markup.
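
To make the regex idea a bit more concrete, here is a minimal sketch in JavaScript rather than Java, so it is an adaptation and not the code from this answer. The pattern, the constant `LINK_PATTERN` and the function `extractLinks` are all assumptions for illustration. It only handles quoted attribute values and will miss or misread irregular markup, which is exactly the limitation described above.

```javascript
// Matches <a>, <img> and <link> tags and captures a quoted href/src value.
// Assumption: attribute values are wrapped in single or double quotes.
const LINK_PATTERN = /<(?:a|img|link)\b[^>]*?\b(?:href|src)\s*=\s*(["'])(.*?)\1/gi;

function extractLinks(html, baseUrl) {
  const links = [];
  for (const match of html.matchAll(LINK_PATTERN)) {
    try {
      // match[2] is the attribute value; resolve relative links against baseUrl
      links.push(new URL(match[2], baseUrl).href);
    } catch {
      // ignore values that cannot be parsed as a URL
    }
  }
  return links;
}
```

For anything beyond quick scripts, a real HTML parser is usually the more robust choice.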

Good luck

+1

mizubasho Sep 17 '09 at 15:17

 function getalllinks($url) {
     $links = array();
     // Fetch the page; return an empty list if it cannot be read.
     $content = @file_get_contents($url);
     if ($content === false) return $links;
     $startPos = 0;
     // Walk over every "<a " tag and pull out its double-quoted href value.
     while (($spos = strpos($content, '<a ', $startPos)) !== false) {
         $spos = strpos($content, 'href', $spos);
         if ($spos === false) break;
         $spos = strpos($content, '"', $spos);
         if ($spos === false) break;
         $spos++;
         $epos = strpos($content, '"', $spos);
         if ($epos === false) break;
         $link = substr($content, $spos, $epos - $spos);
         // Keep only links that contain "http://".
         if (strpos($link, 'http://') !== false) $links[] = $link;
         $startPos = $epos;
     }
     return $links;
 }

Try this code.

-1

user4318981 Dec 03 '14 at 7:42

Hank Gay · Accepted Answer · 2009-09-17 14:51

Check out linkchecker. It will crawl the site (while obeying robots.txt) and generate a report. From there, you can script a solution to create the directory tree.
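
As a rough sketch of that last scripting step, assuming the URLs have already been pulled out of the report into a plain array (the report format itself is not covered here), a directory tree could be built along these lines. `buildTree` and `renderTree` are hypothetical names, not part of linkchecker.

```javascript
// Build a nested Map keyed by host and then by each path segment.
function buildTree(urls) {
  const root = new Map();
  for (const href of urls) {
    const url = new URL(href); // expects absolute URLs; query strings are ignored
    let node = root;
    const segments = [url.host, ...url.pathname.split('/').filter(Boolean)];
    for (const segment of segments) {
      if (!node.has(segment)) node.set(segment, new Map());
      node = node.get(segment);
    }
  }
  return root;
}

// Render the tree as an indented listing, one segment per line, sorted by name.
function renderTree(node, depth = 0) {
  let out = '';
  for (const name of [...node.keys()].sort()) {
    out += '  '.repeat(depth) + name + '\n' + renderTree(node.get(name), depth + 1);
  }
  return out;
}
```

Calling `console.log(renderTree(buildTree(urls)))` prints each host with its directories and pages nested underneath it.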
