Check broken links

I am trying to find all broken links on a webpage using Java. Here is the code:

    private static boolean isLive(String link) {
        HttpURLConnection urlconn = null;
        int res = -1;
        String msg = null;
        try {
            URL url = new URL(link);
            urlconn = (HttpURLConnection) url.openConnection();
            urlconn.setConnectTimeout(10000);
            urlconn.setRequestMethod("GET");
            urlconn.connect();
            String redirlink = urlconn.getHeaderField("Location");
            System.out.println(urlconn.getHeaderFields());
            if (redirlink != null && !url.toExternalForm().equals(redirlink))
                return isLive(redirlink);
            else
                return urlconn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (Exception e) {
            System.out.println(e.getMessage());
            return false;
        } finally {
            if (urlconn != null)
                urlconn.disconnect();
        }
    }

    public static void main(String[] s) {
        String link = "http://www.somefakesite.net";
        System.out.println(isLive(link));
    }

The code is listed at http://nscraps.com/Java/146-program-code-broken-link-checker.htm .

This code reports HTTP status 200 for every web page, including broken ones. For example, http://www.somefakesite.net/ returns the following header fields:

{null=[HTTP/1.1 200 OK], Date=[Sun, 15 May 2011 18:51:29 GMT], Transfer-Encoding=[chunked], Keep-Alive=[timeout=4, max=100], Connection=[Keep-Alive], Content-Type=[text/html], Server=[Apache/2.2.15 (Win32) PHP/5.2.12], X-Powered-By=[PHP/5.2.9-1]}

Given that such sites do not exist, how can I classify them as broken links?

1 answer

Perhaps the problem is that many web servers and DNS providers now detect these "broken" links and redirect you to their own "not found" pages.

Test it against a URL that you know returns a 404 status code (one for which the browser shows its own original error message).
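For that test it helps to look at the raw status of the first response, with automatic redirect following switched off, so that any redirect stays visible instead of being silently followed. A minimal sketch: the class name StatusProbe, the method rawStatus and the example.com URL are illustrative assumptions, not part of the code in the question.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class StatusProbe {

    // Returns the status code of the first response for an http/https link,
    // without following redirects.
    static int rawStatus(String link) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
        conn.setInstanceFollowRedirects(false); // keep 3xx responses visible
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        conn.setRequestMethod("GET");
        try {
            return conn.getResponseCode(); // 404 is returned as a code, not thrown
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; pass a page you know is missing as the first argument.
        String link = args.length > 0 ? args[0] : "http://www.example.com/no-such-page";
        System.out.println(link + " -> " + rawStatus(link));
    }
}
```

If this prints 404 for a page you know is missing, the checker itself is fine and the 200 responses come from whoever answers for the nonexistent host.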


EDIT in response to the author's comment (this is too long to post as a comment): I do not see an easy answer to your problem, but there are several different types of failure:

  • DNS failures with redirection (the DNS server cannot resolve the URL's host and you are redirected to another page). All such redirects will most likely go to the same page (provided by the ISP / DNS provider), and you can check for this. Of course, with a different ISP / DNS provider the page may be different. If you are not redirected, you will get a connection error.
  • A server with a valid DNS entry that is not running (for example, google.com goes down). You should get a connection error.
  • A resource ("page") that is missing from the server. This is more complicated. A 404 means the link is broken, but if the server does not send one, a little more work is needed. A redirect can be used to flag the link as doubtful, but the link should then be checked manually, since redirects are not used only to catch missing links (for example, www.google.com redirects me to www.google.es).
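The three cases above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names LinkClassifier, Verdict, classify and check are hypothetical, and catchAllLocation stands for the "not found" page your ISP / DNS provider redirects to, which you would have to discover yourself (for example by requesting a host name you know does not exist). It handles http/https links only.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class LinkClassifier {

    enum Verdict { OK, BROKEN, DOUBTFUL }

    // Decision rule only, no network access. A code of -1 stands for a
    // connection failure. catchAllLocation may be null if it is unknown.
    static Verdict classify(int code, String location, String catchAllLocation) {
        if (code >= 200 && code < 300) {
            return Verdict.OK;
        }
        if (code >= 300 && code < 400) {
            // A redirect to the provider's known "not found" page counts as broken;
            // any other redirect is only flagged for manual review.
            boolean toCatchAll = location != null && location.equals(catchAllLocation);
            return toCatchAll ? Verdict.BROKEN : Verdict.DOUBTFUL;
        }
        return Verdict.BROKEN; // 404 and other errors, or a connection failure
    }

    // Requests the link without following redirects and classifies the response.
    static Verdict check(String link, String catchAllLocation) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(link).openConnection();
            conn.setInstanceFollowRedirects(false);
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);
            int code = conn.getResponseCode();
            return classify(code, conn.getHeaderField("Location"), catchAllLocation);
        } catch (IOException e) {
            // Unresolvable host, refused connection, timeout, malformed URL.
            return classify(-1, null, catchAllLocation);
        } finally {
            if (conn != null) conn.disconnect();
        }
    }
}
```

Note that this does not cover a provider that answers for a nonexistent host directly with 200 and no redirect, as in the header dump in the question; that case would still need a comparison of the returned page against the provider's known page.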