How to crawl my site to detect 404/500 errors?

Is there a quick (possibly multi-threaded) way to crawl my site (follow all local links) to check for 404/500 errors, i.e. to make sure every page returns a 200 response?

I would also like to be able to limit it to one visit per link type. So if I have 1000 category pages, it only clicks into one of them.

Is http://code.google.com/p/crawler4j/ a good option?

I would like something that is very easy to configure, and I would prefer PHP over Java (although if Java is significantly faster, that would be fine).

3 answers

You can use the old but stable Xenu's Link Sleuth tool to crawl your site.

You can configure it to use 100 threads and sort the results by status code (500 / 404 / 200 / 403).


You can implement this quite easily with any number of open source Python projects:

  • Mechanize seems pretty popular
  • Beautiful Soup and urllib

Crawl the site with one of those and check the server response for each page, which should be fairly straightforward.
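
As a rough illustration of that idea, here is a minimal single-host crawler sketch using urllib and Beautiful Soup. It assumes the bs4 package is installed, and `START_URL` is just a placeholder for your own site:

```python
# Minimal sketch: breadth-first crawl of one host, reporting any page
# that does not come back with HTTP 200. Assumes bs4 is installed.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

from bs4 import BeautifulSoup

START_URL = "http://www.example.com/"  # placeholder: replace with your site
host = urlparse(START_URL).netloc

seen = {START_URL}
queue = deque([START_URL])

while queue:
    url = queue.popleft()
    try:
        response = urlopen(url, timeout=10)
        status = response.getcode()
        html = response.read()
    except HTTPError as e:          # urllib raises HTTPError for 4xx/5xx
        print(e.code, url)
        continue
    except URLError as e:
        print("ERR", url, e.reason)
        continue

    if status != 200:
        print(status, url)

    # Extract links and queue only those on the same host.
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == host and link not in seen:
            seen.add(link)
            queue.append(link)
```

To get the "only one per link type" behaviour from the question, you could normalize URLs (for example, strip numeric IDs or query strings) before adding them to `seen`, so that one representative of each pattern is visited.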

However, if you have a sitemap (or any list of all your URLs), you can simply try to open each one with cURL or urllib and check the response code without having to crawl at all.
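
A small sketch of that approach, assuming a standard sitemap.xml (the `SITEMAP_URL` value is a placeholder):

```python
# Sketch: pull URLs from sitemap.xml and check each one's status code,
# no crawling required.
import xml.etree.ElementTree as ET
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

SITEMAP_URL = "http://www.example.com/sitemap.xml"  # placeholder

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
tree = ET.parse(urlopen(SITEMAP_URL))
urls = [loc.text for loc in tree.findall(".//sm:loc", ns)]

for url in urls:
    try:
        code = urlopen(url, timeout=10).getcode()
    except HTTPError as e:
        code = e.code
    except URLError:
        code = None
    if code != 200:
        print(code, url)
```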


Define "fast"? How big is your site? cURL would be a good start: http://curl.haxx.se/docs/manual.html

Unless you have a really huge site and need to check it on a time scale of seconds, just put the URLs in a list and try each one.
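
If you do want it faster, a thread pool over that list is enough. A minimal sketch using only the standard library (the URL list shown is just an illustration):

```python
# Sketch: check a flat list of URLs in parallel threads and print
# anything that is not a 200.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

urls = [
    "http://www.example.com/",
    "http://www.example.com/category/1",
    # ... the rest of your URL list
]

def status(url):
    try:
        return url, urlopen(url, timeout=10).getcode()
    except HTTPError as e:
        return url, e.code
    except URLError:
        return url, None

with ThreadPoolExecutor(max_workers=20) as pool:
    for url, code in pool.map(status, urls):
        if code != 200:
            print(code, url)
```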

