Making a Web Crawler / Spider

I'm looking to create a web crawler / spider, but I need someone to point me in the right direction to get started.

Basically, my spider is going to search for sound files and index them.

I'm just wondering if anyone has any ideas on how I should do this. I've heard that doing this in PHP is very slow. I know VB.NET, so could that come in handy?

I was thinking about using Google's filetype search to get links to crawl. Would that be okay?

+5
3 answers

In VB.NET, you first need to fetch the HTML, so use the WebClient class or the HttpWebRequest and HttpWebResponse classes. There is plenty of information online about how to use them.

HTML. .

Google . PDF PDF SharePoint, .

+2

The pseudocode should look like this:

Method spider(URL startURL) {
    Collection URLStore;             // can be an ArrayList
    Set visited;                     // URLs that have already been followed
    push(startURL, URLStore);        // start with a known URL
    while URLStore is not empty do {
        currURL = pop(URLStore);     // take a URL
        page = download(currURL);    // download the page at that URL
        for each link URLx in page that is not in visited do {
            add(URLx, visited);
            push(URLx, URLStore);    // queue the new link for crawling
        }
    }
}
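
As a rough illustration, the pseudocode above could be written in Java along these lines. This is only a sketch under some assumptions: the class name CrawlSketch, the linksOf callback (a stand-in for "download the page and return its links") and the maxPages limit are all made up for the example, and the "web" here is an in-memory map so that it runs without network access.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class CrawlSketch {
    // Visits pages reachable from startUrl, stopping after maxPages pages.
    // linksOf is a hypothetical callback: given a URL, it returns the links found on that page.
    static Set<String> crawl(String startUrl,
                             Function<String, List<String>> linksOf,
                             int maxPages) {
        Deque<String> urlStore = new ArrayDeque<>();   // URLs waiting to be crawled
        Set<String> seen = new LinkedHashSet<>();      // URLs already queued, so none is queued twice
        Set<String> visited = new LinkedHashSet<>();   // URLs actually crawled, in crawl order

        urlStore.push(startUrl);                       // start with a known URL
        seen.add(startUrl);

        while (!urlStore.isEmpty() && visited.size() < maxPages) {
            String currUrl = urlStore.pop();           // take a URL
            visited.add(currUrl);
            for (String link : linksOf.apply(currUrl)) {
                if (seen.add(link)) {                  // add() returns true only for new links
                    urlStore.push(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory stand-in for real pages (assumption, for demonstration only).
        Map<String, List<String>> fakeWeb = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "c"),
            "c", List.of());
        System.out.println(crawl("a", url -> fakeWeb.getOrDefault(url, List.of()), 10));
    }
}
```

In a real spider, linksOf would download the page and extract its hyperlinks. The maxPages limit is there because a crawl of the open web would otherwise never finish.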

To read some data from a web page in Java, you can:

// Requires java.io.BufferedReader, java.io.InputStreamReader and java.net.URL
URL myURL = new URL("http://www.w3.org");
try (BufferedReader in = new BufferedReader(new InputStreamReader(myURL.openStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) { // reads the whole page, line by line
        System.out.println(inputLine);            // here you need to extract the hyperlinks
    }
}
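
The code above leaves the hyperlink extraction open. One possible way to do it is sketched below with a regular expression, which is a rough heuristic and not a real HTML parser. The class name LinkExtractorSketch, the helper names and the list of sound file extensions are assumptions made for this example; the example.com URLs are placeholders.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractorSketch {
    // Matches href="..." or href='...' and captures the value between the quotes.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns every href found in html, resolved against baseUrl so relative links become absolute.
    static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip href values that are not valid URIs.
            }
        }
        return links;
    }

    // Assumed extension list; a URL with a query string after the extension is not detected.
    static boolean isSoundFile(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return lower.endsWith(".mp3") || lower.endsWith(".wav") || lower.endsWith(".ogg");
    }

    public static void main(String[] args) {
        String html = "<a href=\"song.mp3\">Song</a> <a href='/about.html'>About</a>";
        for (String link : extractLinks("http://example.com/music/index.html", html)) {
            System.out.println(link + (isSoundFile(link) ? "  (sound file)" : ""));
        }
    }
}
```

Links that pass isSoundFile would go into the index, and the rest would be pushed back into the URL store for further crawling. For messy real-world pages a proper HTML parser is likely to be more reliable than a regular expression.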
0