Running a spider (web crawler) to search for specific content

First of all, I don't know if this is the right place for this question. If not, I'm sorry :)

I'm thinking of writing a spider to crawl the web looking for specific embedded files.

However, I was wondering whether my ISP allows running a spider, since it will make many requests at a fast pace.

Or do I need to add some delay between requests?

I read my provider's contract, but I could not find anything specific about this.

2 answers

You can look at wget; it contains some useful ideas. You should take note of the robots.txt on the sites you want to crawl. And you should leave a delay between requests so as not to create a denial-of-service condition.
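A minimal sketch of that advice in Python, assuming a hypothetical site `example.com`, user-agent string, and delay value (adjust them to your actual target): it checks robots.txt with the standard library's `urllib.robotparser` and sleeps between requests.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical values; replace with the site you actually want to crawl.
BASE_URL = "https://example.com"
USER_AGENT = "MySpider/0.1"
DELAY_SECONDS = 2  # pause between requests so we don't hammer the server

# Fetch and parse the site's robots.txt before crawling anything.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(BASE_URL + "/robots.txt")
rp.read()

def fetch(url):
    """Fetch a single page, but only if robots.txt allows it."""
    if not rp.can_fetch(USER_AGENT, url):
        print("Disallowed by robots.txt:", url)
        return None
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    time.sleep(DELAY_SECONDS)  # be polite: wait before the next request
    return body

if __name__ == "__main__":
    page = fetch(BASE_URL + "/")
    if page:
        print("Fetched", len(page), "bytes")
```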


There is nothing to prevent you from crawling. It is no different from normal user interaction. If you open a page with a large number of images, the browser makes many requests at once.

You may have a transfer limit from your ISP, so just keep track of how much data you download.

What you do have to consider is that crawling a large number of pages can be seen as a DoS attack or may be forbidden by the site operator. Follow their rules: if they require no more than N requests per day from one machine, respect that. Add some delays so that you don't get blocked from the site; a rough sketch of such a scheduler follows below.
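One way to follow a per-site rule like "no more than N requests per day" is to track a counter and a minimum spacing between requests. The limit and delay values below are made-up placeholders; use whatever the site operator actually states.

```python
import time

MAX_REQUESTS_PER_DAY = 1000   # hypothetical limit; use the site operator's figure
MIN_DELAY_SECONDS = 5         # minimum pause between consecutive requests

class PoliteScheduler:
    """Tracks how many requests were made today and enforces a minimum delay."""

    def __init__(self):
        self.count = 0
        self.day = time.strftime("%Y-%m-%d")
        self.last_request = 0.0

    def wait_for_slot(self):
        today = time.strftime("%Y-%m-%d")
        if today != self.day:          # new day: reset the counter
            self.day = today
            self.count = 0
        if self.count >= MAX_REQUESTS_PER_DAY:
            raise RuntimeError("Daily request limit reached; stop until tomorrow")
        # Sleep long enough to keep MIN_DELAY_SECONDS between requests.
        elapsed = time.monotonic() - self.last_request
        if elapsed < MIN_DELAY_SECONDS:
            time.sleep(MIN_DELAY_SECONDS - elapsed)
        self.last_request = time.monotonic()
        self.count += 1

# Usage: call scheduler.wait_for_slot() before each request your spider makes.
```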

