How fast can I crawl a website?

I am going to crawl a website for some information, about 170,000+ pages. How many requests can I make? I will be fetching the HTML and extracting data from it. It is already a very popular site, so I don't think it would go down just because I crawl through all the pages quickly... The only thing that makes me nervous is that I don't know whether the site will block my IP address (or do something else) if I do this. Is that normal? Should I just load 5 pages per minute? Then it will take a very long time... I want to pull fresh data every 24 hours.
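For scale, a rough back-of-the-envelope calculation of the rates mentioned above (a quick Python sketch, nothing site-specific):

    # Rough arithmetic for the numbers in the question.
    pages = 170_000

    # At 5 pages per minute, one full pass takes about 24 days:
    minutes_needed = pages / 5
    print(minutes_needed / 60 / 24)   # ~23.6 days

    # To refresh every page within 24 hours you need roughly 2 requests per second:
    print(pages / (24 * 60 * 60))     # ~1.97 requests/second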

Thanks in advance for any answers!

+4
4 answers

This will take some time; I suggest you use rotating proxies and add multithreading. Ten threads would do it, so you can have 10 requests in flight at once. The proxies will make individual requests slower, and adding a delay of at least 1.5 seconds after each request will slow you down further, but it reduces the risk of getting banned.
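A minimal sketch of that setup in Python, assuming a requests-based fetcher; the proxy addresses and URL list below are placeholders:

    import time
    import requests
    from concurrent.futures import ThreadPoolExecutor

    # Placeholders -- substitute your own proxy pool and the real URL list.
    PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]
    URLS = ["https://example.com/page/%d" % i for i in range(100)]
    DELAY = 1.5  # seconds to wait after each request, as suggested above

    def fetch(item):
        index, url = item
        proxy = PROXIES[index % len(PROXIES)]   # rotate through the proxy pool
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
            return url, resp.status_code, resp.text
        except requests.RequestException:
            return url, None, None
        finally:
            time.sleep(DELAY)                   # throttle this worker thread

    # 10 worker threads => at most 10 requests in flight at once.
    with ThreadPoolExecutor(max_workers=10) as pool:
        for url, status, html in pool.map(fetch, enumerate(URLS)):
            pass  # parse `html` here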

+5

I built a web crawler a couple of years ago that pulled about 7 GB a night from the BBC website (bandwidth-limited) and it was never blocked, but adding a 1 second delay between requests is the decent thing to do.
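If the crawl is single-threaded, that delay is just a sleep between fetches; a minimal sketch using only the standard library (the URL list is a placeholder):

    import time
    import urllib.request

    # Placeholder URLs -- substitute the real page list.
    urls = ["https://example.com/page/%d" % i for i in range(10)]

    for url in urls:
        with urllib.request.urlopen(url) as resp:
            html = resp.read()
        # ... extract what you need from `html` here ...
        time.sleep(1)  # one-second pause between requests, as suggested above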

+2

A delay of a second or two after each request should be sufficient. Making your bot crawl as fast as it possibly can is what actually gets you denied. I maintain the sites for several newspapers, and I occasionally see homegrown crawlers. The bad ones can really hammer the system and get their IP blacklisted. Don't be that guy.

+1

As long as you follow the site's robots.txt instructions, you should probably be fine. The standard delay I have seen between requests is 2 seconds; going faster than that is quite often the point at which sites start throttling your traffic or blocking your IP.
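A sketch of checking robots.txt with Python's standard library; the site URL and user-agent string are placeholders, and the 2-second fallback matches the delay mentioned above:

    import time
    import urllib.request
    import urllib.robotparser

    SITE = "https://example.com"      # placeholder for the target site
    USER_AGENT = "my-crawler"         # use an identifiable user agent

    rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
    rp.read()

    # Honour a Crawl-delay directive if the site declares one, else fall back to 2 s.
    delay = rp.crawl_delay(USER_AGENT) or 2

    url = SITE + "/some/page"
    if rp.can_fetch(USER_AGENT, url):
        with urllib.request.urlopen(url) as resp:
            html = resp.read()
        time.sleep(delay)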

+1
