What is the standard for requests per second when scraping websites?

This was the closest existing question to mine, and IMO it did not answer it very well:

Web scraping etiquette

I am specifically looking for the answer to point #1:

How many requests per second should I limit my scraper to?

Right now I work off a queue of links. Each site being scraped gets its own thread, which sleeps for 1 second between requests. I request gzip compression to save bandwidth.
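For concreteness, a minimal sketch of that setup, assuming Python with the third-party requests library; the example domains and the handle_page parser are placeholders, not part of the original question:

```python
import queue
import threading
import time

import requests  # third-party; pip install requests

def handle_page(url: str, html: str) -> None:
    # Placeholder for real parsing logic
    print(f"fetched {url}: {len(html)} bytes")

def scrape_domain(url_queue: "queue.Queue[str]", delay: float = 1.0) -> None:
    session = requests.Session()
    session.headers["Accept-Encoding"] = "gzip"  # ask for gzip to save bandwidth
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        resp = session.get(url, timeout=30)
        handle_page(url, resp.text)
        time.sleep(delay)  # 1 second between requests to the same site

# One queue and one worker thread per domain (example URLs are hypothetical)
domains = {
    "example.com": ["https://example.com/a", "https://example.com/b"],
    "example.org": ["https://example.org/x"],
}
threads = []
for urls in domains.values():
    q: "queue.Queue[str]" = queue.Queue()
    for u in urls:
        q.put(u)
    t = threading.Thread(target=scrape_domain, args=(q,))
    t.start()
    threads.append(t)
for t in threads:
    t.join()
```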

Are there any standards for this? Surely all the major search engines follow some set of guidelines on this.

+4
3 answers

The Wikipedia article on web crawlers has some information on what others are doing:

Cho [22] uses 10 seconds as an interval for accesses, and the WIRE crawler [28] uses 15 seconds as the default. The MercatorWeb crawler follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits 10t seconds before downloading the next page. [29] Dill et al. [30] use 1 second.

I usually try 5 seconds with a bit of randomness, so it looks less suspicious.
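As an illustration only, here is a small sketch combining the two ideas above: a roughly 5-second baseline with random jitter, plus the MercatorWeb-style rule of waiting about 10x the observed download time. Python with requests is assumed, and the constants are arbitrary choices, not part of the answer:

```python
import random
import time

import requests  # third-party; pip install requests

BASE_DELAY = 5.0        # ~5 second baseline mentioned above
ADAPTIVE_FACTOR = 10.0  # MercatorWeb-style: wait ~10x the download time

def polite_get(session: requests.Session, url: str) -> requests.Response:
    start = time.monotonic()
    resp = session.get(url, timeout=30)
    download_time = time.monotonic() - start
    # Wait whichever is longer: the jittered baseline or 10x the download time
    delay = max(BASE_DELAY + random.uniform(0, 2), ADAPTIVE_FACTOR * download_time)
    time.sleep(delay)
    return resp

# Usage example (hypothetical URL):
# session = requests.Session()
# page = polite_get(session, "https://example.com/page")
```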

+4

There is no established standard for this; it depends on how much load your scraping puts on the site. As long as you do not noticeably affect the site's speed for other users, it is probably an acceptable scraping rate.

Since the number of users and the load on the website fluctuate constantly, it is a good idea to adjust your scraping rate dynamically.

Keep track of the load latency of each page, and if the latency starts to increase, slow down the scraping rate. Essentially, your scraping rate should be inversely proportional to the website's load/latency.
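One rough way to approximate this idea, sketched in Python with the requests library; the smoothing factor, thresholds, and example URLs are made up for illustration:

```python
import time

import requests  # third-party; pip install requests

class AdaptiveDelay:
    """Grow the inter-request delay when page latency rises, shrink it when it falls."""

    def __init__(self, base_delay: float = 1.0, min_delay: float = 1.0, max_delay: float = 60.0):
        self.delay = base_delay
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.baseline_latency = None  # smoothed "normal" latency for the site

    def update(self, latency: float) -> float:
        if self.baseline_latency is None:
            self.baseline_latency = latency
        # Exponential moving average of the observed latency
        self.baseline_latency = 0.9 * self.baseline_latency + 0.1 * latency
        ratio = latency / self.baseline_latency
        if ratio > 1.5:    # site looks slower than usual: back off
            self.delay = min(self.delay * ratio, self.max_delay)
        elif ratio < 0.8:  # site looks healthy again: speed back up gradually
            self.delay = max(self.delay * 0.9, self.min_delay)
        return self.delay

session = requests.Session()
pacer = AdaptiveDelay()
for url in ["https://example.com/1", "https://example.com/2"]:  # hypothetical URLs
    start = time.monotonic()
    resp = session.get(url, timeout=30)
    latency = time.monotonic() - start
    time.sleep(pacer.update(latency))
```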

+2

When clients or my boss ask me to do something like this, I usually look for a public API before resorting to scraping the public site. Also, contacting the site owner or a technical contact and asking for permission can go a long way toward avoiding cease-and-desist letters.

+1
