Can a web scraper get around good throttle protection?

Suppose the data source enforces a hard IP-based throttle. Can a web scraper still get at the data in any way if the throttle starts rejecting requests after as little as 1% of the data has been downloaded?

The only method I could think of for an attacker here would be some kind of proxy system. But it seems that the proxies (even fast ones) would eventually hit the throttle themselves.

Update: Some of the answers below mention large proxy networks such as Yahoo Pipes and Tor, but couldn't these IP ranges or known exit nodes be blacklisted as well?

+8
security web-scraping
7 answers

A list of thousands of proxies can be compiled for free. IPv6 addresses can be rented for pennies. Hell, an attacker can spin up an Amazon EC2 micro instance for 2-7 cents per hour.

And you want to stop people from scraping your site? The Internet does not work that way, and I hope it never will.

(I have seen IRC servers run port checks against connecting clients to see whether common proxy ports are open: 8080, 3128, 1080. However, there are proxy servers that listen on other ports, and there are also legitimate reasons to run a proxy or to have those ports open, for example if you run Apache Tomcat on 8080. You can take this further with YAPH to detect whether the client is running an open proxy; in effect, you are using an attacker's own tool against them. ;)
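
For illustration, here is a minimal sketch of such a port check in Python. The port list comes from the numbers above, the target IP is a placeholder, and note that an open port alone does not prove the client is an open proxy:

```python
import socket

# Ports commonly probed for open proxies:
# 8080 (HTTP proxy), 3128 (Squid), 1080 (SOCKS).
PROXY_PORTS = (8080, 3128, 1080)

def open_proxy_ports(client_ip, timeout=2.0):
    """Return the subset of PROXY_PORTS accepting a TCP connection."""
    open_ports = []
    for port in PROXY_PORTS:
        try:
            # create_connection() raises OSError if the port is
            # closed, filtered, or unreachable within the timeout.
            with socket.create_connection((client_ip, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            pass
    return open_ports

if __name__ == "__main__":
    # 203.0.113.10 is a documentation address; substitute the client's IP.
    print(open_proxy_ports("203.0.113.10"))
```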

+7

Someone using Tor will rotate IP addresses every few minutes. I used to run a website where this was a problem, and I resorted to blocking the IP addresses of known Tor exit nodes whenever excessive scraping was detected. You can implement this if you can find a regularly updated list of Tor exit nodes, for example https://www.dan.me.uk/tornodes
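
A sketch of that approach, assuming a plain-text endpoint that serves one exit-node IP per line (the exact URL and format on dan.me.uk may differ, and the list should be cached rather than fetched per request):

```python
import urllib.request

# Assumed plain-text endpoint; check dan.me.uk for the current URL/format.
TOR_EXIT_LIST_URL = "https://www.dan.me.uk/torlist/"

def fetch_tor_exit_nodes():
    """Download the exit-node list: one IP address per line."""
    with urllib.request.urlopen(TOR_EXIT_LIST_URL, timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return {line.strip() for line in text.splitlines() if line.strip()}

# Refresh this periodically (e.g. hourly) instead of on every request.
TOR_EXIT_NODES = fetch_tor_exit_nodes()

def is_tor_exit(client_ip):
    return client_ip in TOR_EXIT_NODES
```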

+2

You could use a P2P crawling network to accomplish this task. Many IP addresses would be available, and it would not matter if one of them got throttled. On top of that, you can combine many client instances with some proxy configuration, as suggested in the previous answers.

I think you could use YaCy, an open-source P2P crawling network.

+1

A scraper that wants the data will get the data: delays between requests, rotating user-agent strings, proxies, and of course EC2/RackSpace or any other cloud service that can start and stop servers with fresh IP addresses for pennies.
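
As a rough sketch of what that looks like from the scraper's side (using Python's requests library; the user-agent strings and proxy URLs below are placeholders):

```python
import random
import time

import requests

# Placeholder pools; a real scraper would rotate much larger, fresher lists.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
]
PROXIES = ["http://proxy1.example:8080", "http://proxy2.example:8080"]

def fetch(url):
    """One request with a random delay, agent string, and proxy."""
    time.sleep(random.uniform(1.0, 5.0))  # randomized pause between requests
    proxy = random.choice(PROXIES)
    return requests.get(
        url,
        headers={"User-Agent": random.choice(USER_AGENTS)},
        proxies={"http": proxy, "https": proxy},
        timeout=15,
    )
```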

+1

I have heard of people using Yahoo Pipes to do exactly this kind of thing, essentially using Yahoo as a proxy to pull the data.

0

Perhaps try running your scraper on Amazon EC2 instances. Every time you get throttled, start a new instance (with a new IP) and kill the old one.
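
A minimal sketch of that rotation using the boto3 SDK (the AMI ID and region are placeholders, and the throttle check is a stand-in for whatever signal the scraper uses, e.g. HTTP 429 responses):

```python
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")  # region is illustrative

def launch_scraper_instance():
    """Start a fresh micro instance; it comes up with a new public IP."""
    (instance,) = ec2.create_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI with the scraper baked in
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
    )
    instance.wait_until_running()
    instance.reload()  # refresh attributes so public_ip_address is populated
    return instance

instance = launch_scraper_instance()
# ... scrape via the instance until the target starts rejecting requests ...
throttled = True  # stand-in for real throttle detection
if throttled:
    instance.terminate()                  # discard the burned IP
    instance = launch_scraper_instance()  # continue from a fresh one
```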

0

It depends on how much time the attacker has to get the data. If most of the data is static, it may be worthwhile for an attacker to run his scraper for, say, 50 days. If he is on a DSL line where he can request a "new" IP address twice a day, a 1% limit will not hurt him: twice a day for 50 days is 100 fresh IPs, and at 1% of the data per IP that covers the whole dataset.

Of course, if you need the data faster (because it becomes stale quickly), there are better ways (use EC2 instances, set up a BOINC project if there is public interest in the collected data, etc.).

Or there is the pyramid-scheme route, a la "get 10 people to run your scraper and you get PORN; get 100 people to run it and you get LOTS of porn", as was quite common a few years ago on ad-laden websites. Because of the competitive element (whoever brings in the most referrals wins), you can quickly get a lot of nodes running your crawler for very little money.

0
