How do I block bad bots from my site without interfering with real users?

I want to keep bad scrapers (in other words, badly behaved bots that by default ignore robots.txt) that steal content and consume bandwidth away from my site. At the same time, I do not want to interfere with the user experience of legitimate human users, or stop well-behaved bots (such as Googlebot) from indexing the site.

The standard method for dealing with this is already described here: Tactics for dealing with misbehaving robots. However, the solution presented and upvoted in that thread is not what I am looking for.

Some bad bots connect via Tor or botnets, which means that their IP addresses are ephemeral and may well belong to a human being using a compromised computer.

So I was thinking about how to improve on the standard industry method by allowing "false positives" (that is, humans) who end up on the blacklist to access my website again. One idea is to stop blocking these IP addresses outright and instead ask them to pass a CAPTCHA before allowing access. While I consider CAPTCHAs a PITA for legitimate users, vetting suspected bad bots with a CAPTCHA seems like a better solution than blocking those IP addresses completely. By tracking the session of users who complete the CAPTCHA, I should be able to determine whether they are human (and remove their IP address from the blacklist) or bots smart enough to solve the CAPTCHA, which I can then put on an even stricter blacklist.
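
To make the idea concrete, here is a minimal sketch of the flow I have in mind (Flask-style Python; verify_captcha and render_captcha_form are hypothetical placeholders for whatever CAPTCHA provider is used, and the in-memory sets would be a real data store in practice):

    from flask import Flask, request, session, redirect

    app = Flask(__name__)
    app.secret_key = "change-me"

    suspect_ips = set()     # IPs flagged as suspicious; challenge instead of block
    hard_blacklist = set()  # IPs that solved the CAPTCHA but still behaved like bots

    @app.before_request
    def challenge_suspects():
        ip = request.remote_addr
        if ip in hard_blacklist:
            return "Forbidden", 403
        if ip in suspect_ips and not session.get("captcha_passed") and request.path != "/captcha":
            return redirect("/captcha")

    @app.route("/captcha", methods=["GET", "POST"])
    def captcha():
        if request.method == "POST" and verify_captcha(request.form):
            session["captcha_passed"] = True
            suspect_ips.discard(request.remote_addr)  # treat as a false positive: unblock
            return redirect("/")
        return render_captcha_form()

    def verify_captcha(form):
        # Placeholder: call the CAPTCHA provider's verification API here.
        return form.get("captcha_answer") == "expected-answer"

    def render_captcha_form():
        # Placeholder: serve the actual CAPTCHA widget here.
        return '<form method="post"><input name="captcha_answer"><button>Submit</button></form>'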

However, before I go ahead and implement this idea, I want to ask the good people here whether they foresee any problems or flaws (I already know that some CAPTCHAs have been broken, but I think I can handle that).

1 answer

The question, as I read it, is whether there are any foreseeable problems with the CAPTCHA approach. Before diving into that, I also want to touch on how you plan to catch bots in order to challenge them in the first place. Tor and proxy nodes change regularly, so the IP list has to be updated constantly. You can use MaxMind for a decent baseline list of proxy addresses. You can also find services that maintain up-to-date lists of all Tor exit nodes. But not all bad bots come from these two vectors, so you need other ways to catch bots as well. If you add rate limiting and spam lists on top of those, you should catch more than 50% of the bad bots. Any further tactics really have to be tuned to your site.
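
A rough sketch of what the first two layers might look like (Python; the Tor bulk exit-node list URL is an assumption based on what the Tor Project publishes, and the rate-limit thresholds are purely illustrative):

    import time
    import urllib.request
    from collections import defaultdict, deque

    # Assumed URL of the Tor Project's bulk exit-node list; verify before relying on it.
    TOR_EXIT_LIST_URL = "https://check.torproject.org/torbulkexitlist"

    def fetch_tor_exit_nodes():
        """Download the current set of Tor exit-node IPs; refresh on a schedule."""
        with urllib.request.urlopen(TOR_EXIT_LIST_URL, timeout=10) as resp:
            return set(resp.read().decode().split())

    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 120  # tune to your own traffic
    _request_log = defaultdict(deque)

    def is_suspicious(ip, tor_exits):
        """Return True if this IP should be sent to the CAPTCHA challenge."""
        if ip in tor_exits:
            return True
        now = time.time()
        log = _request_log[ip]
        log.append(now)
        while log and log[0] < now - WINDOW_SECONDS:
            log.popleft()
        return len(log) > MAX_REQUESTS_PER_WINDOW

    # Usage: refresh tor_exits hourly, then feed suspects into the CAPTCHA flow:
    #   tor_exits = fetch_tor_exit_nodes()
    #   if is_suspicious(client_ip, tor_exits): ...  # add client_ip to the suspect list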

Now let's talk about problems with CAPTCHAs. First, there are services such as http://deathbycaptcha.com/ . I don't know if I need to go into detail on that, but it pretty much makes your approach moot. Many of the other ways people get around CAPTCHAs use OCR software. The better the CAPTCHA is at defeating OCR, the harder it will be on your users. In addition, many CAPTCHA systems use client-side cookies, which someone can solve once and then share across all of their bots. The best known write-up, I think, is Karl Groves' list of 28 ways to beat a CAPTCHA: http://www.karlgroves.com/2013/02/09/list-of-resources-breaking-captcha/
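
The cookie-sharing weakness in particular can be blunted by keeping the proof of a solved CAPTCHA server-side and binding it to the client, rather than trusting a bare cookie. A rough sketch under those assumptions (Python; all names and the 15-minute TTL are illustrative, and the dict would be Redis or a database in practice):

    import hashlib
    import secrets
    import time

    TOKEN_TTL = 15 * 60   # a solved CAPTCHA is only good for 15 minutes
    solved_tokens = {}    # token -> (client fingerprint, expiry time)

    def fingerprint(ip, user_agent):
        return hashlib.sha256(f"{ip}|{user_agent}".encode()).hexdigest()

    def issue_token(ip, user_agent):
        """Call after a successful CAPTCHA; set the returned value as the cookie."""
        token = secrets.token_urlsafe(32)
        solved_tokens[token] = (fingerprint(ip, user_agent), time.time() + TOKEN_TTL)
        return token

    def token_is_valid(token, ip, user_agent):
        """Reject tokens that have expired or were copied to a different client."""
        entry = solved_tokens.get(token)
        if not entry:
            return False
        fp, expiry = entry
        if time.time() > expiry or fp != fingerprint(ip, user_agent):
            solved_tokens.pop(token, None)
            return False
        return True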

For full disclosure, I co-founded Distil Networks, a SaaS bot-blocking solution. I often pitch our software as a more sophisticated system than just using a CAPTCHA and building it yourself, so my opinion of the effectiveness of your solution is biased.

