I want to keep bad scrapers (in other words, malicious bots that ignore robots.txt by default) from stealing content and consuming bandwidth on my site. At the same time, I do not want to interfere with the experience of legitimate users, or stop well-behaved bots (such as Googlebot) from indexing the site.
The standard approach to this is already described here: Tactics for dealing with badly behaved robots. However, the solutions presented and upvoted in that thread are not what I am looking for.
Some bad bots connect via Tor or botnets, which means their IP addresses are ephemeral and may well belong to a person using a compromised computer.
So I have been thinking about how to improve on the standard industry approach by letting "false positives" (that is, humans) whose IP ended up on the blacklist access my website again. One idea is to stop blocking these IP addresses outright and instead require them to pass a CAPTCHA before being granted access. While I consider CAPTCHAs a PITA for legitimate users, challenging only suspected bad bots with a CAPTCHA seems better than blocking their IP addresses completely. By tracking the session of users who complete the CAPTCHA, I should be able to determine whether they are human (and remove their IP address from the blacklist) or a bot smart enough to solve CAPTCHAs, which earns it a place on an even blacker blacklist.
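To make the idea concrete, here is a minimal sketch of the "CAPTCHA instead of hard block" flow, assuming a Flask app. The in-memory sets, the `is_captcha_valid()` check, the `/captcha` route, and the request-count threshold are all hypothetical placeholders; a real setup would use a shared store such as Redis and a proper CAPTCHA service.

```python
# Sketch only: challenge suspect IPs with a CAPTCHA instead of blocking them.
from flask import Flask, request, redirect, session, abort, render_template_string

app = Flask(__name__)
app.secret_key = "change-me"  # required for session cookies

# Illustrative in-memory stores (not shared between workers, lost on restart).
suspect_ips = {"203.0.113.7"}   # IPs flagged by the usual bad-bot heuristics
hard_blacklist = set()          # bots that solved the CAPTCHA but kept misbehaving

@app.before_request
def challenge_suspect_ips():
    ip = request.remote_addr
    if ip in hard_blacklist:
        abort(403)                      # known bad actor: block outright
    if ip in suspect_ips and not session.get("captcha_passed"):
        if request.path != "/captcha":
            return redirect("/captcha") # ask suspects to prove they are human

@app.before_request
def track_post_captcha_behaviour():
    # Bots clever enough to solve the CAPTCHA reveal themselves afterwards,
    # e.g. by an implausible request rate; the threshold here is arbitrary.
    if session.get("captcha_passed"):
        session["hits"] = session.get("hits", 0) + 1
        if session["hits"] > 1000:
            hard_blacklist.add(request.remote_addr)

@app.route("/captcha", methods=["GET", "POST"])
def captcha():
    if request.method == "POST":
        if is_captcha_valid(request.form.get("answer", "")):
            session["captcha_passed"] = True
            suspect_ips.discard(request.remote_addr)  # un-blacklist the human
            return redirect("/")
    return render_template_string(
        '<form method="post"><input name="answer"><button>Submit</button></form>'
    )

def is_captcha_valid(answer: str) -> bool:
    # Placeholder: plug in reCAPTCHA/hCaptcha verification here.
    return answer == "expected"
```

The key design point is that the blacklist only changes what a suspect IP sees (a challenge page), not whether it can eventually reach the site, so a human on a compromised or Tor exit IP can recover access on their own.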
However, before I go ahead and implement this idea, I would like to ask the good people here whether they foresee any problems or pitfalls (I am already aware that some CAPTCHAs have been broken, but I think I can deal with that).