I am wondering whether there are any methods to identify a web crawler that collects information for illegitimate use. Simply put: content scraping to build a carbon copy of the site.
Ideally, such a system would detect a crawl pattern from an unknown source (i.e., not a known crawler such as Googlebot) and feed that crawler false information.
- If, as a defender, I find an unknown crawler that hits the site at regular intervals, the attacker will randomize the intervals.
- If, as a defender, I discover the same agent / IP, the attacker will randomize the agent.
And here I am lost: if an attacker randomizes both the intervals and the agents, how would I distinguish the crawler from legitimate proxies and machines that hit the site from the same network?
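
To make the "same network" angle concrete, here is a rough sketch of what I have in mind (the log format, the /24 grouping, and the field names are my own assumptions): aggregate requests by network prefix and look at inter-arrival-time variance, so randomized agents within one network still collapse into a single behavioural profile.

```
import ipaddress
from collections import defaultdict
from statistics import mean, pstdev

def profile_by_network(requests, prefix_len=24):
    """Group (timestamp, ip, user_agent) tuples from the access log by
    network prefix and summarise each group's request behaviour."""
    buckets = defaultdict(list)
    for ts, ip, agent in requests:
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        buckets[net].append((ts, agent))

    profiles = {}
    for net, hits in buckets.items():
        hits.sort()
        gaps = [b[0] - a[0] for a, b in zip(hits, hits[1:])]
        profiles[net] = {
            "hits": len(hits),
            "distinct_agents": len({agent for _, agent in hits}),
            # Low variance in inter-arrival times hints at a scheduler,
            # even if the attacker adds some jitter on top.
            "gap_mean": mean(gaps) if gaps else None,
            "gap_stdev": pstdev(gaps) if gaps else None,
        }
    return profiles
```

A bucket with many hits, many distinct agents, and suspiciously low gap variance would then be a candidate for the JavaScript/cookie check below.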
I am considering challenging a suspicious agent with a JavaScript and cookie check. If the client cannot pass it consistently, then it is probably a bad guy.
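
A minimal sketch of that check, assuming a Flask front end (the route, the cookie name, and the static token are placeholders; a real deployment would use a signed, expiring value):

```
from flask import Flask, request, make_response

app = Flask(__name__)

# Challenge page: only a client that actually executes JavaScript and
# keeps cookies will set js_check and come back with it.
CHALLENGE_PAGE = """
<html><body>
<script>
  document.cookie = "js_check=1; path=/";
  location.reload();
</script>
</body></html>
"""

@app.route("/protected")
def protected():
    if request.cookies.get("js_check") != "1":
        # No cookie: either a first visit or a client that cannot run JS.
        # Serve the challenge instead of the real content.
        return make_response(CHALLENGE_PAGE)
    return "real page content"
```

Instead of the real page, the fallback branch could just as well serve decoy content to the suspected scraper, in line with the "send false information" idea above.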
What else can I do? Are there any algorithms, or even off-the-shelf systems, designed to analyze historical request data quickly and on the fly?
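
For the "on the fly" part, the simplest thing I can think of is an exponentially decaying per-client counter; a rough sketch (the half-life and threshold values are arbitrary, and the client key could be an IP, a network prefix, or a fingerprint):

```
import time

class DecayingCounter:
    """Request counter whose value halves every `half_life` seconds."""

    def __init__(self, half_life=60.0):
        self.half_life = half_life
        self.value = 0.0
        self.last = time.monotonic()

    def hit(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = now - self.last
        # Decay the accumulated count, then add the new request.
        self.value *= 0.5 ** (elapsed / self.half_life)
        self.value += 1.0
        self.last = now
        return self.value

counters = {}

def is_suspicious(client_key, threshold=120.0):
    counter = counters.setdefault(client_key, DecayingCounter())
    return counter.hit() > threshold
```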