Identification of hostile web crawlers

I am wondering whether there are any methods to identify a web crawler that collects information for illegitimate use: simply put, content scraping aimed at creating a carbon copy of the site.

Ideally, such a system would detect a crawl pattern from an unknown source (i.e., one not on a list of known crawlers such as Googlebot) and serve false information to it.

  • If, as a defender, I detect an unknown crawler that hits the site at regular intervals, the attacker will randomize the intervals.
  • If, as a defender, I detect a repeated user agent or IP address, the attacker will randomize the user agent.

And here I am lost: if the attacker randomizes both the intervals and the user agents, how would I distinguish crawlers hiding behind proxies from legitimate machines that hit the site from the same network?
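One way I can think of is to stop looking at individual IPs and aggregate request history per network prefix instead. Below is a minimal sketch of that idea in Python, assuming an in-memory sliding window; the /24 grouping, window length, and threshold are arbitrary illustrative choices, not recommendations.

```python
import ipaddress
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300        # sliding window length (arbitrary)
REQUEST_THRESHOLD = 500     # flag a /24 that exceeds this many hits per window (arbitrary)

# Request timestamps grouped by /24 network prefix.
_history = defaultdict(deque)

def record_request(client_ip: str) -> bool:
    """Record one hit and return True if the client's /24 network looks
    like a distributed crawler within the sliding window."""
    network = ipaddress.ip_network(f"{client_ip}/24", strict=False)
    now = time.time()
    window = _history[network]
    window.append(now)
    # Drop timestamps that fell out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > REQUEST_THRESHOLD
```

That would at least surface distributed crawling coming from a single hosting provider's range, even when agents and intervals are randomized, but it does not help against a widely spread proxy pool.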

I am considering challenging suspicious agents with a JavaScript and cookie check. If a client cannot pass it consistently, it is probably a bot.
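A rough sketch of what I have in mind, assuming a Flask backend (the cookie name, route, and threshold are made up for illustration): the page embeds a script that sets a cookie, and clients that repeatedly fail to present it get flagged.

```python
from collections import defaultdict
from flask import Flask, abort, make_response, request

app = Flask(__name__)

CHALLENGE_COOKIE = "js_check"   # hypothetical cookie name
CHALLENGE_VALUE = "1"           # a real check would use a signed, per-session value
MAX_MISSES = 10                 # arbitrary tolerance before flagging the client

# Inline script: only clients that actually execute JavaScript will set the cookie.
CHALLENGE_SNIPPET = (
    f'<script>document.cookie = "{CHALLENGE_COOKIE}={CHALLENGE_VALUE}; path=/";</script>'
)

_misses = defaultdict(int)      # consecutive challenge failures per IP

@app.route("/")
def index():
    ip = request.remote_addr
    if request.cookies.get(CHALLENGE_COOKIE) == CHALLENGE_VALUE:
        _misses[ip] = 0
        return "<html><body>Hello (challenge passed)</body></html>"

    _misses[ip] += 1
    if _misses[ip] > MAX_MISSES:
        # Consistently fails to run JS / keep cookies: treat as a bot.
        abort(403)
    # Serve the page with the challenge embedded so a real browser passes next time.
    return make_response("<html><body>" + CHALLENGE_SNIPPET + "Hello</body></html>")
```

I realize a determined attacker running a headless browser defeats this easily, and it also penalizes legitimate users who disable cookies, so it would only filter out the naive scrapers.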

What else can I do? Are there any algorithms, or even existing systems, designed to analyze historical request data on the fly?

+4
3 answers

My solution would be to set a trap. Add some pages to your site that are disallowed in robots.txt, link to them from your pages but hide the links with CSS, and then IP-ban anyone who visits them.
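A minimal sketch of such a trap, assuming a Flask app; the /secret-trap path and the in-memory ban list are illustrative placeholders, not a fixed recipe.

```python
from flask import Flask, abort, request

app = Flask(__name__)

banned_ips = set()   # in production, persist this or push it into firewall rules

# robots.txt tells well-behaved crawlers to stay away from the trap:
#   User-agent: *
#   Disallow: /secret-trap
#
# Real pages contain a link no human will ever see or click:
#   <a href="/secret-trap" style="display:none" rel="nofollow">ignore</a>

@app.before_request
def reject_banned():
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/secret-trap")
def trap():
    # Only a client that ignores robots.txt and follows hidden links lands here.
    banned_ips.add(request.remote_addr)
    abort(403)
```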

This forces the offender to obey robots.txt, which means you can permanently keep important information or services out of the crawler's reach, making his carbon copy of the site useless.

+9

Do not try to recognize crawlers by IP address, timing, or intervals; instead, use the data that you send to the crawler to track it.

Create a whitelist of known good crawlers and serve them your pages as usual. For everyone else, serve pages with an extra bit of unique content that only you know how to look for. Use that signature later to determine who has copied your content, and block them.
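A sketch of that watermarking idea, assuming the signature is keyed off the requesting IP with a server-side HMAC secret; the helper names and the whitelist check below are illustrative only.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"               # server-side secret; never sent to clients
KNOWN_GOOD = ("Googlebot", "bingbot")   # whitelist of crawlers that get unmarked pages

def watermark_for(client_ip: str) -> str:
    """Deterministic per-visitor token that only the server can recompute."""
    return hmac.new(SECRET_KEY, client_ip.encode(), hashlib.sha256).hexdigest()[:16]

def render_page(body_html: str, client_ip: str, user_agent: str) -> str:
    """Serve whitelisted crawlers as usual; everyone else gets a hidden signature."""
    if any(bot in user_agent for bot in KNOWN_GOOD):
        return body_html
    # Hide the token somewhere innocuous, e.g. an HTML comment or zero-width text.
    return body_html + f"\n<!-- {watermark_for(client_ip)} -->"

def find_copier(stolen_html: str, logged_visitor_ips: list[str]) -> str | None:
    """Given a copied page, recompute tokens for logged visitors and see whose matches."""
    for ip in logged_visitor_ips:
        if watermark_for(ip) in stolen_html:
            return ip
    return None
```

Note that matching the whitelist on the User-Agent string alone is spoofable; in practice you would also verify well-known crawlers, for example via reverse DNS lookups.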

+2

And how would you deter someone from hiring a person in a low-wage country to browse your site with an ordinary browser and record all the information? Set up a robots.txt file, invest in security infrastructure to prevent DoS attacks, obfuscate your code where you can (e.g., JavaScript), patent your inventions, and register the copyright on your site. Let the lawyers worry about someone ripping you off.

+2
