How do I know if my site is being scraped?


I have come up with some indicators so far ...

  • Unusual network bandwidth use that causes bandwidth problems (still applies if a proxy server is used).
  • Searching a search engine for your keywords turns up new links to other, similar resources with the same content (still applies if a proxy server is used).
  • Many requests from the same IP address.
  • A high request rate from a single IP address (as an aside: what is a normal rate?).
  • A missing or odd user agent (still applies if a proxy server is used).
  • Requests at predictable (equal) intervals from the same IP address.
  • Certain support files are never requested, e.g. favicon.ico, various CSS and JavaScript files (still applies if a proxy is used).
  • The sequence of requests, e.g. the client accesses pages that are not directly reachable through links (still applies if a proxy is used).

Can you add to this list?

Which of these points still apply if the scraper uses a proxy?
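
To make these checks concrete, here is a minimal sketch (a hypothetical Python script, not something from the question) that computes the per-IP indicators above, request volume, request rate, interval regularity and missing support-file requests, from an Apache/nginx combined-format access log; every threshold in it is an assumption and would need tuning for a real site.

    # Rough sketch of the per-IP checks above, run against a combined-format
    # access log. Log path, regex and all thresholds are assumptions.
    import re
    from collections import defaultdict
    from datetime import datetime
    from statistics import pstdev

    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST|HEAD) (\S+)')
    SUPPORT_FILES = ("/favicon.ico", ".css", ".js")

    def find_suspects(log_path, max_rate_per_min=60, max_interval_stddev=2.0):
        hits = defaultdict(list)                      # ip -> [(timestamp, path), ...]
        for line in open(log_path, encoding="utf-8", errors="replace"):
            m = LOG_LINE.match(line)
            if not m:
                continue
            ip, ts, path = m.groups()
            when = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")
            hits[ip].append((when, path))

        suspects = {}
        for ip, requests in hits.items():
            requests.sort()
            times = [t for t, _ in requests]
            paths = [p for _, p in requests]
            minutes = max((times[-1] - times[0]).total_seconds() / 60.0, 1.0)
            rate = len(requests) / minutes            # requests per minute
            gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
            regular = len(gaps) > 5 and pstdev(gaps) < max_interval_stddev
            no_assets = not any(p.endswith(SUPPORT_FILES) for p in paths)
            if rate > max_rate_per_min or regular or (no_assets and len(requests) > 50):
                suspects[ip] = {"requests": len(requests), "per_minute": round(rate, 1),
                                "regular_intervals": regular, "no_support_files": no_assets}
        return suspects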

2 answers

As a first note: consider offering a bot API going forward. If another company (or anyone else) is scraping you, and it is information you are willing to provide, then your site is clearly valuable to them. Offering an API will significantly reduce the load on your server and give you full visibility into exactly who is consuming your data.
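
To illustrate the API suggestion, here is a minimal sketch assuming Flask is available; the route, the data and the API-key scheme are hypothetical placeholders. The point is that one cheap JSON response replaces a full page render, and issuing per-client keys tells you exactly who is pulling data.

    # Minimal sketch of the "offer an API" idea, assuming Flask.
    # The route, data and API keys are hypothetical placeholders.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    API_KEYS = {"example-key": "Example Corp"}                    # issued by you, per client
    ARTICLES = {1: {"id": 1, "title": "Example", "body": "..."}}  # stand-in data

    @app.route("/api/v1/articles/<int:article_id>")
    def get_article(article_id):
        client = API_KEYS.get(request.args.get("api_key", ""))
        if client is None:
            return jsonify({"error": "unknown or missing api_key"}), 401
        article = ARTICLES.get(article_id)
        if article is None:
            return jsonify({"error": "not found"}), 404
        # One cheap JSON response instead of a rendered HTML page, and the
        # api_key identifies exactly which client made the request.
        return jsonify(article)

    if __name__ == "__main__":
        app.run()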

Secondly, based on personal experience (I built web crawlers a while back), you can usually tell straight away by monitoring which browser (user agent) accessed your site. A crawler written with an automated tool or in one of the development languages looks different from a regular user. Not to mention watching the log file and updating your .htaccess to ban them (if that is what you are after).

Their traffic usually looks different and is fairly easy to spot: repeated, very consistent opening of pages.
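
As a rough illustration of that approach (not the answerer's actual tooling), the sketch below scans a combined-format access log for obviously automated or missing user agents and prints candidate ban rules; the patterns and output are assumptions, so review anything it flags before adding it to .htaccess.

    # Flag obviously automated or missing user agents in a combined-format log
    # and print candidate .htaccess ban rules. Patterns are illustrative only.
    import re

    AUTOMATED_UA = re.compile(r"curl|wget|python-requests|scrapy|libwww|httpclient|go-http-client", re.I)

    def suspicious_ips(log_path):
        flagged = set()
        for line in open(log_path, encoding="utf-8", errors="replace"):
            parts = line.split('"')
            if len(parts) < 6:                        # not a combined-format line
                continue
            ip = parts[0].split()[0]
            user_agent = parts[5].strip()
            if user_agent in ("", "-") or AUTOMATED_UA.search(user_agent):
                flagged.add(ip)
        return flagged

    if __name__ == "__main__":
        for ip in sorted(suspicious_ips("access.log")):
            # Apache 2.2 syntax; on 2.4 this needs mod_access_compat or a
            # <RequireAll> block with "Require not ip".
            print(f"Deny from {ip}")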

Check out this other post for more information on how you can deal with them, as well as some thoughts on how to identify them.

How to block intruders who crawl my site?


I would also add analysis of when requests from the same clients are made. For example, if the same IP address requests the same data at the same time every day, the process is probably automated, and therefore probably scraping.

You could also analyse how many pages each user session touches. For example, if a particular user views every page of your site in a single day and you consider that unusual, that may be another indicator.

It sounds like you need a number of indicators, a score for each one, and a way to combine the scores into an overall assessment of who is most likely to be scraping you.
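
As a rough illustration of that scoring idea, here is a hedged sketch with invented indicator names, weights and threshold; the real inputs would come from the checks discussed in the question and the answers above.

    # Hedged sketch of "score each indicator and combine". Names, weights and
    # the threshold are invented; replace them with your own measurements.
    WEIGHTS = {
        "high_request_rate": 3.0,
        "suspicious_user_agent": 3.0,
        "regular_intervals": 2.0,
        "same_time_each_day": 2.0,
        "no_support_files": 1.5,
        "covers_whole_site": 1.5,
    }
    SCRAPER_THRESHOLD = 6.0            # hypothetical cut-off

    def same_time_each_day(timestamps, tolerance_minutes=5):
        """True if a client's first visit each day falls within a few minutes
        of the same time (expects datetime objects, needs at least 3 days)."""
        first_visit = {}
        for t in sorted(timestamps):
            first_visit.setdefault(t.date(), t)
        starts = sorted(t.hour * 60 + t.minute for t in first_visit.values())
        return len(starts) >= 3 and starts[-1] - starts[0] <= tolerance_minutes

    def scraping_score(indicators):
        """indicators: dict of indicator name -> bool for one IP or session."""
        return sum(WEIGHTS[name] for name, hit in indicators.items() if hit)

    # Example: a client with regular intervals, the same start time every day,
    # and no CSS/JS requests scores 2.0 + 2.0 + 1.5 = 5.5, just under the cut-off.
    example = {"regular_intervals": True, "same_time_each_day": True,
               "no_support_files": True}
    print(scraping_score(example), scraping_score(example) >= SCRAPER_THRESHOLD)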

