How do I know if a web request is coming from a Google crawler?

From the point of view of the HTTP server.

3 answers

I logged a Googlebot request in my ASP.NET application, and this is what the crawler's signature looks like.

Request IP: 66.249.71.113
User agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

My logs show many different Googlebot IP addresses in the range 66.249.71.*. All of them geolocate to Mountain View, CA, USA.

A good way to check whether a request comes from the Google crawler is to verify that its user agent contains Googlebot and http://www.google.com/bot.html. Because the same client shows up under many different IP addresses, I would not recommend checking IP addresses. This is where the client identity (the user agent) comes into the picture, so verify that instead.

Here is some sample code in C#.

  // UserAgent is null when the header is missing.
  string ua = (Request.UserAgent ?? "").ToLower();
  if (ua.Contains("googlebot") || ua.Contains("google.com/bot.html"))
  {
      // Yes, it is Googlebot.
  }
  else
  {
      // No, it is something else.
  }

It is important to note that any HTTP client can easily fake this.


You can read Google's official "Verifying Googlebot" page.

Quoting the page here:

You can verify that a bot accessing your server really is Googlebot (or another Google user agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you are concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.

For instance:

 > host 66.249.66.1
 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

 > host crawl-66-249-66-1.googlebot.com
 crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Google does not publish a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hardcoded them. The best way to identify accesses by Googlebot is to use the user agent (Googlebot).
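The reverse-then-forward DNS check quoted above can be sketched in Python roughly as follows. This is only an illustration: the function names (reverse_lookup, forward_lookup, is_verified_googlebot) are made up for this example, only the googlebot.com domain mentioned in the quote is accepted, and the forward lookup handles IPv4 only. The resolver functions are passed in as parameters so the logic can be exercised without touching the network.

```python
import socket
from typing import Callable, List


def reverse_lookup(ip: str) -> str:
    """Return the PTR name for an IP address (reverse DNS lookup)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses a hostname resolves to (forward DNS lookup)."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Hypothetical helper: True only if ip passes both DNS checks."""
    try:
        # Step 1: reverse lookup, dropping the trailing dot of an FQDN.
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        # No PTR record or resolver failure: treat as unverified.
        return False

    # Step 2: the name must sit inside the googlebot.com domain.
    if not hostname.endswith(".googlebot.com"):
        return False

    try:
        # Step 3: the forward lookup must lead back to the same IP.
        return ip in forward(hostname)
    except OSError:
        return False
```

Calling is_verified_googlebot("66.249.66.1") with the default resolvers performs real DNS queries, so in a request handler the result would probably need to be cached per IP address.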


If you are using the Apache web server, you can look at the log file "log\access.log".

Then download the list of Google IP addresses from http://www.iplists.com/nw/google.txt and check whether any of those addresses appears in your log.
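A minimal Python sketch of that comparison is shown here. It rests on two assumptions that the answer does not confirm: that the downloaded list holds one IP address per line, with "#" starting a comment, and that each access log line begins with the client IP (as in Apache's common log format). The names load_ips and matching_log_lines are invented for the example.

```python
from typing import Iterable, Iterator, Set


def load_ips(lines: Iterable[str]) -> Set[str]:
    """Collect IP addresses from a list file.

    Assumed format: one address per line, '#' starts a comment.
    """
    ips: Set[str] = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            ips.add(entry)
    return ips


def matching_log_lines(log_lines: Iterable[str], known_ips: Set[str]) -> Iterator[str]:
    """Yield log lines whose first field (the client IP) is in known_ips."""
    for line in log_lines:
        fields = line.split(None, 1)
        if fields and fields[0] in known_ips:
            yield line.rstrip("\n")
```

With both files saved locally, something like `load_ips(open("google.txt"))` and `matching_log_lines(open("access.log"), ips)` would list the matching requests. Because address ranges can change, a downloaded list may be out of date.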


Source: https://habr.com/ru/post/1316535/

