Googlebot requests unexplained 32-character hexadecimal paths, causing more than 20,000 404 errors per day

I have a very interesting problem that I cannot explain.

Every 2-6 seconds Googlebot (I looked up the IP address with the host command and it appears to be genuine) requests a page on our website (stack: PHP, Apache, MongoDB) that does not exist, so it returns a 404. No other robot or human has ever requested such a page, only Googlebot.

Each request looks something like this:

/2de4f853c2853807b2e72387aa8928a4

/ea5700c343d1a9798bc554af7c1a330e

/e5aafa102d54ba7517703336846cc019

Our code does not use 32-character strings, and there are no links like these on our site or on external sites. We use CodeIgniter, so at first I thought it was the default session_id, but I checked and it is not.

Has anyone seen anything like this? Our site uses history.push on some pages; could that be the cause? Just an idea.

Raw data from an example request:

array (
  'date' => '2012-12-01',
  'time' => '10:01:33 PM',
  'additional_data' => array (
    'server_vars' => array (
      'REDIRECT_STATUS' => '200',
      'HTTP_HOST' => 'www.xxxxxxx.com',
      'HTTP_ACCEPT' => '*/*',
      'HTTP_ACCEPT_ENCODING' => 'gzip,deflate',
      'HTTP_FROM' => 'googlebot(at)googlebot.com',
      'HTTP_USER_AGENT' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
      'HTTP_X_FORWARDED_FOR' => 'xxxxxxx',
      'HTTP_X_FORWARDED_PORT' => '80',
      'HTTP_X_FORWARDED_PROTO' => 'http',
      'HTTP_CONNECTION' => 'keep-alive',
      'PATH' => '/sbin:/usr/sbin:/bin:/usr/bin:/home/ec2-user/ec2/bin',
      'SERVER_SIGNATURE' => '<address>Apache/2.2.22 (Amazon) Server at www.xxxxxxx.com Port 80</address> ',
      'SERVER_SOFTWARE' => 'Apache/2.2.22 (Amazon)',
      'SERVER_NAME' => 'www.xxxxxxx.com',
      'SERVER_ADDR' => 'xxxxxxxxxx',
      'SERVER_PORT' => '80',
      'REMOTE_ADDR' => '10.171.147.114',
      'REMOTE_PORT' => '40759',
      'REDIRECT_URL' => '/e5aafa102d54ba7517703336846cc019',
      'GATEWAY_INTERFACE' => 'CGI/1.1',
      'SERVER_PROTOCOL' => 'HTTP/1.1',
      'REQUEST_METHOD' => 'GET',
      'QUERY_STRING' => '',
      'REQUEST_URI' => '/e5aafa102d54ba7517703336846cc019',
      'SCRIPT_NAME' => '/index.php',
      'PATH_INFO' => '/e5aafa102d54ba7517703336846cc019',
      'PATH_TRANSLATED' => 'redirect:/index.php/e5aafa102d54ba7517703336846cc019',
      'PHP_SELF' => '/index.php/e5aafa102d54ba7517703336846cc019',
      'REQUEST_TIME' => 1354428093,
    ),
    'codeigiter_session' => array (
      'session_id' => 'c795e40a279f58d9fbbf7f5501a26787',
      'ip_address' => '10.171.147.114',
      'user_agent' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
      'last_activity' => 1354428093,
      'user_data' => '',
    ),
  ),
)

What else can I collect to understand this? It is very strange.


Update: The traffic comes from two primary IP addresses: 10.171.147.114 and 10.161.46.102.

I looked them up and they are not Googlebot.

I got this information from an IP lookup site:

Remember that the ranges 10.0.0.0 - 10.255.255.255, 172.16.0.0 - 172.31.255.255 and 192.168.0.0 - 192.168.255.255 are reserved for private networks, and 224.0.0.0 - 239.255.255.255 is reserved for multicast. An IP lookup for an address in any of these ranges will not return any results.
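Whether an address falls into one of these reserved ranges can also be checked locally. The following is a minimal sketch using Python's standard ipaddress module; the function name classify is hypothetical, and 66.249.66.1 is only an illustrative public address.

```python
import ipaddress

def classify(ip):
    """Return a rough category for an IPv4/IPv6 address string."""
    addr = ipaddress.ip_address(ip)
    if addr.is_multicast:
        # 224.0.0.0 - 239.255.255.255 for IPv4
        return "multicast"
    if addr.is_private:
        # includes 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16
        return "private"
    return "public"

# The two addresses from the update are both internal addresses:
for ip in ("10.171.147.114", "10.161.46.102", "66.249.66.1"):
    print(ip, classify(ip))
```

An address reported as private cannot belong to an external crawler; it has to come from inside the hosting network.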

What should I do with these requests? What is their purpose? If this is some type of DoS attack, they are doing a very poor job.

+6
2 answers

To answer my own question: the problem was caused by AWS load balancer health checks. For some reason AWS uses a Googlebot user agent when running them against our servers.
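The raw request data in the question is consistent with traffic arriving through a load balancer: REMOTE_ADDR is a private 10.x address and the request carries an X-Forwarded-For header. The following is a rough sketch of recovering the forwarded address from such a request. The function name effective_client_ip is hypothetical, and the sketch assumes that the last X-Forwarded-For entry is the one appended by your own balancer and that the header is only trusted when the direct peer is a private address.

```python
import ipaddress

def effective_client_ip(server_vars):
    """Pick the address to log for a request that may have passed a proxy.

    server_vars is a dict shaped like the 'server_vars' dump in the question.
    """
    remote = server_vars["REMOTE_ADDR"]
    forwarded = server_vars.get("HTTP_X_FORWARDED_FOR", "")
    entries = [part.strip() for part in forwarded.split(",") if part.strip()]
    # Only trust the header when the direct peer is internal (assumed to be
    # our own load balancer); clients can forge X-Forwarded-For themselves.
    if entries and ipaddress.ip_address(remote).is_private:
        return entries[-1]  # entry appended by the nearest proxy
    return remote

# 203.0.113.7 and 66.249.66.1 are placeholder addresses for illustration.
print(effective_client_ip({
    "REMOTE_ADDR": "10.171.147.114",
    "HTTP_X_FORWARDED_FOR": "203.0.113.7, 66.249.66.1",
}))
```

Requests that come from the balancer itself, such as health checks, would typically have no forwarded address, so the function falls back to REMOTE_ADDR for them.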

+1

The first thing to do is to collect as many of the IP addresses as possible and answer two questions:
1. Can you group them by network, for example 66.249.66.XXX or 66.249.XXX.XXX? If you can't, it's not Googlebot.
2. Which countries do these IP addresses belong to? If there are dozens, it's not Googlebot.
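The grouping step can be sketched in a few lines of Python. This is only an illustration: the function name group_by_prefix and the sample addresses are hypothetical, and in practice the list would come from your access log.

```python
import ipaddress
from collections import Counter

def group_by_prefix(ips, prefix_len):
    """Count how many addresses fall into each network of the given prefix length."""
    counts = Counter()
    for ip in ips:
        # strict=False lets us pass a host address and get its enclosing network
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts

sample = ["66.249.66.1", "66.249.66.200", "66.249.71.5", "10.171.147.114"]
print(group_by_prefix(sample, 24).most_common())  # like 66.249.66.XXX
print(group_by_prefix(sample, 16).most_common())  # like 66.249.XXX.XXX
```

If nearly all requests collapse into one or two networks, the traffic has a single origin; many unrelated networks point the other way.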

I don't think this looks like Googlebot, because it does not tend to crawl a site at this frequency, even one without a sitemap (except in some special cases, such as news sites).

Refer to

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=80553

to learn how to recognize Googlebot. Try some of the Googlebot IP lists published online. They may be outdated, but they still give you an idea of the address clusters. Moreover, Googlebot IPs are easily grouped by network.
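The check described on that Google page is a reverse DNS lookup on the requesting IP, followed by a forward lookup on the returned host name to confirm that it maps back to the same IP. Below is a minimal Python sketch of that idea. The function name is_verified_googlebot and the injectable reverse/forward parameters are hypothetical (they exist only so the logic can be exercised without network access), and the accepted domain suffixes are an assumption based on Google's guidance.

```python
import socket

# Assumed suffixes for genuine Googlebot host names.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse=None, forward=None):
    """Reverse-resolve ip, check the domain, then forward-resolve to confirm."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
    except OSError:  # no PTR record or lookup failure
        return False
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The forward lookup must map the host name back to the same address.
        return ip in forward(host)
    except OSError:
        return False

# Offline demonstration with stubbed resolvers (the host name is illustrative):
print(is_verified_googlebot(
    "66.249.66.1",
    reverse=lambda addr: "crawl-66-249-66-1.googlebot.com",
    forward=lambda host: ["66.249.66.1"],
))
```

Called with only an IP address, the function uses the real resolver. A private address such as 10.171.147.114 would be expected to fail this check, because it has no public reverse DNS entry under a Google domain.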

You cannot trust HTTP_USER_AGENT because a third party can easily fake it.

I would say that your site is under some kind of attack from a network.

I doubt they are trying to guess PHPSESSID by sending this hash. The only reason PHPSESSID would appear in the URL is if you had configured PHP not to store it in cookies (I assume you didn't). It is easier and more natural to send the session ID in a cookie, even during an attack.

Check the POST and GET parameters they send. This may give you more information.

-1