How to detect browser spoofing and robots from the user agent string in PHP

So far I can detect robots by matching the user agent string against a list of known bot user agents, but I was wondering what other methods exist for this in PHP, since I catch fewer bots than expected with this approach.

I also want to learn how to determine, from the user agent string, whether a browser (or a robot) is spoofing another browser.

Any advice is appreciated.

EDIT: This must be done using a log file with lines like the following:

129.173.129.168 - - [11/Oct/2011:00:00:05 -0300] "GET /cams/uni_ave2.jpg?time=1318302291289 HTTP/1.1" 200 20240 "http://faculty.dentistry.dal.ca/loanertracker/webcam.html" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23"

This means that I cannot check user behavior beyond access times.
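
For reference, these lines are in the Apache combined log format, and they can be split into fields in PHP with a regular expression. A minimal sketch, assuming exactly that format (the field names are just my own labels):

<?php
// Minimal sketch: split one Apache "combined" log line into named fields.
// The pattern and field names are illustrative assumptions, not a standard API.
function parse_log_line($line) {
    $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"/';
    if (!preg_match($pattern, $line, $m)) {
        return null; // the line does not match the expected format
    }
    return array(
        'ip'         => $m[1],
        'time'       => $m[2],
        'request'    => $m[3],
        'status'     => (int) $m[4],
        'bytes'      => $m[5],
        'referer'    => $m[6],
        'user_agent' => $m[7],
    );
}

foreach (file('access.log') as $line) { // assumed path to the log file
    $entry = parse_log_line($line);
    if ($entry !== null) {
        echo $entry['ip'] . ' => ' . $entry['user_agent'] . "\n";
    }
}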

+8
php user-agent bots
5 answers

In addition to filtering on keywords in the user agent string, I have had good luck placing a hidden honeypot link on all pages:

<a style="display:none" href="autocatch.php">A</a> 

Then in "autocatch.php" write down the session (or IP address) as a bot. This link is invisible to users, but a hidden feature, I hope, will not be implemented by bots. Taking a style attribute and adding it to the CSS file can help even more.

+13

As mentioned earlier, user agents and IP addresses can be spoofed, so they cannot be used to reliably detect bots.

I work for a security company, and our bot detection algorithm looks something like this:

  • Step 1 - Data Collection:

    a. Cross-verify the user agent against the IP (both must match; a simple self-serve version of this check is sketched after this list).

    b. Check header parameters (what is missing, what order they arrive in, etc.).

    c. Check behavior (early access to and compliance with robots.txt, general behavior, number of pages visited, request rates, etc.).

  • Step 2 - Classification:

    By cross-verifying the data, the bot is classified as Good, Bad or Suspicious.

  • Step 3 - Active Challenges:

    Suspicious bots are presented with the following challenges:

    a. JS challenge (can it execute JS?)

    b. Cookie challenge (can it accept cookies?)

    c. If still inconclusive → CAPTCHA
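
For the cross-verification in step 1a, one self-serve technique that works for the major crawlers that identify themselves honestly is forward-confirmed reverse DNS. A minimal sketch for a claimed Googlebot (my own illustration, not the commercial algorithm described above):

<?php
// Minimal sketch: confirm that an IP whose user agent claims "Googlebot"
// really resolves back to a Google-owned hostname (forward-confirmed reverse DNS).
function looks_like_real_googlebot($ip) {
    $host = gethostbyaddr($ip);                       // reverse DNS lookup
    if ($host === false || $host === $ip) {
        return false;                                 // no usable PTR record
    }
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;                                 // hostname is not Google's
    }
    return gethostbyname($host) === $ip;              // forward-confirm the hostname
}

// Usage: only trust a "Googlebot" user agent if the IP checks out.
if (isset($_SERVER['HTTP_USER_AGENT'])
        && strpos($_SERVER['HTTP_USER_AGENT'], 'Googlebot') !== false
        && !looks_like_real_googlebot($_SERVER['REMOTE_ADDR'])) {
    // treat as a suspicious or spoofing client
}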

This filtering mechanism is VERY effective, but I really don't think it can be replicated by one person or even by a non-specialized provider (for one thing, the challenges and the bot database have to be constantly updated by a security team).

We offer some do-it-yourself tools in the form of Botopedia.org, our directory, which can be used for IP/user-agent cross-verification, but for a truly effective solution you will have to rely on specialized services.

There are several free bot-monitoring solutions, including our own, and most of them use the same strategy I described above (or something like it).

GL

+5

In addition to simply comparing user agents, you should keep an activity log and look for robot-like behavior. Often this will include checking /robots.txt and not downloading images. Another trick is to ask the client whether it has JavaScript, since most bots will not report it as enabled.
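
Since the question is limited to a log file, here is a rough offline sketch of that idea: flag IPs that fetched /robots.txt, or that made many requests without ever loading a static asset (the log path, extensions and threshold are my own assumptions):

<?php
// Rough sketch: scan an access log and flag IPs that fetched /robots.txt
// or that never requested an image/CSS/JS asset. Paths and threshold are assumptions.
$hits = array(); $assets = array(); $robots = array();

foreach (file('access.log') as $line) {
    if (!preg_match('/^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)/', $line, $m)) {
        continue;
    }
    $ip = $m[1]; $path = $m[2];
    $hits[$ip] = isset($hits[$ip]) ? $hits[$ip] + 1 : 1;
    if (preg_match('/\.(png|jpe?g|gif|ico|css|js)(\?|$)/i', $path)) {
        $assets[$ip] = true;
    }
    if (strpos($path, '/robots.txt') === 0) {
        $robots[$ip] = true;
    }
}

foreach ($hits as $ip => $count) {
    if (isset($robots[$ip]) || (!isset($assets[$ip]) && $count > 20)) {
        echo "$ip looks bot-like ($count requests)\n";
    }
}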

However, be careful: you may accidentally flag some visitors who really are people.

+4

No, user agents can be tampered with, so they should not be trusted.

In addition to checking for JavaScript or for downloads of images/CSS, you can also measure the page request rate, since bots usually crawl a site much faster than any human visitor. But this only works on small sites; on popular sites, many visitors behind a shared external IP address (a large corporation or a campus) can hit your site at bot-like rates.

I suppose you could also look at the order in which pages are requested, since bots tend to crawl in a systematic order (e.g. following links breadth-first or depth-first), whereas human users usually don't fit that pattern, but that is a little harder to track.
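
Measuring the request rate from the asker's log file is straightforward; a quick sketch that counts requests per IP per minute (the log path and the threshold are arbitrary assumptions):

<?php
// Quick sketch: count requests per IP per minute and report bot-like bursts.
$perMinute = array(); // "ip timestamp-to-the-minute" => request count

foreach (file('access.log') as $line) {
    // Capture the IP and the timestamp down to the minute, e.g. "11/Oct/2011:00:00"
    if (preg_match('#^(\S+) \S+ \S+ \[(\d+/\w+/\d+:\d+:\d+)#', $line, $m)) {
        $key = $m[1] . ' ' . $m[2];
        $perMinute[$key] = isset($perMinute[$key]) ? $perMinute[$key] + 1 : 1;
    }
}

foreach ($perMinute as $key => $count) {
    if ($count > 60) { // sustained more than one request per second for a whole minute
        echo "$key -> $count requests (bot-like rate)\n";
    }
}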

+2

Your question specifically relates to detection using the user agent string. As many have said, this can be faked.

To understand what spoofing is possible, and how difficult it is to detect, you are probably best off learning the art of PHP with cURL.

In essence, with cURL almost everything in a browser (client) request can be faked, with the notable exception of the IP address, and even there a determined spoofer can hide behind a proxy server to defeat your IP detection.
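
For example, spoofing the user agent and referer takes only a couple of curl_setopt() calls (the target URL here is just a placeholder):

<?php
// Minimal sketch: a request whose user agent and referer are spoofed with cURL.
$ch = curl_init('http://example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23');
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/some-page.html');
$body = curl_exec($ch);
curl_close($ch);

On the server, a request like this is indistinguishable by user agent alone from the Firefox hit in the question's log line.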

It goes without saying that using the same parameters on every request would allow the spoofing to be detected, but rotating the various parameters makes it hard to spot spoofers among real traffic logs.

+1
