Can a page scraper be detected?

So, I just created an application that scrapes a page for me, and I launched it. It worked fine. I was wondering whether anyone would be able to tell that the content was copied by a program, whether or not they had written any code specifically to detect that.

I wrote the code in Java, and it pretty much just fetches the page and checks one line of the HTML.
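For context, roughly what the program does, as a stripped-down sketch (the URL and the line being checked are placeholders, not the real site):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class PageCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; the real program points at the page being watched.
            URL url = new URL("https://example.com/some-page");

            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // The whole "scraper" is just a check against one line of the HTML.
                    if (line.contains("<title>")) {
                        System.out.println("Found it: " + line.trim());
                        break;
                    }
                }
            }
        }
    }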

I did think about this a bit before adding the code to the program. I mean, it's useful and that's all, but it feels almost like hacking.

The worst-case scenario from this page scraper doesn't seem so bad, since I can just use another device later and the IP will be different. It may not even matter in a month. Right now the website seems to be getting quite a lot of traffic, the person who edits the page is probably asleep, and nothing has come of it so far, so this may well go unnoticed.

Thanks for the quick answers. I think this will go unnoticed. All I did was copy the title, so just the text; it probably looks the same as a copy-paste from a browser. The page was edited just this morning, including the text I was trying to get. If they noticed anything, they haven't said so, so everything seems fine.

+3
7 answers

This is a hack. :)

There is no way to determine programmatically whether a page is being scraped. But if your scraper becomes popular, or you use it too much, it is possible to detect a scraper statistically. If one IP fetches the same page or pages at the same time every day, you can make an educated guess. The same goes if you see requests arriving on a fixed timer.

You should try to obey the site's robots.txt if you can, and rate-limit yourself to be polite.
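A very naive sketch of both points: it fetches robots.txt, does a crude Disallow check, and waits after the request. The site, path, and delay are placeholders, and a real implementation would parse robots.txt properly per user-agent group.

    import java.net.URI;
    import java.net.http.*;

    public class PoliteCheck {
        public static void main(String[] args) throws Exception {
            String site = "https://example.com";   // placeholder site
            String path = "/some-page";            // placeholder path to scrape

            HttpClient client = HttpClient.newHttpClient();
            String robots = client.send(
                    HttpRequest.newBuilder(URI.create(site + "/robots.txt")).build(),
                    HttpResponse.BodyHandlers.ofString()).body();

            // Crude check: is the path disallowed anywhere in robots.txt?
            for (String line : robots.split("\n")) {
                line = line.trim();
                if (line.startsWith("Disallow:")) {
                    String rule = line.substring(9).trim();
                    if (!rule.isEmpty() && path.startsWith(rule)) {
                        System.out.println("robots.txt disallows " + path + "; not fetching.");
                        return;
                    }
                }
            }

            // Allowed: fetch it, then wait before any further request to stay polite.
            client.send(HttpRequest.newBuilder(URI.create(site + path)).build(),
                    HttpResponse.BodyHandlers.ofString());
            Thread.sleep(5000);
        }
    }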

+5

Speaking as a system administrator myself: yes, I probably would have noticed, but ONLY based on client behavior. If the client had a strange user agent, I would be suspicious. If the client browsed the site too quickly or at very predictable intervals, I would be suspicious. If supporting files were never requested (favicon.ico, the CSS and JS files linked from the page), I would be suspicious. If the client was accessing odd (not linked-to) pages, I would be suspicious.

Then again, I would have to actually look at my logs, and Slashdot was especially interesting this week, so I probably wouldn't have noticed.
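For illustration, a rough sketch of the kind of check one might run over an access log, assuming a simplified "IP PATH" log format (real logs need real parsing): it flags clients that request many pages but never fetch favicon.ico.

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    public class LogHeuristic {
        public static void main(String[] args) throws IOException {
            Map<String, Integer> pageHits = new HashMap<>();
            Set<String> fetchedFavicon = new HashSet<>();

            for (String line : Files.readAllLines(Paths.get("access.log"))) {
                String[] parts = line.split("\\s+");
                if (parts.length < 2) continue;
                String ip = parts[0], path = parts[1];
                if (path.equals("/favicon.ico")) {
                    fetchedFavicon.add(ip);
                } else {
                    pageHits.merge(ip, 1, Integer::sum);
                }
            }

            // Many page requests but no favicon request looks like a bot.
            pageHits.forEach((ip, hits) -> {
                if (hits > 50 && !fetchedFavicon.contains(ip)) {
                    System.out.println("Suspicious client: " + ip + " (" + hits + " hits, no favicon)");
                }
            });
        }
    }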

+5

It depends on how you implemented it and how smart the detection tools are.

Take care of the user agent first. If you don't set it explicitly, it will be something like "Java/1.6". Browsers send their own distinctive user agents, so you can simply simulate browser behavior and send the User-Agent of MSIE or Firefox (for example).

Second, check the other HTTP headers. Browsers send their own specific headers too. Pick one browser as a model and follow it, i.e. add the same headers to your requests (even if you don't strictly need them).

A person acts relatively slowly. A robot can act very quickly: get the page and then immediately "click", i.e. issue another HTTP GET. Put a random sleep between these operations.
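A minimal sketch of the points above (browser-like User-Agent and headers, plus a random pause between requests). The URLs and header values are just placeholders:

    import java.net.URI;
    import java.net.http.*;
    import java.util.Random;

    public class PoliteFetch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            Random random = new Random();
            String[] pages = { "https://example.com/page1", "https://example.com/page2" };

            for (String page : pages) {
                // Imitate a browser's headers instead of the default "Java/..." agent.
                HttpRequest request = HttpRequest.newBuilder(URI.create(page))
                        .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0")
                        .header("Accept", "text/html,application/xhtml+xml")
                        .header("Accept-Language", "en-US,en;q=0.5")
                        .build();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(page + " -> " + response.statusCode());

                // Random sleep between requests, so the timing is not machine-regular.
                Thread.sleep(2000 + random.nextInt(5000));
            }
        }
    }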

A browser fetches more than just the main HTML: it then downloads images and other resources. If you really do not want to be discovered, you need to parse the HTML and download that material too, i.e. actually behave like a "browser".
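A sketch of that idea using the jsoup library (an assumption on my part; any HTML parser would do): fetch the page, then fetch the images, scripts, and stylesheets it references.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class FetchLikeABrowser {
        public static void main(String[] args) throws Exception {
            // Fetch the main page, then the resources it references, as a browser would.
            Document doc = Jsoup.connect("https://example.com/")
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0")
                    .get();

            for (Element el : doc.select("img[src], script[src], link[href]")) {
                String url = el.hasAttr("src") ? el.absUrl("src") : el.absUrl("href");
                if (url.isEmpty()) continue;
                // ignoreContentType lets jsoup download images, CSS, JS, etc.
                Jsoup.connect(url).ignoreContentType(true).execute();
                System.out.println("Fetched resource: " + url);
            }
        }
    }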

And one last thing. It obviously doesn't apply in your case, but it is almost impossible to implement a robot that passes a CAPTCHA, so that is another way to detect a robot.

Happy hacking!

+1

If your scraper acts like a person, it is unlikely to be detected as a scraper. But if it acts like a robot, it is not difficult to detect.

To act like a person, you need to (a short sketch follows the list):

  • Look at what the browser sends in its HTTP headers and imitate them.

  • Look at what else the browser requests when it loads the page, and request that with your scraper too.

  • Pace your scraper so it accesses the site at the speed of a regular user.

  • Send requests at random intervals rather than at fixed intervals.

  • If possible, make requests from a dynamic IP rather than a static one.
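For the last point, one possible approach (just a sketch; the proxy host and port are placeholders) is to route requests through a proxy, so the server sees the proxy's IP rather than your own:

    import java.net.*;
    import java.io.*;

    public class ProxyFetch {
        public static void main(String[] args) throws Exception {
            // Placeholder proxy address; switching proxies changes the source IP the server sees.
            Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy.example.com", 8080));

            HttpURLConnection conn =
                    (HttpURLConnection) new URL("https://example.com/").openConnection(proxy);
            conn.setRequestProperty("User-Agent",
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0");

            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                System.out.println(in.readLine()); // first line of the response body
            }
        }
    }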

+1

Provided that you wrote the page scraper in the usual way, i.e. it fetches the entire page and then runs pattern matching to extract whatever you want from it, the most anyone can tell is that the page was fetched by a robot rather than a normal browser. All their logs will show is that the whole page was retrieved; they cannot tell what you do with it once it is in your memory.
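For example, a sketch of that usual approach: one GET for the whole page (which is all the server ever sees), then a local regular-expression match to pull out the title. The URL is a placeholder.

    import java.net.URI;
    import java.net.http.*;
    import java.util.regex.*;

    public class TitleScraper {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/")).build();

            // The server only ever sees this one GET for the whole page.
            String html = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

            // What happens next is invisible to the server: pattern-match locally.
            Matcher m = Pattern.compile("<title>(.*?)</title>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
            if (m.find()) {
                System.out.println("Title: " + m.group(1).trim());
            }
        }
    }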

0

To the server serving the page, it makes no difference whether you view the page in a browser or download it and scrape it. Both actions amount to an HTTP request; whatever you do with the resulting HTML on your end is none of the server's business.

Having said that, a sophisticated server could conceivably detect activity that does not look like a normal browser. For example, a browser would request the additional resources linked from the page, which usually does not happen when screen scraping. Or requests coming from one address at an unusual frequency. Or simply the HTTP User-Agent header.

Whether a server actually tries to detect these things depends on the server; most don't bother.

0

I would like to put in my two cents for others who may read this. Over the past couple of years, web scraping has increasingly been challenged in the courts. I cited many examples in a blog post I recently wrote.

You should follow the site's robots.txt file and read the website's terms and conditions to make sure you are not breaking any rules. There are ways people can identify that you are web scraping, and there could be consequences. If web scraping is not prohibited by the website's terms of use, then have fun, but still be considerate. Do not hammer the web server with an out-of-control bot; throttle yourself to make sure you do not affect the server!

For full disclosure, I co-founded Distil Networks, and we help companies identify and stop web scrapers and bots.

-2
