Google Scholar crawl

I am trying to collect information about a large number of scientific articles as part of my research. The number of articles is on the order of thousands. Since Google Scholar does not have an API, I'm trying to scrape/crawl Scholar. Now, I know that this is technically against the EULA, but I try to be very polite and reasonable about it. I understand that Google disallows bots in order to keep traffic within reasonable limits. I started with a test batch of ~500 requests with 1 s between each request. I was blocked after about the first 100 requests. I tried several other strategies, including:

  • Extending the pauses to ~20 s and adding some random noise to them
  • Making the pauses logarithmically distributed (so most pauses are on the order of seconds, but every now and then there is a pause of several minutes or more)
  • Taking long pauses (several hours) between blocks of requests (~100).
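
For readers who want to reproduce the pause strategies listed above, here is a minimal sketch of how such a schedule might be generated. It is not the script used in the question: the function name `pause_seconds`, the log-normal parameters, and the 2 to 4 hour range for the long pause are all assumptions chosen to match the rough figures given (~20 s typical pause, occasional pauses of minutes, several hours between blocks of ~100).

```python
import random

BLOCK_SIZE = 100                         # requests per block ("~100" in the list above)
LONG_PAUSE_RANGE = (2 * 3600, 4 * 3600)  # "several hours"; exact bounds are assumed


def pause_seconds(request_index, rng=random):
    """Return how long to sleep after the request with this 1-based index.

    Hypothetical helper: the distribution parameters are illustrative only.
    """
    if request_index % BLOCK_SIZE == 0:
        # End of a block: take one long pause of several hours.
        return rng.uniform(*LONG_PAUSE_RANGE)
    # Log-normal pause: the median is exp(mu), roughly 20 s, and the
    # heavy right tail occasionally yields pauses of several minutes.
    return rng.lognormvariate(mu=3.0, sigma=1.0)
```

In a crawl loop this would be used as `time.sleep(pause_seconds(i))` after the i-th request. As the question reports, pacing of this kind did not prevent the block on its own.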

I doubt that at this point my script adds any significant traffic beyond what a human user would generate. Still, I always get blocked after ~100-200 requests. Does anyone know of a good strategy to overcome this? (I don't care if it takes weeks, as long as it is automated.) Also, does anyone have experience dealing with Google directly and asking for permission to do something like this (for research, etc.)? Is it worth writing to them, explaining what I'm trying to do and how, and seeing whether I can get permission for my project? And how would I go about contacting them? Thank you

1 answer

Without having tested it, I'm still fairly sure one of the following will do the trick:

  • Easy, but with a small chance of success:

    Delete all cookies from the site in question after every rand(0,100) requests,
    then change your user agent, accepted language, etc. and repeat.

  • A bit more work, but a much sturdier spider as a result:

    Send your requests through Tor, other proxies, mobile networks, etc. to mask your IP (and also apply suggestion 1 at every turn).
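
As a rough illustration of both suggestions above, the sketch below builds a fresh HTTP client with an empty cookie jar, a newly chosen User-Agent and Accept-Language, and an optional proxy. It uses only Python's standard library. The names `new_session` and `requests_until_reset`, the header values, and the proxy address in the usage note are hypothetical and not part of the original answer. Note also that `urllib` handles HTTP proxies only, so routing through Tor (which exposes a SOCKS proxy) would need a SOCKS-capable client instead.

```python
import random
import urllib.request
from http.cookiejar import CookieJar

# Example header values only; assumed for illustration.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:115.0) Gecko/20100101 Firefox/115.0",
]
LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8", "de-DE,de;q=0.9,en;q=0.5"]


def new_session(proxy=None, rng=random):
    """Build an opener with an empty cookie jar and freshly chosen headers.

    `proxy` is an optional HTTP proxy URL (hypothetical parameter).
    """
    # A new CookieJar means every cookie from the previous session is gone.
    handlers = [urllib.request.HTTPCookieProcessor(CookieJar())]
    if proxy:
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [
        ("User-Agent", rng.choice(USER_AGENTS)),
        ("Accept-Language", rng.choice(LANGUAGES)),
    ]
    return opener


def requests_until_reset(rng=random):
    """Pick how many requests to send before discarding the session.

    The answer says rand(0,100); 1 is used as the lower bound here so that
    every session sends at least one request.
    """
    return rng.randint(1, 100)
```

A crawl loop would call `new_session(proxy="http://127.0.0.1:8080")` (address assumed), send `requests_until_reset()` requests through `opener.open(url)`, then throw the opener away and build a new one. Whether this is enough to avoid the block is untested, as the answer itself says.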

Update regarding Selenium: I missed the fact that you are using Selenium. I had taken it for granted that you were using some kind of modern programming language (I know that Selenium can be driven from most widely used languages, but it can also be used as a kind of browser plugin that requires very little programming skill).

Since I then assume that your coding skills aren't (or weren't?) stunning, my answer for you, and for others with the same limitations when using Selenium, is to learn either a simple scripting language (PowerShell?!) or JavaScript (since it's the web you're dealing with ;-)) and take it from there.

If automating scraping smoothly were as easy as installing a browser plugin, the web would have to be a much messier, more confusing, and more locked-down place.
