What is the best solution for hosting a crawler?

I have a crawler that scans several different domains for new posts/content. The total amount of content is around one hundred thousand pages, and a lot of new content is added every day, so to keep up with all of it the crawler needs to run 24 hours a day.

Currently I host the crawler script on the same server as the site the crawler adds content to, and I can only run the cronjob at night, because the load from the script basically brings the website to a halt. In other words, a pretty crappy solution.

So basically I'm wondering what my best options are:

  • Is it possible to run the crawler on the same host, but balance the load somehow so the script doesn't kill the website?

  • What kind of host/server should I look for to run the crawler? Are there any specs I need beyond regular web hosting?

  • The crawler saves the images it scans. If I host the crawler on a secondary server, how do I get the images onto my site's server? I assume I don't want to chmod 777 my uploads folder and let anyone upload files to my server.

Tags: performance, webserver, web-crawler, hosting
1 answer

I decided to go with Amazon Web Services to host my crawler, since it has both SQS for queues and auto-scaling instances. It also has S3, where I can store all my images.

I also decided to rewrite the entire crawler in Python instead of PHP, to make it easier to work with things like queues and to keep the application running 100% of the time instead of relying on cronjobs.

So here is what I did and what it means:

  • I set up an Elastic Beanstalk application for my crawler, configured as a "Worker" tier that listens to SQS, where I store all the domains that need to be crawled. SQS is a queue: I push in every domain that has to be crawled, and the crawler listens to the queue and receives one domain at a time until the queue is empty. There is no need for cronjobs or anything like that; as soon as data lands in the queue, it is handed to the crawler, so the crawler is busy 100% of the time, 24/7. (There is a rough sketch of the queue side after this list.)

  • The application is configured to auto-scale, which means that when there are too many domains in the queue, it spins up a second, third, fourth, etc. instance/crawler to speed up the process. I think this is a very important point for anyone setting up a crawler. (The scaling limits are plain configuration; a sketch of that follows the results below.)

  • All images are saved to an S3 bucket. This means the images are not stored on the crawler's server and can easily be accessed and processed from anywhere. (Also sketched below the list.)
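
To make the queue part of this more concrete, here is a minimal sketch of pushing domains into SQS and consuming them with boto3 (the AWS SDK for Python). The queue URL and the `crawl_domain` function are placeholders rather than my actual code, and note that on the Elastic Beanstalk worker tier the built-in daemon delivers queue messages to your application over HTTP rather than you polling SQS yourself; the loop below just shows the same idea in its simplest form:

```python
import boto3

# Hypothetical queue URL for illustration; create the queue in SQS first.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/crawl-domains"

sqs = boto3.client("sqs", region_name="eu-west-1")

def enqueue_domain(domain):
    """Push one domain onto the queue so a worker instance picks it up."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=domain)

def run_worker(crawl_domain):
    """Poll the queue forever and crawl one domain at a time.

    `crawl_domain` stands in for the crawler's own logic (not shown here).
    """
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling keeps idle waiting cheap
        )
        for msg in resp.get("Messages", []):
            crawl_domain(msg["Body"])
            # Delete only after a successful crawl so failed messages reappear.
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
```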

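The image handling is just as small. Here is a sketch of uploading a scraped image straight to S3 with boto3; the bucket name and key layout are made-up examples:

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")
BUCKET = "my-crawler-images"  # hypothetical bucket name

def save_image(image_bytes, key):
    """Upload one scraped image to S3 and return a URL the website can use."""
    s3.put_object(
        Bucket=BUCKET,
        Key=key,  # e.g. "example.com/images/photo.jpg"
        Body=image_bytes,
        ContentType="image/jpeg",
    )
    # Nothing is ever written to the web server's uploads folder,
    # so there is no need to chmod 777 anything.
    return "https://{}.s3.amazonaws.com/{}".format(BUCKET, key)
```

The website then serves the images straight from S3 (or through a CDN), which also takes care of the chmod 777 worry from the question.
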
The results have been great. When I had the PHP crawler running on cronjobs every 15 minutes, I could crawl about 600 URLs per hour. Now I can crawl 10,000+ URLs per hour, and even more depending on how I configure the autoscaling.

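For reference, the autoscaling limits mentioned above are just Elastic Beanstalk option settings. A sketch of adjusting them from Python with boto3; the environment name and instance counts are made up, and the same settings can also be changed in the console or in .ebextensions:

```python
import boto3

eb = boto3.client("elasticbeanstalk", region_name="eu-west-1")

# Environment name and instance limits are made-up examples.
eb.update_environment(
    EnvironmentName="my-crawler-worker-env",
    OptionSettings=[
        # The worker environment scales between these instance counts.
        {"Namespace": "aws:autoscaling:asg", "OptionName": "MinSize", "Value": "1"},
        {"Namespace": "aws:autoscaling:asg", "OptionName": "MaxSize", "Value": "4"},
    ],
)
```
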
