How do scrapers limit execution time?

How do scrapers decide when to stop a scheduled run? Does it depend on actual elapsed (wall-clock) time or on processor time? Or maybe something else?

I am scraping a site for which Mechanize takes 30 seconds to load each page, but I use very little CPU to process the pages, so I wonder whether server slowness is going to be the main problem.


CPU time, not wall-clock time. It is based on the Linux setrlimit system call.
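As a rough illustration of the mechanism (a sketch, not ScraperWiki's actual code), Python's standard resource and signal modules can impose the same kind of limit. The limit values and the busy loop below are made up for the demo; the key point for your question is that only CPU seconds count, so time spent blocked waiting on a slow server does not.

```python
import resource
import signal
import sys

# Cap this process's CPU time. ScraperWiki's limit is roughly 80 seconds;
# a tiny soft/hard limit is used here so the demo finishes quickly.
resource.setrlimit(resource.RLIMIT_CPU, (2, 3))

def on_cpu_limit(signum, frame):
    # The kernel sends SIGXCPU when the soft limit is exceeded.
    # This is the last chance to save state before the hard limit kills us.
    print("CPU time limit reached", file=sys.stderr)
    sys.exit(1)

signal.signal(signal.SIGXCPU, on_cpu_limit)

# Busy work triggers the limit; time.sleep() or waiting on a socket would not,
# because blocked time is not CPU time.
while True:
    sum(i * i for i in range(100000))
```

So in your case, the 30 seconds Mechanize spends waiting for each page should barely count against the limit.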

Each scraper has a limit of approximately 80 seconds of processing time. After that, in Python and Ruby you will get the ScraperWiki CPU time-out exception. In PHP, the run ends with "terminated by SIGXCPU".

In many cases, this happens when you first scrape a site, while catching up with the backlog of existing data. The best way to handle it is to make your scraper process a chunk at a time, using the save_var and get_var functions (see http://scraperwiki.com/docs/python/python_help_documentation/) to remember your place.

Checkpointing this way also makes it easier to recover from other parsing errors.
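A minimal sketch of that chunking pattern, assuming the classic ScraperWiki Python API (scraperwiki.sqlite.get_var / save_var, as in the docs linked above); the listing URL and page count are hypothetical:

```python
import mechanize
import scraperwiki

# Resume from wherever the previous run stopped; default to page 1 the first time.
start_page = scraperwiki.sqlite.get_var('last_page', 1)

browser = mechanize.Browser()

for page in range(start_page, 500):  # 500 stands in for the real number of pages
    html = browser.open('http://example.com/listing?page=%d' % page).read()
    # ... parse html and store the records with scraperwiki.sqlite.save() here ...

    # Checkpoint after each page, so a run that hits the CPU limit
    # simply continues from here on its next scheduled run.
    scraperwiki.sqlite.save_var('last_page', page + 1)
```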
