How do scrapers limit execution time?

How do scrapers decide when to stop a scheduled run? Does it depend on actual elapsed (wall-clock) time or on processor time? Or maybe something else?

I am scraping a site for which Mechanize takes 30 seconds to load each page, but I use very little CPU to process the pages, so I wonder whether server slowness is going to be the main problem.


CPU time, not wall-clock time. It is based on the Linux setrlimit system call.
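As a rough illustration of the mechanism (a sketch, not ScraperWiki's actual code), Python's standard resource and signal modules can impose the same kind of limit. The limit values and the busy loop below are made up for the demo; the key point for your question is that only CPU seconds count, so time spent blocked waiting on a slow server does not.

```python
import resource
import signal
import sys

# Cap this process's CPU time. ScraperWiki's limit is roughly 80 seconds;
# a tiny soft/hard limit is used here so the demo finishes quickly.
resource.setrlimit(resource.RLIMIT_CPU, (2, 3))

def on_cpu_limit(signum, frame):
    # The kernel sends SIGXCPU when the soft limit is exceeded.
    # This is the last chance to save state before the hard limit kills us.
    print("CPU time limit reached", file=sys.stderr)
    sys.exit(1)

signal.signal(signal.SIGXCPU, on_cpu_limit)

# Busy work triggers the limit; time.sleep() or waiting on a socket would not,
# because blocked time is not CPU time.
while True:
    sum(i * i for i in range(100000))
```

So in your case, the 30 seconds Mechanize spends waiting for each page should barely count against the limit.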

Each scraper has a limit of approximately 80 seconds of processing time. After that, in Python and Ruby you will get the ScraperWiki CPU time-out exception. In PHP, the run ends with "terminated by SIGXCPU".

In many cases, this happens when you first scrape a site, while catching up with the backlog of existing data. The best way to handle it is to make your scraper process a chunk at a time, using the save_var and get_var functions (see http://scraperwiki.com/docs/python/python_help_documentation/) to remember your place.

Checkpointing this way also makes it easier to recover from other parsing errors.
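A minimal sketch of that chunking pattern, assuming the classic ScraperWiki Python API (scraperwiki.sqlite.get_var / save_var, as in the docs linked above); the listing URL and page count are hypothetical:

```python
import mechanize
import scraperwiki

# Resume from wherever the previous run stopped; default to page 1 the first time.
start_page = scraperwiki.sqlite.get_var('last_page', 1)

browser = mechanize.Browser()

for page in range(start_page, 500):  # 500 stands in for the real number of pages
    html = browser.open('http://example.com/listing?page=%d' % page).read()
    # ... parse html and store the records with scraperwiki.sqlite.save() here ...

    # Checkpoint after each page, so a run that hits the CPU limit
    # simply continues from here on its next scheduled run.
    scraperwiki.sqlite.save_var('last_page', page + 1)
```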
