I'm trying to build a non-trivial GAE application, and I'm not sure whether I need a cron job, task queues, backends, or some combination of them, given the request timeout limits that GAE imposes on HTTP requests.
The application has to perform these steps:
1) I have more than 15,000 sites that I need to pull data from on a regular schedule, without any user interaction. The total number of sites will not be static; all of them are stored in the datastore [Table0] along with the interval at which each one should be read. The interval can vary from as often as every day to once every 30 days. (A sketch of the cron fan-out for this step follows the list.)
2) For each site from step 1 that meets its pull schedule, I need to fetch data from it via HTTP GET (this may be all of the sites or just 2 or 3 of them). Once I get a response from a site, I parse the result and save the data in the datastore as [Table1] (see the pull-worker sketch below).
3) For all the data that has just been written to [Table1] (the rows carry a special flag), I need to send another HTTP request to a third-party site to perform some additional processing. Once I receive the response from that site, I save all the relevant information in another table, [Table2], in the datastore.
4) As soon as the data from step 3 is available, I need to take all of it, perform some additional transformation, and update the source table [Table1] in the datastore. (Steps 3 and 4 are chained together in the last sketch below.)
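
To make the question concrete, here is roughly what I have in mind for step 1 on the Python 2.7 runtime. This is just a sketch; the Site model and the /cron/poll and /tasks/pull URLs are placeholders I made up:

    # A sketch of step 1. The Site model and the /cron/poll and
    # /tasks/pull URLs are placeholders, not a real schema.
    #
    # cron.yaml entry (scans daily, the shortest interval in [Table0]):
    #   cron:
    #   - description: scan sites that are due for a pull
    #     url: /cron/poll
    #     schedule: every 24 hours
    from datetime import datetime

    import webapp2
    from google.appengine.api import taskqueue
    from google.appengine.ext import ndb

    class Site(ndb.Model):
        """One row of [Table0]: a site plus its pull interval."""
        url = ndb.StringProperty(required=True)
        interval_days = ndb.IntegerProperty(default=1)  # 1..30 days
        next_pull = ndb.DateTimeProperty()

    class PollCronHandler(webapp2.RequestHandler):
        def get(self):
            # The cron request only scans and enqueues; each HTTP GET runs
            # later in its own task, so the 15,000-site fan-out never has
            # to fit inside a single request deadline.
            now = datetime.utcnow()
            for key in Site.query(Site.next_pull <= now).iter(keys_only=True):
                taskqueue.add(url='/tasks/pull', params={'site': key.urlsafe()})

    app = webapp2.WSGIApplication([('/cron/poll', PollCronHandler)])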
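
Step 2 would then be the task handler behind /tasks/pull, something like this (again a sketch; PullResult stands in for [Table1]):

    import webapp2
    from google.appengine.api import taskqueue, urlfetch
    from google.appengine.ext import ndb

    class PullResult(ndb.Model):
        """One row of [Table1]; needs_processing is the 'special flag'."""
        site = ndb.KeyProperty(kind='Site')
        raw = ndb.TextProperty()
        needs_processing = ndb.BooleanProperty(default=True)

    class PullWorker(webapp2.RequestHandler):
        def post(self):
            site = ndb.Key(urlsafe=self.request.get('site')).get()
            # Task queue requests get a 10-minute deadline, so a 60-second
            # fetch plus parsing fits comfortably.
            resp = urlfetch.fetch(site.url, deadline=60)
            if resp.status_code != 200:
                self.error(500)  # non-200: let the queue retry this task
                return
            result = PullResult(site=site.key, raw=resp.content)
            result.put()
            # Chain straight into step 3 for this row.
            taskqueue.add(url='/tasks/process',
                          params={'result': result.key.urlsafe()})

    app = webapp2.WSGIApplication([('/tasks/pull', PullWorker)])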
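
And steps 3 and 4 chained into one worker per [Table1] row (THIRD_PARTY_URL and transform() are placeholders for the third-party endpoint and my conversion logic):

    import urllib

    import webapp2
    from google.appengine.api import urlfetch
    from google.appengine.ext import ndb

    THIRD_PARTY_URL = 'https://example.com/enrich'  # placeholder

    def transform(raw, enriched):
        # Placeholder for the step 4 conversion logic.
        return raw

    class Enrichment(ndb.Model):
        """One row of [Table2]."""
        result = ndb.KeyProperty(kind='PullResult')
        payload = ndb.TextProperty()

    class ProcessWorker(webapp2.RequestHandler):
        def post(self):
            result = ndb.Key(urlsafe=self.request.get('result')).get()
            # Step 3: send the fresh row to the third-party site.
            resp = urlfetch.fetch(THIRD_PARTY_URL,
                                  payload=urllib.urlencode({'data': result.raw}),
                                  method=urlfetch.POST,
                                  deadline=60)
            Enrichment(result=result.key, payload=resp.content).put()
            # Step 4: transform and update the source row, clearing the flag.
            result.raw = transform(result.raw, resp.content)
            result.needs_processing = False
            result.put()

    app = webapp2.WSGIApplication([('/tasks/process', ProcessWorker)])

Chaining each step as its own task would keep every external HTTP call under the task-queue deadline rather than the shorter user-request one, which is what makes me suspect task queues driven by a single cron job are the right fit, but I'd like to confirm.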
I'm not sure which of the various components I need to use to make sure each part of the work completes without exceeding the response deadline GAE places on web requests. I believe requests initiated by cron jobs and task queue tasks are allowed 10 minutes to complete, while typical user-driven requests are limited to 30 seconds.