Tasks, Cron, or Backends for the Application

I'm trying to build a non-trivial GAE application, and I'm not sure whether cron jobs, task queues, backends, or a combination of all three is what I need, given the timeout limits GAE places on HTTP requests.

I have to follow these steps:

1) I have more than 15,000 sites that I need to pull data from on a regular schedule, without any user interaction. The total number of sites will not be static; all of them are stored in the datastore [Table0] along with the interval at which each one is read. The interval can vary from as often as every day to as seldom as every 30 days. (A rough datastore sketch follows step 4 below.)

2) For each site from step 1 that meets the pull-schedule criteria, I need to fetch its data via HTTP GET (again, that could be all of them or just 2 or 3 sites). As soon as I get a response from a site, I parse the result and save the data in the datastore as [Table1].

3) For all the data recently placed in [Table1] (such rows carry a special flag), I need to send an additional HTTP request to a third-party site for some extra processing. As soon as I receive that site's response, I save all the relevant information to another table, [Table2], in the datastore.

4) As soon as the data from step 3 is available, I need to take all of it, perform some additional transformation, and update the source table [Table1] in the datastore.
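
To make step 1 concrete, here is a minimal sketch, in GAE's Python ndb API, of how [Table0] and the "due for a pull" query could look. Every model and property name here (Site, interval_days, next_pull) is my own illustration, not something from the actual application:

```python
from datetime import datetime, timedelta

from google.appengine.ext import ndb


class Site(ndb.Model):
    """One row of [Table0]: a site plus the interval it is read at."""
    url = ndb.StringProperty(required=True)
    interval_days = ndb.IntegerProperty(default=1)  # anywhere from 1 to 30
    next_pull = ndb.DateTimeProperty()              # when it is next due


def sites_due_for_pull(now=None):
    """Return every site whose scheduled pull time has arrived."""
    now = now or datetime.utcnow()
    return Site.query(Site.next_pull <= now).fetch()


def mark_pulled(site):
    """After a successful pull, push next_pull forward by the interval."""
    site.next_pull = datetime.utcnow() + timedelta(days=site.interval_days)
    site.put()
```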

I'm not sure which of these components I should use so that each part of the work completes without exceeding the response deadline GAE imposes on web requests. For requests initiated by cron jobs and task queues, I believe you are allowed 10 minutes to complete, while typical user-driven requests are limited to 30 seconds.
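
For reference, the cron-initiated entry point is just a URL declared in cron.yaml; a minimal, purely illustrative entry (the handler URL and the schedule are assumptions) would look like this:

```yaml
cron:
- description: kick off the scheduled site pulls
  url: /cron/schedule_pulls
  schedule: every 24 hours
```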

+7
3 answers

GAE is a tough platform for your use case. But, out of sheer masochism, I'm attempting something similar. So here are my two cents, based on my experience:

  • Backends - use them for any long-running, I/O-intensive tasks you may have (web crawling is a good example, assuming you can defer the computation-heavy processing until later).
  • MapReduce API - great for compute-intensive / parallel jobs such as statistics collection, indexing, etc. Until recently the library only had a mapper implementation, but Google has since released an in-memory shuffler, which is good for jobs of roughly 100 MB or less.
  • Task queues - for when all else fails :-).
  • Cron - mostly for kicking off periodic tasks; in what context you execute them is up to you.

It might be worthwhile to design your backend tasks so that they can also be scheduled (manually, or perhaps by querying current quota usage) to run in the "Frontend" context via task queues, whenever you have spare frontend CPU cycles.
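
To illustrate that last point, the Python taskqueue API lets you route the same work either to a backend or to the normal frontend context via its target argument; the handler URL and backend name below are made up:

```python
from google.appengine.api import taskqueue

# Route the task to a named backend, suited to long I/O-bound work:
taskqueue.add(url='/work/pull_sites', target='crawler-backend')

# The same work on a regular frontend instance, if cycles are spare:
taskqueue.add(url='/work/pull_sites')
```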

+3

Task queues are the best way to do this in general, but you may want to check out the App Engine Pipeline API, which is designed for exactly the kind of workflow you're talking about.
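
For the curious, here is a rough sketch of what steps 2 and 3 of the question could look like with the Pipeline API. The class names, the save_to_table1 / save_to_table2 / call_third_party helpers, and the example URL are all assumptions of mine, standing in for the asker's real logic:

```python
import pipeline  # the appengine-pipeline library

from google.appengine.api import urlfetch


class PullSite(pipeline.Pipeline):
    """Step 2: fetch one site and store the parsed result in [Table1]."""
    def run(self, site_url):
        result = urlfetch.fetch(site_url, deadline=60)
        return save_to_table1(result.content)  # hypothetical helper


class EnrichRecord(pipeline.Pipeline):
    """Step 3: call the third-party site and store the result in [Table2]."""
    def run(self, record_key):
        save_to_table2(call_third_party(record_key))  # hypothetical helpers


class SiteWorkflow(pipeline.Pipeline):
    """Chain the stages; each yielded child runs once its inputs resolve."""
    def run(self, site_url):
        record_key = yield PullSite(site_url)
        yield EnrichRecord(record_key)


# Started per due site, e.g. from a cron or task handler:
# SiteWorkflow('http://example.com/feed').start()
```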

+5

I moved off GAE before Backends came out, so I can't comment on those. But here's what I did several times:

  • A cron schedule kicks off the process
  • The cron handler calls a task URL
  • The task grabs the first item (a URL) from the datastore, makes an HTTP request, processes the data, updates the record for the URL it worked on, and then calls the task URL again.

This way, cron basically just wakes up the task, and the task keeps re-enqueuing itself until it reaches some stopping point. A minimal sketch of the pattern follows.
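
Here is a sketch of that chaining pattern with webapp2, urlfetch, and taskqueue. The /task/process URL and the get_next_pending_site / handle_data / mark_done helpers are illustrative assumptions, not code from the repo linked below:

```python
import webapp2

from google.appengine.api import taskqueue, urlfetch


class ProcessNextUrl(webapp2.RequestHandler):
    def post(self):
        site = get_next_pending_site()  # hypothetical datastore lookup
        if site is None:
            return  # nothing left to do: the chain stops here

        result = urlfetch.fetch(site.url, deadline=60)
        if result.status_code == 200:
            handle_data(site, result.content)  # hypothetical parse-and-save
        mark_done(site)                        # hypothetical record update

        # Re-enqueue ourselves to handle the next pending item.
        taskqueue.add(url='/task/process')


app = webapp2.WSGIApplication([('/task/process', ProcessNextUrl)])
```

The cron handler then only needs to enqueue the first /task/process task to start the chain.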

You can see this pattern in action in one of my public GAE applications: https://github.com/mavenn/watchbots-gae-python .

0
