Google AppEngine - Reading a Large Datastore

I need to read all the entries in the Google AppEngine datastore in order to do some initialization work. There are currently a lot of objects (around 80,000), and the number keeps growing. I'm starting to hit the 30-second datastore query timeout.

Are there any recommendations on how to handle these kinds of huge reads from the datastore? Any examples?

2 answers

You can solve this in several ways:

  • Run your code on the Task Queue, which has a 10-minute deadline instead of 30 seconds (which is closer to 60 seconds in practice). The easiest way to do this is a DeferredTask; see the sketch after this list.

    Warning: a DeferredTask must be serializable, so it is hard to pass complex data into it. Also, do not make it an inner class.

  • Look into backends. Requests served by a backend instance have no time limit.

  • Finally, if you need to break up a large task and execute it in parallel, then look at mapreduce.
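
For the task-queue option, here is a minimal sketch that pairs the deferred library with query cursors so each task only processes one batch before re-enqueuing itself. MyModel, init_entity and BATCH_SIZE are placeholders (not from the original answer) standing in for your own model, per-entity initialization and batch size:

import logging

from google.appengine.ext import db
from google.appengine.ext import deferred

BATCH_SIZE = 200


class MyModel(db.Model):
    initialized = db.BooleanProperty(default=False)


def init_entity(entity):
    # Placeholder for whatever initialization work each entity needs.
    entity.initialized = True
    entity.put()


def process_all(cursor=None):
    query = MyModel.all()
    if cursor:
        query.with_cursor(cursor)

    batch = query.fetch(BATCH_SIZE)
    for entity in batch:
        init_entity(entity)

    if len(batch) == BATCH_SIZE:
        # More entities may remain: re-enqueue with the current cursor so
        # each task stays well inside its deadline.
        deferred.defer(process_all, query.cursor())
    else:
        logging.info("Finished processing all entities")


# Kick it off from any request handler with: deferred.defer(process_all)

Note that process_all is defined at module level (not as an inner class or closure), in line with the warning above.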


This StackExchange answer served me well:

Expired requests and appengine

I had to modify it slightly to make it work for me:

import logging


def loop_over_objects_in_batches(batch_size, object_class, callback):
    # object_class is expected to support count() and slicing,
    # e.g. a db.Query such as MyModel.all().
    num_els = object_class.count()
    num_loops = num_els / batch_size
    remainder = num_els - num_loops * batch_size
    logging.info("Calling batched loop with batch_size: %d, num_els: %s, num_loops: %s, "
                 "remainder: %s, object_class: %s, callback: %s" %
                 (batch_size, num_els, num_loops, remainder, object_class, callback))
    offset = 0
    # Process the full batches first.
    while offset < num_loops * batch_size:
        logging.info("Processing batch (%d:%d)" % (offset, offset + batch_size))
        query = object_class[offset:offset + batch_size]
        for q in query:
            callback(q)
        offset = offset + batch_size
    # Then process whatever is left over.
    if remainder:
        logging.info("Processing remainder batch (%d:%d)" % (offset, num_els))
        query = object_class[offset:num_els]
        for q in query:
            callback(q)
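
As a usage sketch, assuming the hypothetical MyModel kind from the example above, the helper can be driven with a db.Query and a simple callback:

def mark_initialized(entity):
    # Example callback; replace with your own initialization work.
    entity.initialized = True
    entity.put()

# MyModel.all() returns a db.Query, which supports both count() and slicing,
# matching what loop_over_objects_in_batches expects of object_class.
loop_over_objects_in_batches(100, MyModel.all(), mark_initialized)

Keep in mind that fetching by large offsets still makes the datastore skip over every preceding result, so for very large kinds the cursor-based pattern from the first answer scales better.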
