We have a reasonable size database in the Google App Engine - more than 50,000 objects in total - we want to clear outdated data. The plan was to write a deferred task , to iterate over objects that we no longer wanted, and delete them in batches.
One of the complications is that our entities also have child entities that we also want to purify โ no problem, we thought; we just query the data store for these objects and drop them at the same time as the parent:
query = ParentKind.all() query.count(100) query.filter('bar =', 'foo') to_delete = [] for entity in enumerate(query): to_delete.append(entity) to_delete.extend(ChildKindA.all().ancestor(entity).fetch(100)) to_delete.extend(ChildKindB.all().ancestor(entity).fetch(100)) db.delete(to_delete)
We limited ourselves to deleting 100 ParentKind objects at a time; each ParentKind had about 40 ChildKindA and ChildKindB objects in total - maybe 4000 objects.
At the time, this seemed reasonable, but we performed one batch as a test, and the received request took 9 seconds to start - and spent 1933 seconds on the processorโs paid time, accessing the data warehouse.
It seems pretty harsh - 0.5 paid seconds per entity! - but we are not quite sure what we are doing wrong. Is it just a lot size? Are ancestors often asked? Or does it delete (and indeed, all accesses to the data warehouse) just slowly, like molasses?
Update
We changed our requests to keys_only , and although this reduced the launch time of one batch to 4.5 real seconds, it still costs ~ 1900 seconds in CPU time.
Then we installed Appstats in our application (thanks, kevpie) and launched smaller child objects - 10 parent objects, which would be ~ 450 entities. Here's the updated code:
query = ParentKind.all(keys_only=True) query.count(10) query.filter('bar =', 'foo') to_delete = [] for entity in enumerate(query): to_delete.append(entity) to_delete.extend(ChildKindA.all(keys_only=True).ancestor(entity).fetch(100)) to_delete.extend(ChildKindB.all(keys_only=True).ancestor(entity).fetch(100)) db.delete(to_delete)
Appstats Results:
service.call #RPCs real time api time datastore_v3.RunQuery 22 352ms 555ms datastore_v3.Delete 1 366ms 132825ms taskqueue.BulkAdd 1 7ms 0ms
Calling Delete is the most expensive part of the operation!
Is there any way around this? Nick Johnson mentioned that using volumetric deletion is the fastest way to delete at present, but ideally we donโt want to delete all good entities, only those that match and are children of our initial bar = foo query.