Find duplicates in the App Engine datastore

I have some duplicate entities in my App Engine datastore (not the entire row, but most of its fields).

What is the best way to find them?

Both integer and string fields are duplicated (in case comparing one type is faster than the other).

Thanks!

+7
2 answers

A silly but quick approach would be to take the fields you care about, concatenate them into one long string, and use that string as the key name of a DB_Unique entity that references the original entity. Each time you call DB_Unique.get_or_insert(), check that the stored reference points to the entity you are currently processing; if it does not, you have a duplicate. This should probably be run as a map job over all entities.

Something like:

    from google.appengine.ext import db

    class DB_Unique(db.Model):
        # reference back to the first entity seen with this key
        r = db.ReferenceProperty()

    class DB_Obj(db.Model):
        a = db.IntegerProperty()
        b = db.StringProperty()
        c = db.StringProperty()

    # executed for each DB_Obj...
    def mapreduce(entity):
        key = '%s_%s_%s' % (entity.a, entity.b, entity.c)
        res = DB_Unique.get_or_insert(key, r=entity)
        if DB_Unique.r.get_value_for_datastore(res) != entity.key():
            # we have a possible collision, verify and delete?
            # the two candidate entities are res.r and entity
            pass

There are a few edge cases to watch for. For example, if two entities have b and c equal to ('a_b', '') and ('a', 'b_') respectively, the concatenation is "a_b_" for both. So either use a separator character that you know never occurs in your strings instead of "_", or make DB_Unique.r a list of references and compare them all.
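To make that edge case concrete, here is a minimal sketch in plain Python with no datastore calls. The helper names (naive_key, safe_key, group_duplicates) and the sample rows are made up for illustration. It assumes JSON-encoding the field list is an acceptable way to keep field boundaries unambiguous, and it keeps a list of ids per key, in the spirit of the "list of references" alternative.

```python
import json
from collections import defaultdict


def naive_key(a, b, c):
    # Same format as the key in the answer: fields joined with "_".
    return '%s_%s_%s' % (a, b, c)


def safe_key(a, b, c):
    # JSON-encode the fields as a list so field boundaries stay unambiguous
    # (an assumption for this sketch, not part of the original answer).
    return json.dumps([a, b, c])


def group_duplicates(rows, key_func):
    # Map each key to the ids of all rows that produced it.
    groups = defaultdict(list)
    for row_id, a, b, c in rows:
        groups[key_func(a, b, c)].append(row_id)
    # Keep only keys shared by more than one row.
    return sorted(ids for ids in groups.values() if len(ids) > 1)


# Hypothetical sample rows: (id, a, b, c)
rows = [
    (1, 7, 'a_b', ''),
    (2, 7, 'a', 'b_'),  # different fields, same naive key as row 1
    (3, 7, 'x', 'y'),
    (4, 7, 'x', 'y'),   # genuine duplicate of row 3
]

naive_groups = group_duplicates(rows, naive_key)
safe_groups = group_duplicates(rows, safe_key)
```

With the "_" separator, rows 1 and 2 are wrongly grouped together; with the encoded key, only rows 3 and 4 are reported.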

+6

If this is a one-time or rare occurrence, you can try dumping the whole datastore to your local machine (see "Uploading and Downloading Data"), loading the data into a sqlite3 database, and finding the duplicates there.
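Once the data is local, the duplicate search itself could look something like the sketch below. The table name db_obj, its columns, and the sample rows are assumptions for illustration; the real schema depends on how you export and import the data.

```python
import sqlite3

# Hypothetical rows standing in for the downloaded entities: (id, a, b, c)
rows = [
    (1, 7, 'a_b', ''),
    (2, 7, 'a', 'b_'),
    (3, 7, 'x', 'y'),
    (4, 7, 'x', 'y'),  # duplicate of row 3 on (a, b, c)
]

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE db_obj (id INTEGER PRIMARY KEY, a INTEGER, b TEXT, c TEXT)'
)
conn.executemany(
    'INSERT INTO db_obj (id, a, b, c) VALUES (?, ?, ?, ?)', rows
)

# Group on the fields that define a duplicate and keep groups with
# more than one row; MIN(id) identifies the row you might keep.
query = '''
    SELECT a, b, c, COUNT(*) AS n, MIN(id) AS first_id
    FROM db_obj
    GROUP BY a, b, c
    HAVING COUNT(*) > 1
    ORDER BY a, b, c
'''
duplicates = conn.execute(query).fetchall()
conn.close()
```

Because SQL compares the columns separately, the separator problem from the key-concatenation approach does not arise here.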

Trying to do this programmatically on the GAE side can be quite tedious. With tasks it is entirely feasible, but not exactly easy.

+1