I have a MongoDB collection with over 1,000,000 entries. Each record is about 20 KB, so the total collection size is about 20 GB.
The collection has a type field that takes about 10 distinct values. I would like to get per-type counters for the collection. The type field is indexed.
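For reference, the setup assumed below is a plain single-field ascending index on type. The collection and field names (my_colc, type) come from the question; the connection details and index creation are my illustration:

    import pymongo

    # Placeholder connection and database names.
    client = pymongo.MongoClient()
    my_db = client['my_db']

    # Single-field ascending index on 'type'; count() queries on
    # {'type': value} can be answered from this index alone.
    my_db.my_colc.create_index([('type', pymongo.ASCENDING)])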
I tested two different approaches (in Python syntax):
1. The naive method: issue a count() call for each distinct value:
    counters = {}
    for type_val in my_db.my_colc.distinct('type'):
        counters[type_val] = my_db.my_colc.find({'type': type_val}).count()
2. An aggregation pipeline with a $group stage:
    counters = my_db.my_colc.aggregate([
        {'$group': {'_id': '$type', 'agg_val': {'$sum': 1}}}
    ])
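Note that the return type of aggregate() differs across PyMongo versions; the normalization below is my addition, not part of the original question:

    result = my_db.my_colc.aggregate([
        {'$group': {'_id': '$type', 'agg_val': {'$sum': 1}}}
    ])
    # PyMongo 2.x returns a command response dict like {'ok': 1, 'result': [...]};
    # newer versions return a CommandCursor that can be iterated directly.
    docs = result['result'] if isinstance(result, dict) else result
    counters = {doc['_id']: doc['agg_val'] for doc in docs}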
The first approach is about two orders of magnitude faster than the second (roughly 1 minute versus 45 minutes). It seems to be connected with the fact that count() can be satisfied from the index alone, without touching the documents, while $group has to iterate over the documents one at a time.
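This can be checked with explain. A minimal sketch follows; 'some_value' is a placeholder, the exact output fields vary by server version, and issuing the aggregation explain as a raw database command is my assumption for PyMongo drivers that predate an explain option on aggregate():

    # Plan for the per-type count: should show an index scan
    # (e.g. BtreeCursor on 'type') rather than a collection scan.
    print(my_db.my_colc.find({'type': 'some_value'}).explain())

    # Explain for the aggregation, issued as a raw database command.
    pipeline = [{'$group': {'_id': '$type', 'agg_val': {'$sum': 1}}}]
    print(my_db.command('aggregate', 'my_colc', pipeline=pipeline, explain=True))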
Is there a way to run an efficient grouping query over the type index, so that it uses only the index and achieves the performance of approach 1, but through the aggregation framework?
I am using MongoDB 2.6.1.
Update: https://jira.mongodb.org/browse/SERVER-11447 is open in the MongoDB Jira for this issue.