MongoDB: aggregation framework: get last dated document per group id

I want to get the last document for each station with all other fields:

    { "_id" : ObjectId("535f5d074f075c37fff4cc74"), "station" : "OR", "t" : 86, "dt" : ISODate("2014-04-29T08:02:57.165Z") }
    { "_id" : ObjectId("535f5d114f075c37fff4cc75"), "station" : "OR", "t" : 82, "dt" : ISODate("2014-04-29T08:02:57.165Z") }
    { "_id" : ObjectId("535f5d364f075c37fff4cc76"), "station" : "WA", "t" : 79, "dt" : ISODate("2014-04-29T08:02:57.165Z") }

I need t and station for the latest dt per station. Using the aggregation framework:

    db.temperature.aggregate([
        { $sort: { "dt": 1 } },
        { $group: { "_id": "$station", result: { $last: "$dt" }, t: { $last: "$t" } } }
    ])

returns

    {
        "result" : [
            { "_id" : "WA", "result" : ISODate("2014-04-29T08:02:57.165Z"), "t" : 79 },
            { "_id" : "OR", "result" : ISODate("2014-04-29T08:02:57.165Z"), "t" : 82 }
        ],
        "ok" : 1
    }

Is this the most efficient way to do this?

thanks

3 answers

To answer your question directly: yes, it is the most efficient way. But it is worth working through why this is so.

As the alternatives suggest, the one thing people reach for is "sorting" your results before passing them to the $group stage. Since what you are looking at is the timestamp value, you want to make sure everything is in "timestamp" order, hence the form:

    db.temperature.aggregate([
        { "$sort": { "station": 1, "dt": -1 } },
        { "$group": {
            "_id": "$station",
            "result": { "$first": "$dt" },
            "t": { "$first": "$t" }
        }}
    ])

And as stated, you certainly want an index that reflects that sort order in order to make the sort efficient.

Here is the real point, though, which seems to have been missed by others: this data is almost certainly being inserted in time order already, since each reading is recorded as it is added.

So the beauty here is that the _id field (with a default ObjectId) is already in "timestamp" order, since it itself contains a time value. That makes the following statement possible:
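To see why _id order is time order: the first four bytes of a default ObjectId encode the creation time in whole seconds since the Unix epoch, so the insertion timestamp can be recovered from the hex string alone. A minimal sketch in plain JavaScript (the helper name objectIdTimestamp is mine; in the mongo shell you would simply call ObjectId(...).getTimestamp()):

```javascript
// The first 4 bytes (8 hex characters) of a default ObjectId encode
// the creation time as whole seconds since the Unix epoch.
function objectIdTimestamp(hexId) {
  const seconds = parseInt(hexId.substring(0, 8), 16);
  return new Date(seconds * 1000);
}

// One of the sample _id values from the question:
const created = objectIdTimestamp("535f5d074f075c37fff4cc74");
// created falls on 2014-04-29 (UTC), matching the document's "dt" field
```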

    db.temperature.aggregate([
        { "$group": {
            "_id": "$station",
            "result": { "$last": "$dt" },
            "t": { "$last": "$t" }
        }}
    ])

And this is faster. Why? Well, you don't need to select an index (extra code to run), and you don't need to load the index pages in addition to the documents.

We already know the documents are in order (by _id), so the $last boundaries are perfectly valid. You are scanning everything anyway, and you could also "range" the query on the _id values between two dates.
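The $last semantics are easy to model outside the database: over documents in insertion order, keeping the last value seen per key yields the latest reading for each group. A plain JavaScript sketch over the three sample documents from the question (lastPerStation is an illustrative helper, not anything MongoDB provides):

```javascript
// Model $group with $last: walk documents in order and let later
// entries for the same key overwrite earlier ones.
function lastPerStation(docs) {
  const out = {};
  for (const d of docs) {
    out[d.station] = { result: d.dt, t: d.t };
  }
  return out;
}

// The sample documents, in insertion (_id) order:
const docs = [
  { station: "OR", t: 86, dt: "2014-04-29T08:02:57.165Z" },
  { station: "OR", t: 82, dt: "2014-04-29T08:02:57.165Z" },
  { station: "WA", t: 79, dt: "2014-04-29T08:02:57.165Z" }
];
const grouped = lastPerStation(docs);
// grouped.OR.t === 82 and grouped.WA.t === 79, matching the question's output
```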

The only thing to add is that in "real world" usage it may be more practical for you to $match between date ranges when doing this kind of accumulation, rather than deriving the "range" from the "first" and "last" _id values, or something similar in your actual use.
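A date range can in fact be expressed as an _id range: since the leading bytes of a default ObjectId are the timestamp, you can manufacture a boundary ObjectId for any Date by hex-encoding the seconds and zero-filling the remaining bytes. A sketch under that assumption, again in plain JavaScript (objectIdFromDate is a hypothetical helper, not a built-in):

```javascript
// Build the smallest ObjectId hex string for a given Date. Every
// document created at or after this time has an _id >= this value,
// so it can serve as a lower bound in an _id range $match.
function objectIdFromDate(date) {
  const seconds = Math.floor(date.getTime() / 1000);
  // 8 hex chars of timestamp, then 16 zero hex chars for the rest.
  return seconds.toString(16).padStart(8, "0") + "0000000000000000";
}

// Lower bound for everything inserted since the epoch:
const lower = objectIdFromDate(new Date(0));
// lower === "000000000000000000000000"
```

In the shell, two such bounds would then be wrapped as ObjectId(...) values inside a $match stage, e.g. { "_id": { "$gte": ObjectId(lower), "$lt": ObjectId(upper) } }.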

So where is the evidence? Well, it is fairly easy to reproduce, so I just did, by generating some sample data:

    var stations = [
        "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
        "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
        "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
        "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
        "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"
    ];

    for ( var i = 0; i < 200000; i++ ) {
        var station = stations[Math.floor(Math.random() * stations.length)];
        var t = Math.floor(Math.random() * ( 96 - 50 + 1 )) + 50;
        var dt = new Date();
        db.temperature.insert({ station: station, t: t, dt: dt });
    }

On my hardware (an 8GB laptop with a spinning disk, which is not stellar but certainly adequate), running each form of the statement clearly shows a noticeable pause with the version using the index and sort (the same keys in the index as in the sort expression). It is only a minor pause, but the difference is significant enough to notice.

Even looking at the explain output (version 2.6 and up, or actually there in 2.4.9 though undocumented), you can see a difference: although the $sort is optimized due to the presence of the index, the time taken appears to go to selecting the index and then loading the indexed entries. Including all the fields for a "covered" index query makes no difference.

Also, for the record, indexing only the date and sorting only on the date values gives the same result. Possibly marginally faster, but still slower than the natural index form without the sort.

So as long as you can happily "range" on the first and last _id values, then it is true that using the natural index on insertion order is actually the most efficient way to do this. Your real-world mileage may vary on whether this is practical for you, and it may simply end up being more convenient to implement the index and sort on the date.

But if, for example, you were happy using _id ranges or anything greater than the "last" _id in your query, then there is one tweak to get those values along with your results, so you can in fact store and use that information in successive queries:

    db.temperature.aggregate([
        // Get documents "greater than" the "highest" _id value found last time
        { "$match": {
            "_id": { "$gt": ObjectId("536076603e70a99790b7845d") }
        }},

        // Do the grouping with addition of the returned field
        { "$group": {
            "_id": "$station",
            "result": { "$last": "$dt" },
            "t": { "$last": "$t" },
            "lastDoc": { "$last": "$_id" }
        }}
    ])

And if you were actually "following on" from results like that, you can determine the maximum ObjectId value from your results and use it in the next query.
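Because ObjectId hex strings are a fixed 24 characters, picking the maximum is just a lexicographic string comparison; no BSON library is needed. A small sketch of how follow-up code might track the highest _id seen (maxObjectId is an illustrative helper, not a MongoDB function):

```javascript
// ObjectId hex strings are exactly 24 lowercase hex characters, so
// lexicographic string order matches BSON ObjectId order.
function maxObjectId(hexIds) {
  return hexIds.reduce(function (a, b) { return b > a ? b : a; });
}

// The "lastDoc" values returned per station by an aggregation like
// the one above:
const next = maxObjectId([
  "535f5d074f075c37fff4cc74",
  "535f5d364f075c37fff4cc76",
  "535f5d114f075c37fff4cc75"
]);
// next === "535f5d364f075c37fff4cc76" -- use this in the next $match
```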

At any rate, have fun playing with this. But again: yes, in this case that query is the fastest way.


An index is all you really need:

    db.temperature.ensureIndex({ 'station': 1, 'dt': 1 })

    db.temperature.distinct('station').forEach(function(s) {
        printjson( db.temperature.find({ station: s }).sort({ dt: -1 }).limit(1).toArray() )
    })

using, of course, whatever syntax is actually valid for your language or driver.

Edit: You are correct that a loop like this incurs one round trip per station, which is fine for a few stations but not so good for 1000. You still want the compound index on station + dt, and you want to use a descending sort:

    db.temperature.aggregate([
        { $sort: { station: 1, dt: -1 } },
        { $group: { _id: "$station", result: { $first: "$dt" }, t: { $first: "$t" } } }
    ])

Regarding the aggregation query you posted, I would make sure you have an index on dt:

 db.temperature.ensureIndex({'dt': 1 }) 

This will make sure the $sort at the beginning of the aggregation pipeline is as efficient as possible.

As for whether this is the most efficient way to get the data versus a query in a loop, that will most likely be a function of how much data you have. To start with, with "thousands of stations" and perhaps hundreds of thousands of data points, I think the aggregation method will be faster.

However, as you add more and more data, the problem is that the aggregation query will still touch all the documents. This gets increasingly expensive as you scale to millions or more documents. One approach for that case would be to add a $limit right after the $sort to cap the total number of documents being considered. That is a bit hacky and not exact, but it would help limit the total number of documents that need to be accessed.

