MongoDB aggregation - splitting into time buckets

Question


Is it possible to use the MongoDB aggregation framework to generate time series output, where every source document that is deemed to fall within a bucket is added to that bucket?

Say my collection looks something like this:

    /* light_1 on from 10AM to 1PM */
    {
        "_id" : "light_1",
        "on" : ISODate("2015-01-01T10:00:00Z"),
        "off" : ISODate("2015-01-01T13:00:00Z")
    },
    /* light_2 on from 11AM to 7PM */
    {
        "_id" : "light_2",
        "on" : ISODate("2015-01-01T11:00:00Z"),
        "off" : ISODate("2015-01-01T19:00:00Z")
    }

...and I use a bucket interval of 6 hours to generate a report for 2015-01-01. I want my result to look something like this:

    {
        "start" : ISODate("2015-01-01T00:00:00Z"),
        "end" : ISODate("2015-01-01T06:00:00Z"),
        "lights_on" : []
    },
    {
        "start" : ISODate("2015-01-01T06:00:00Z"),
        "end" : ISODate("2015-01-01T12:00:00Z"),
        "lights_on" : ["light_1", "light_2"]
    },
    {
        "start" : ISODate("2015-01-01T12:00:00Z"),
        "end" : ISODate("2015-01-01T18:00:00Z"),
        "lights_on" : ["light_1", "light_2"]
    },
    {
        "start" : ISODate("2015-01-01T18:00:00Z"),
        "end" : ISODate("2015-01-02T00:00:00Z"),
        "lights_on" : ["light_2"]
    }

A light is considered 'on' during a bucket if its "on" value is < the "end" of the bucket and its "off" value is >= the "start" of the bucket.

I know that I can use $group and the aggregation date operators to group by either the start or the end time, but in that case it is a one-to-one mapping. Here, a single source document can end up in multiple time buckets if it spans more than one of them.

The report range and the bucket interval are not known in advance.
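To make the requirement concrete, here is a rough sketch in plain JavaScript with no MongoDB involved. The helper names `makeBuckets` and `isOnDuring` are invented for this illustration, and the overlap test is one reading of the rule stated above (on before the bucket end, off at or after the bucket start).

```javascript
// Hypothetical helpers for illustration only; not part of any MongoDB API.

// Build consecutive [start, end) buckets covering the report range.
function makeBuckets(rangeStart, rangeEnd, intervalMs) {
  const buckets = [];
  for (let t = rangeStart.getTime(); t < rangeEnd.getTime(); t += intervalMs) {
    buckets.push({ start: new Date(t), end: new Date(t + intervalMs), lights_on: [] });
  }
  return buckets;
}

// Assumed rule: "on" is before the bucket end and "off" is at or after the bucket start.
function isOnDuring(light, bucket) {
  return light.on < bucket.end && light.off >= bucket.start;
}

const lights = [
  { _id: "light_1", on: new Date("2015-01-01T10:00:00Z"), off: new Date("2015-01-01T13:00:00Z") },
  { _id: "light_2", on: new Date("2015-01-01T11:00:00Z"), off: new Date("2015-01-01T19:00:00Z") }
];

// Range and interval are ordinary runtime values here.
const report = makeBuckets(
  new Date("2015-01-01T00:00:00Z"),
  new Date("2015-01-02T00:00:00Z"),
  6 * 60 * 60 * 1000
);

// One light can land in several buckets.
for (const bucket of report) {
  for (const light of lights) {
    if (isOnDuring(light, bucket)) bucket.lights_on.push(light._id);
  }
}
```

With the two sample lights this yields four buckets, and light_2 appears in three of them, which is the one-to-many behaviour a plain $group cannot express.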

+5

mongodb time-series mongodb-query mapreduce aggregation-framework

David black Jul 29 '15 at 9:32


3 answers

I initially misunderstood your question. Assuming I now understand what you need, this looks more like a job for mapReduce. I'm not sure how you determine the range or the interval, so I will make them constants; you can change that part of the code as needed. You could do something like this:

    var mapReduceObj = {};

    mapReduceObj.map = function() {
        var start = new Date("2015-01-01T00:00:00Z").getTime(),
            end = new Date("2015-01-02T00:00:00Z").getTime(),
            interval = 21600000; // 6 hours in milliseconds

        var time = start;
        while (time < end) {
            var endtime = time + interval;
            // Emit once for every bucket the light overlaps (no early exit)
            if (this.on < endtime && this.off >= time) {
                emit(
                    { start : new Date(time), end : new Date(endtime) },
                    { lights : [this._id] }
                );
            }
            time = endtime;
        }
    };

    // Emitted values have the same shape as the reduce output,
    // so already-reduced values can safely be reduced again
    mapReduceObj.reduce = function(times, values) {
        var lightsArr = { lights : [] };
        for (var i = 0; i < values.length; i++) {
            lightsArr.lights = lightsArr.lights.concat(values[i].lights);
        }
        return lightsArr;
    };

The result will look like this:

    results : {
        _id : {
            start : ISODate("2015-01-01T06:00:00Z"),
            end : ISODate("2015-01-01T12:00:00Z")
        },
        value : {
            lights : [ "light_6", "light_7" ]
        },
        ...
    }

~ Original answer ~

This will give you the exact format you want.

    db.lights.aggregate([
        { "$match": {
            "$and": [
                { on : { $lt : ISODate("2015-01-01T06:00:00Z") } },
                { off : { $gte: ISODate("2015-01-01T12:00:00Z") } }
            ]
        }},
        { "$group": {
            _id : null,
            "lights_on" : { $push : "$_id" }
        }},
        { "$project": {
            _id : false,
            start : { $add : ISODate("2015-01-01T06:00:00Z") },
            end : { $add : ISODate("2015-01-01T12:00:00Z") },
            lights_on: true
        }}
    ]);

First, the $match stage finds all documents that meet your time constraints. Then $group pushes the _id field (in this case light_n, where n is an integer) into the lights_on field. Either $addToSet or $push can be used because the _id field is unique, but if you use a field that can contain duplicates, you will need to decide whether duplicates are acceptable in the array. Finally, $project produces the exact format you want.

+2

c1moore Jul 29 '15 at 13:23


One way is to use the $cond operator inside a $project stage and compare each bucket's "start" and "end" with the "on" and "off" values in the original collection. Loop over the buckets in your MongoDB client and do something like this for each one:

    db.lights.aggregate([
        { "$project": {
            "present": {
                "$cond": [
                    { "$and": [
                        { "$lte": [ "$on", ISODate("2015-01-01T06:00:00Z") ] },
                        { "$gte": [ "$off", ISODate("2015-01-01T12:00:00Z") ] }
                    ]},
                    1,
                    0
                ]
            }
        }}
    ]);

The result should look something like this:

    { "_id" : "light_1", "present" : 0 }
    { "_id" : "light_2", "present" : 0 }
    { "_id" : "light_3", "present" : 1 }

For every document with {"present":1}, add the light's "_id" to the "lights_on" field for that bucket in your client. Hope this helps.
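A minimal sketch of that client-side loop, assuming a Node.js-style client. `buildPresencePipeline` is a hypothetical helper: it only builds the pipeline documents and never contacts a server, and plain `Date` objects stand in for the shell's `ISODate`. It also assumes the overlap rule from the question (on before the bucket end, off at or after the bucket start), which is looser than the comparison in the snippet above.

```javascript
// Hypothetical helper: builds the $project/$cond stage for a single bucket.
function buildPresencePipeline(start, end) {
  return [
    { $project: {
        present: { $cond: [
          // Assumed overlap rule taken from the question
          { $and: [ { $lt: ["$on", end] }, { $gte: ["$off", start] } ] },
          1,
          0
        ] }
    } }
  ];
}

const sixHours = 6 * 60 * 60 * 1000;
const dayStart = new Date("2015-01-01T00:00:00Z").getTime();
const dayEnd = new Date("2015-01-02T00:00:00Z").getTime();

// One pipeline per bucket of the report day.
const pipelines = [];
for (let t = dayStart; t < dayEnd; t += sixHours) {
  const start = new Date(t);
  const end = new Date(t + sixHours);
  pipelines.push({ start: start, end: end, pipeline: buildPresencePipeline(start, end) });
}
```

Each entry's `pipeline` would then be passed to `db.lights.aggregate(...)`, and the "_id" values returned with "present" equal to 1 collected into that bucket's "lights_on" array. This costs one round trip per bucket, which is acceptable for a handful of buckets but not for fine-grained intervals.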

0

void Jul 29 '15 at 12:21


Blakes seven · Accepted Answer · 2015-07-29T23:17:49+0000

Introduction

Your goal here demands a bit of thought about how the events, as you have structured them, map onto the time periods being aggregated. The obvious point is that a single document, as you represent it, can actually describe events that must be reported in "multiple" time periods in the final aggregated result.

On analysis, this turns out to be a problem outside the scope of the aggregation framework because of the time periods required. Certain events need to be "generated" beyond what can simply be grouped on, as you should be able to see.

To do this "generation", you need mapReduce. It has "flow control", with JavaScript as its processing language, so it can determine whether the time between on and off spans more than one period and therefore emit the data for each of the periods in which it occurred.

As a side note, the "light" is probably not the best choice for _id, since a light can be turned on and off many times in a given day. An "instance" of on/off is likely a better fit. I am just following your example here, though, so to adapt this, simply replace the reference to _id in the mapper code with whatever field actually holds the light's identifier.

But on to the code:

    // Start date and next date for the query (should be external to the main code)
    var oneHour = ( 1000 * 60 * 60 ),
        sixHours = ( oneHour * 6 ),
        oneDay = ( oneHour * 24 ),
        today = new Date("2015-01-01"),   // your input
        tomorrow = new Date( today.valueOf() + oneDay ),
        yesterday = new Date( today.valueOf() - sixHours ),
        nextday = new Date( tomorrow.valueOf() + sixHours );

    // Main logic
    db.collection.mapReduce(
        // Mapper to emit data
        function() {
            // Constants and round date to hour
            var oneHour = ( 1000 * 60 * 60 ),
                sixHours = ( oneHour * 6 ),
                // Assumption: a light with no "off" value yet is treated as
                // still on at the end of the report day
                off = this.off || endDay,
                startPeriod = new Date( this.on.valueOf() - ( this.on.valueOf() % oneHour ) ),
                endPeriod = new Date( off.valueOf() - ( off.valueOf() % oneHour ) );

            // Hour to 6 hour period and convert to UTC timestamp
            startPeriod = startPeriod.setUTCHours(
                Math.floor( startPeriod.getUTCHours() / 6 ) * 6 );
            endPeriod = endPeriod.setUTCHours(
                Math.floor( endPeriod.getUTCHours() / 6 ) * 6 );

            // Init empty results for each period only on first document processed
            if ( counter == 0 ) {
                for ( var x = startDay.valueOf(); x < endDay.valueOf(); x += sixHours ) {
                    emit(
                        { start: new Date(x), end: new Date(x + sixHours) },
                        { lights_on: [] }
                    );
                }
            }

            // Emit for every period until turned off, only within the day
            for ( var x = startPeriod; x <= endPeriod; x += sixHours ) {
                if ( ( x >= startDay ) && ( x < endDay ) ) {
                    emit(
                        { start: new Date(x), end: new Date(x + sixHours) },
                        { lights_on: [this._id] }
                    );
                }
            }

            counter++;
        },

        // Reducer to keep all lights in one array per period
        function(key, values) {
            var result = { lights_on: [] };
            values.forEach(function(value) {
                value.lights_on.forEach(function(light) {
                    if ( result.lights_on.indexOf(light) == -1 )
                        result.lights_on.push(light);
                });
            });
            result.lights_on.sort();
            return result;
        },

        // Options and query
        {
            "out": { "inline": 1 },
            "query": {
                "on": { "$gte": yesterday, "$lt": tomorrow },
                "$or": [
                    { "off": { "$gte": today, "$lt": nextday } },
                    { "off": null },
                    { "off": { "$exists": false } }
                ]
            },
            "scope": { "startDay": today, "endDay": tomorrow, "counter": 0 }
        }
    )

Map and Reduce

In essence, the "mapper" function looks at the current record, rounds each on/off time down to the hour, and then works out the start hour of the six-hour period in which the event occurred.

With those new date values, a loop starts at the "on" period and emits an event for the current "light" being on during that period, wrapped in a single-element array as explained below. Each pass advances the period by six hours until the period containing the "off" time is reached.
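The rounding and looping can be tried on their own in plain JavaScript. This is only a sketch, assuming periods are aligned to UTC midnight; `floorToPeriod` and `periodsSpanned` are names invented here, and a single modulo on the epoch value replaces the two-step hour rounding used in the mapper.

```javascript
const SIX_HOURS = 6 * 60 * 60 * 1000;

// Hypothetical helper: floor a Date to the start of its period (epoch-aligned, UTC).
function floorToPeriod(date, periodMs) {
  const t = date.getTime();
  return t - (t % periodMs);
}

// Start of every period touched by an on/off span, mirroring the mapper's emit loop.
function periodsSpanned(on, off, periodMs) {
  const starts = [];
  const last = floorToPeriod(off, periodMs);
  for (let x = floorToPeriod(on, periodMs); x <= last; x += periodMs) {
    starts.push(new Date(x).toISOString());
  }
  return starts;
}

// light_2 from the question: on at 11:00, off at 19:00
const spanned = periodsSpanned(
  new Date("2015-01-01T11:00:00Z"),
  new Date("2015-01-01T19:00:00Z"),
  SIX_HOURS
);
```

For light_2 this gives the periods starting at 06:00, 12:00 and 18:00, so the mapper would emit that light three times, once under each period key.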

These emitted values arrive in the reducer function, which must accept the same structure that it returns; hence the array of lights on in the period is kept inside the value object. The reducer processes the data emitted under the same key as a list of these value objects.

It first iterates over the list of values to reduce, then looks at the inner array of lights, which could have come from a previous reduce pass, and folds each of them into a single result array of unique lights. This is done simply by looking for the current light in the result array and pushing it onto that array where it does not already exist.

Note the "previous pass". If you are not familiar with how mapReduce works, you should understand that the reducer function itself emits a result that may not have been produced by processing "all" of the possible values for the "key" in a single pass. It can, and often does, process only a "subset" of the emitted data for a key, and will therefore receive a "reduced" result as input in just the same way as data emitted from the mapper.

This design point is why both the mapper and the reducer need to output data with the same structure: the reducer can also receive its input from data that has previously been reduced. This is how mapReduce deals with large data sets that emit a large number of identical key values. It typically processes them in "chunks", not all at once.
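The re-reduce property is easy to check outside the database. The sketch below copies the reducer logic into a standalone function (`reduceLights` is just a local name, and the key is a placeholder) and compares reducing everything at once with reducing in two chunks.

```javascript
// Same logic as the reducer above, as a standalone function.
function reduceLights(key, values) {
  const result = { lights_on: [] };
  values.forEach(function (value) {
    value.lights_on.forEach(function (light) {
      if (result.lights_on.indexOf(light) === -1) result.lights_on.push(light);
    });
  });
  result.lights_on.sort();
  return result;
}

const key = { start: "06:00", end: "12:00" }; // placeholder key, unused by the reducer
const emitted = [
  { lights_on: [] },            // the "empty period" emit
  { lights_on: ["light_2"] },
  { lights_on: ["light_1"] },
  { lights_on: ["light_2"] }    // duplicate, must be dropped
];

// Reduce everything in one pass ...
const allAtOnce = reduceLights(key, emitted);

// ... or reduce a chunk first and feed the partial result back in.
const partial = reduceLights(key, emitted.slice(0, 2));
const inChunks = reduceLights(key, [partial].concat(emitted.slice(2)));
```

Both paths end with the same object, which is only possible because the reducer's output has the same shape as the values the mapper emits.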

The final reduction comes down to the list of lights on during each period, with the period's start and end as the emitted key. Like so:

    {
        "_id": {
            "start": ISODate("2015-01-01T06:00:00Z"),
            "end": ISODate("2015-01-01T12:00:00Z")
        },
        "value": {
            "lights_on": [ "light_1", "light_2" ]
        }
    },

The "_id" / "value" structure is just a property of how all mapReduce output is presented, but all the values you need are there.

Query

There is also a note to make on the query selection, which needs to account for a light that is already "on" via a collection entry dated before the start of the current day. Likewise, a light may be turned "off" after the date being reported on, and may in fact have a null "off" value or no "off" key in the document at all, depending on how your data is stored and which day is being observed.

That logic requires some calculation from the start of the day being reported on, and considers the six-hour period both before and after that date, with query conditions as listed:

    {
        "on": { "$gte": yesterday, "$lt": tomorrow },
        "$or": [
            { "off": { "$gte": today, "$lt": nextday } },
            { "off": null },
            { "off": { "$exists": false } }
        ]
    }

The basic selectors use the range operators $gte and $lt to match values that are, respectively, greater than or equal to and less than the given bounds on the fields they test, in order to find data within a suitable range.

Within the $or condition, the various options for the "off" value are considered: it either falls within the range criteria, has a null value, or has no key present in the document at all, which is tested via the $exists operator. The conditions you need inside $or depend on how you actually represent "off" for a light that has not yet been turned off, but these are reasonable assumptions.

As with all MongoDB queries, the conditions are implicitly combined with AND unless stated otherwise.

This is still somewhat flawed, depending on how long a light is expected to stay on. But the variables are all intentionally declared externally so you can adjust them to your needs, taking into account how far before or after the report date you expect to have to fetch.

Create empty time series

The other note here is that the data itself is likely to contain no events showing a light on within some given time period. For this reason, there is a simple method embedded in the mapper function that checks whether we are on the first iteration of the results.

On that first pass only, a set of all possible period keys is emitted, each with an empty array for the lights on in that period. This allows the report to show the periods in which no light was on at all, since these entries are inserted into the data sent to the reducer and on to the output.

Your mileage may vary with this approach, as it still depends on there being some data that meets the query criteria in order to output anything. So to cater for a truly "blank day", where no data is recorded or none meets the criteria, it might be better to create an external hash table of the keys, all showing an empty result for the lights. Then just "merge" the result of the mapReduce operation into those pre-existing keys to produce the report.
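One way that external table and merge could look in client code is sketched below. It assumes results shaped like inline mapReduce output, `{ _id: { start, end }, value: { lights_on } }`; the helper names `emptyPeriods` and `mergeResults` and the sample result are made up for illustration.

```javascript
const SIX_HOURS_MS = 6 * 60 * 60 * 1000;

// Build an empty entry for every period of the report day, keyed by period start (ms).
function emptyPeriods(dayStart, dayEnd, periodMs) {
  const table = new Map();
  for (let t = dayStart.getTime(); t < dayEnd.getTime(); t += periodMs) {
    table.set(t, { start: new Date(t), end: new Date(t + periodMs), lights_on: [] });
  }
  return table;
}

// Overlay mapReduce-style results onto the pre-existing keys.
function mergeResults(table, results) {
  results.forEach(function (doc) {
    const entry = table.get(doc._id.start.getTime());
    if (entry) entry.lights_on = doc.value.lights_on;
  });
  return Array.from(table.values()); // Map keeps insertion (time) order
}

// Hand-written stand-in for the inline mapReduce output.
const sampleResults = [
  { _id: { start: new Date("2015-01-01T18:00:00Z"), end: new Date("2015-01-02T00:00:00Z") },
    value: { lights_on: ["light_2"] } }
];

const merged = mergeResults(
  emptyPeriods(new Date("2015-01-01T00:00:00Z"), new Date("2015-01-02T00:00:00Z"), SIX_HOURS_MS),
  sampleResults
);
```

With this in place the report has one row per period even when the query matches nothing, and the `counter` trick in the mapper becomes optional.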

Summary

There are a number of calculations on the dates, and since I do not know which language you will ultimately implement this in, I have simply declared everything that works outside the actual mapReduce operation separately. So anything that looks like duplication here is done with that intent, making that part of the logic language independent. Most programming languages can manipulate dates in the way the methods used here do.

The inputs, which are then language specific, are passed in as the options block shown as the last argument to the mapReduce method. Notably, there is the query with its parameterized values, which are all calculated from the date to be reported on. Then there is the "scope", which is a way to pass in values that can be used by the functions within the mapReduce operation.

With those things considered, the JavaScript code of the mapper and reducer remains unaltered, as that is what the method expects as input. Any variables for the process are supplied through the scope and the query, so you get the outcome without changing that code.

It is mainly because the duration of a "light being on" can span several of the periods to be reported on that this becomes something the aggregation framework is not designed to do. It cannot perform the "looping" and "emission of data" required to get to the result, which is why we use mapReduce for this task instead.

That said, great question. I do not know if you had already considered how to achieve the results here, but at least there is now a guide for someone approaching a similar problem.
