MongoEngine / PyMongo date comparison filter on map_reduce operation fails

I have created several programs that collect large amounts of information and store it in raw form in a Mongo database. Later, at predetermined intervals, map_reduce operations are run that evaluate subsets of this information.

The raw data must be stored and cannot be deleted, but the map_reduce operations do not necessarily need all of the raw data in order to work. Instead, I built the map_reduce operations so that they only operate on the most recently collected data that has not yet been evaluated. A second map_reduce operation is run later, which handles reducing the already-refined data.

So I need to specify a query filter so that raw data that has already been reduced is not reduced again by every map_reduce operation. The solution I came up with was to specify a filter (or pass a query to the map_reduce call) that selects ONLY records whose date_collected field is newer than a given date.
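For reference, map_f and reduce_f below are JavaScript strings along these lines (a simplified sketch: the emitted key and value fields are placeholders, not the actual fields my programs aggregate):

    # Simplified sketch of the map/reduce pair referenced below; 'source_id'
    # and 'value' are placeholder field names.
    map_f = """
        function () {
            emit(this.source_id, this.value);
        }
    """

    reduce_f = """
        function (key, values) {
            return Array.sum(values);   // Array.sum is available in mongod's JS engine
        }
    """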

At first I tried using the following code:

    for k in SomeData.objects.filter(date_collected__gt=BULK_REQUEST_DATE).map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'}):
        print k.value

I also tried this with a less-than filter (just to make sure I did not have my dates backwards). That did not work either.

Now here is the interesting part. If I remove the map_reduce method call from the chain and just print k, as in:

    for k in SomeData.objects.filter(date_collected__gt=BULK_REQUEST_DATE):
        print k

The filter works fine, and only data collected after a certain point in time is selected.
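A quick count comparison makes the discrepancy concrete (a sketch using the names from the snippets above; 'COLLECTION' is the merge target):

    # Sketch: the filtered queryset is small, but the map_reduce results
    # reflect the whole raw collection rather than the filtered subset.
    filtered = SomeData.objects.filter(date_collected__gt=BULK_REQUEST_DATE)
    print filtered.count()

    results = list(filtered.map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'}))
    print len(results)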

Then I hacked on the MongoEngine queryset.py module and added an optional parameter to the map_reduce method so that a query could be passed through to PyMongo's map_reduce function, as in:

    q = {'date_collected': {'$lte': BULK_REQUEST_DATE}}
    for k in SomeData.objects.filter(date_collected__lte=BULK_REQUEST_DATE).map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'}, query=q):
        print k.value

Again, this did not produce the expected results, but it raised no errors either. I was able to break the map_reduce operation by submitting a malformed query, or by changing the $lte query operator to something bogus like $asdfjla, so I know that the query I passed down to the map_reduce method was being evaluated and at least was not causing problems.

In all of the above attempts at running the map_reduce operation, the entire raw data collection was processed. None of my attempts broke the map_reduce operation, but none of them managed to limit it to a subset of the data either.

Can anyone point out a flaw in my date comparison logic?

Dates are stored in the Mongo database as Python datetime.time values. I also tried converting the dates to ISO format before comparing them. That did not work on either the Python or the JavaScript side.

Any help would be greatly appreciated! Thanks.

UPDATE

I have determined that the problem is NOT with MongoEngine.

The problem is how PyMongo datetime objects are compared in JavaScript with operators such as $gte or $lte. For some reason the datetime objects are either not treated as dates or are not properly converted to JavaScript Dates.
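The comparison I am making boils down to something like this (a sketch; db is a PyMongo database handle, 'some_data' stands in for whatever collection MongoEngine created, and map_f/reduce_f are as above):

    import datetime

    cutoff = datetime.datetime(2011, 1, 1)   # placeholder cutoff date

    # A plain find() with the same operator honours the date filter...
    print db.some_data.find({'date_collected': {'$gt': cutoff}}).count()

    # ...but the same query dict handed to map_reduce does not appear to
    # restrict which documents get mapped.
    out = db.some_data.map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'},
                                  query={'date_collected': {'$gt': cutoff}})
    print out.count()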

I have not been able to find out much more than this yet, but if you have any pointers I would definitely appreciate them!

UPDATE

I switched from testing against MongoEngine to testing PyMongo directly. The following code does not produce the expected results. Note: epochtime is a field containing the number of seconds since the epoch (as an int) at which the document was created. timestamp is also an int, created at runtime.

    j = db.data.map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'}, query={'epochtime': {'$lte': timestamp}})
    for x in j.find():
        print x

I would expect that when $lte is used, the for loop prints x, since timestamp > epochtime always holds. Conversely, I would expect that if $gte were used, nothing would be printed. Instead, the same values are printed in both cases; it makes no difference whether I use the $lte or the $gte operator.
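A quick way to double-check that the stored values and the runtime timestamp are actually comparable on the Python side (sketch):

    # Sketch: compare a stored epochtime against the runtime timestamp directly.
    doc = db.data.find_one()
    print type(doc['epochtime']), doc['epochtime']
    print type(timestamp), timestamp
    print doc['epochtime'] <= timestamp   # True when compared in Python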

What am I missing?

UPDATE

I ran the same operation as in my previous update, except that instead of the number of seconds since the epoch at creation time, I reset each epochtime field in the collection to an incrementing number starting from 1. I also set timestamp = 1. Then I ran the map_reduce operation. It worked correctly.
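Roughly how I reset the field for that test (a sketch; I am using the old-style Collection.update here, newer PyMongo would use update_one):

    # Sketch: overwrite epochtime with small incrementing integers.
    i = 1
    for doc in db.data.find():
        db.data.update({'_id': doc['_id']}, {'$set': {'epochtime': i}})
        i += 1

    timestamp = 1
    j = db.data.map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'},
                           query={'epochtime': {'$lte': timestamp}})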

This makes me wonder whether there is a problem with the byte size of the fields. I reproduced the above results using a float field: it worked for small floats, but not for a float representing the number of seconds (with decimals) since the epoch.
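One thing worth checking here (sketch) is what BSON type the values were actually stored as, since a "byte size" difference would show up as different BSON types ($type 1 is a double, 16 is a 32-bit int, 18 is a 64-bit int):

    # Sketch: count documents by the BSON type of the epochtime field.
    print db.data.find({'epochtime': {'$type': 1}}).count()    # double
    print db.data.find({'epochtime': {'$type': 16}}).count()   # 32-bit int
    print db.data.find({'epochtime': {'$type': 18}}).count()   # 64-bit int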

I am definitely missing something fundamental here...

UPDATE

I have found what may be causing the problem. When I use the merge output option for map_reduce, it successfully filters based on the query and then saves that first snapshot of the data to the specified collection. However, this only works once: subsequent runs do not apply the query condition consistently, if at all. This seems to affect only the merge output option; it does not happen when using the replace, reduce, or inline output modes. In addition, it seems that when merge is used a second time on the same collection, whether the query condition works depends on the size of the two values being compared (see the previous update).
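For reference, the only thing changing between the runs described above is the out specification (sketch; the query and function names are the same as in the earlier snippets):

    query = {'epochtime': {'$lte': timestamp}}

    # merge: filters correctly on the first run, but not on subsequent runs
    db.data.map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'}, query=query)

    # replace / reduce: the query filter behaves as expected every time
    db.data.map_reduce(map_f, reduce_f, {'replace': 'COLLECTION'}, query=query)
    db.data.map_reduce(map_f, reduce_f, {'reduce': 'COLLECTION'}, query=query)

    # inline: also behaves as expected (PyMongo exposes this as inline_map_reduce)
    print db.data.inline_map_reduce(map_f, reduce_f, query=query)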

I have no idea what this means or why this is happening.
