I have several programs that collect a large amount of information and store it in raw form in a Mongo database. Later, at predetermined intervals, map_reduce operations are run that evaluate subsets of this information.
The raw data must be kept and cannot be deleted, but the map_reduce operations do not necessarily need all of the raw data to do their work. Instead, I have built the map_reduce operations so that they only work on the most recently collected data that has not yet been evaluated. A second map_reduce operation is called later, which handles reducing the refined results.
So I need to specify a query filter so that raw data that has already been reduced is not re-reduced by every map_reduce operation. The solution I came up with was to specify a filter (i.e. pass a query to map_reduce) that selects ONLY records whose date_collected field is newer than a given date.
At first I tried using the following code:
for k in SomeData.objects.filter(date_collected__gt=BULK_REQUEST_DATE).map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'}):
    print k.value
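For reference, map_f and reduce_f are plain JavaScript strings along these lines (a simplified sketch, not my exact functions; the emitted key and value are just illustrative):

map_f = """
function() {
    // emit one value per raw document; the real map emits whatever metric is being evaluated
    emit(this.source, 1);
}
"""

reduce_f = """
function(key, values) {
    // sum the emitted values for each key
    var total = 0;
    values.forEach(function(v) { total += v; });
    return total;
}
"""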
I also tried this with a narrower filter (just to make sure I didn't have the dates backwards). That did not work either.
Now here is the interesting part. If I remove the map_reduce method call from the chain and just print k, as in:
for k in SomeData.objects.filter(date_collected__gt=BULK_REQUEST_DATE): print k
The filter works fine, and only data collected after a certain point in time is selected.
Then I hacked on the MongoEngine queryset.py module and added an optional parameter to the map_reduce method so that a query could be passed through to the underlying map_reduce function, as in:
q = {'date_collected': {'$lte': BULK_REQUEST_DATE}}
for k in SomeData.objects.filter(date_collected__lte=BULK_REQUEST_DATE).map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'}, query=q):
    print k.value
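The hack itself just threads the extra keyword through to the underlying PyMongo collection. Written as a standalone helper it amounts to roughly this (a sketch of the idea, not the actual queryset.py code, which also handles scope, limit, etc.):

def map_reduce_with_query(queryset, map_f, reduce_f, output, query=None):
    # queryset._collection is MongoEngine's handle on the underlying PyMongo collection
    kwargs = {}
    if query is not None:
        kwargs['query'] = query            # forwarded untouched to the mapReduce command
    # note: this returns the raw PyMongo output collection, not MongoEngine documents
    return queryset._collection.map_reduce(map_f, reduce_f, output, **kwargs)

q = {'date_collected': {'$lte': BULK_REQUEST_DATE}}
result = map_reduce_with_query(SomeData.objects, map_f, reduce_f,
                               {'merge': 'COLLECTION'}, query=q)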
Again, this did not produce the expected results, but there were no errors either. I was able to break the map_reduce operation by submitting a badly formed query, or by changing the $lte operator to something nonsensical like $asdfjla, so I know the query I passed into the map_reduce method was being evaluated and at least wasn't causing problems.
In every one of the attempts above, the map_reduce operation ran over the entirety of the data in the raw collection. None of my attempts broke the map_reduce operation, but none of them managed to limit it to a subset of the data either.
Can anyone point out a flaw in my date comparison logic?
The dates are stored in Mongo as Python datetime objects. I also tried converting the dates to ISO format before comparing them; that did not work from either the Python or the JavaScript side.
Any help would be greatly appreciated! Thanks.
UPDATE
I have determined that the problem is NOT with MongoEngine.
The problem is how PyMongo datetime objects are compared in JavaScript by operators such as $gte or $lte. For some reason, the datetime objects are not treated as dates, or are not properly converted to JavaScript Date objects.
I have not been able to find out much more than this, but if you have any pointers, I would definitely appreciate them!
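If it helps, here is the kind of sanity check I have been running to see what the server-side JavaScript actually receives for the date field (PyMongo, inline output so nothing gets written; the database and collection names are placeholders matching my examples):

from pymongo import Connection    # PyMongo 1.x/2.x style, matching the era of this code

db = Connection()['mydb']         # hypothetical database name; 'data' is a placeholder collection
debug_map = """
function() {
    // for a true BSON date this should emit 'object/true'
    emit(typeof this.date_collected + '/' + (this.date_collected instanceof Date), 1);
}
"""
debug_reduce = """
function(key, values) {
    var total = 0;
    values.forEach(function(v) { total += v; });
    return total;
}
"""
for doc in db.data.inline_map_reduce(debug_map, debug_reduce):
    print doc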
UPDATE
I switched from testing MongoEngine to testing PyMongo directly. The following code does not produce the expected results. Note: epochtime is a field containing the number of seconds since the epoch (as an int) at which the document was created. timestamp is also an int, created at runtime.
j = db.data.map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'}, query={'epochtime': {'$lte': timestamp}})
for x in j.find():
    print x
I would expect that with $lte the for loop prints every x, since timestamp > epochtime always holds, and conversely that with $gte nothing is printed. Instead, the same values are printed in both cases; it makes no difference whether I use the $lte or $gte operator.
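For anyone who wants to reproduce this, the whole test is essentially the following (a minimal sketch; 'testdb' and the trivial map/reduce functions are placeholders, while 'data' and 'epochtime' are as described above):

import time
from pymongo import Connection       # PyMongo 1.x/2.x style, matching the code above

db = Connection()['testdb']          # hypothetical database name
map_f = "function() { emit(this._id, this.epochtime); }"      # trivial passthrough map
reduce_f = "function(key, values) { return values[0]; }"      # only called if a key repeats

timestamp = int(time.time())         # larger than every stored epochtime value
j = db.data.map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'},
                       query={'epochtime': {'$lte': timestamp}})
for x in j.find():
    print x                          # swapping $lte for $gte prints the same documents for me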
What am I missing?
UPDATE
I ran the same operation as in my previous update, except that instead of using seconds since the epoch, I reset each epochtime field in the collection to be an incrementing number starting from 1. I also set timestamp = 1. Then I ran the map_reduce operation. It worked correctly.
Could this mean there is a problem with the byte size of the field values? I reproduced the results above using a float field: it worked for small floats, but not for a float representing the number of seconds (with decimals) since the epoch.
I am definitely missing something fundamental here...
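To be concrete, the renumbering test looked roughly like this (continuing from the sketch above, with the output collection dropped first so old merge results don't muddy the check):

# renumber epochtime as 1, 2, 3, ... across the collection
n = 1
for doc in db.data.find():
    db.data.update({'_id': doc['_id']}, {'$set': {'epochtime': n}})
    n += 1

db.COLLECTION.drop()                 # start from an empty output collection for this check
timestamp = 1
j = db.data.map_reduce(map_f, reduce_f, {'merge': 'COLLECTION'},
                       query={'epochtime': {'$lte': timestamp}})
for x in j.find():
    print x                          # only the epochtime = 1 document comes back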
UPDATE
I have found what may be causing the problem. When I use the merge output option for map_reduce, it successfully filters based on the query and then saves the first pass of data to the specified collection. However, this only works once. On subsequent runs the condition in the query does not work consistently, if at all. This only seems to affect the merge output option; it does not happen when using the replace, reduce, or inline output modes. In addition, when merge is used a second time against the same collection, whether the condition in the query argument works seems to depend on the size of the two values being compared (see the previous update).
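For comparison, the same call with inline output respects the query every time for me (PyMongo's inline_map_reduce returns the result documents directly instead of writing to a collection):

results = db.data.inline_map_reduce(map_f, reduce_f,
                                    query={'epochtime': {'$lte': timestamp}})
for x in results:
    print x          # with inline output the query filter is applied on every run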
I have no idea what this means or why this is happening.