I am preparing a descriptive "schema" (quelle horreur) for a MongoDB database I have been working with.
I used the excellent variety.js to create a list of all the keys and show the coverage of each key. However, where the values corresponding to a key come from a small set, I would like to be able to list the entire set as "available values". In R, I would treat these as "factors" of a categorical variable, e.g. Gender: ["M", "F"].
I know I could just use R + RMongo, query each variable, and basically follow the same procedure I would use to build a histogram, but I would like to know the proper Mongo query / JavaScript / map-reduce way to approach this. I understand that the db.collection.aggregate() functions are meant for exactly this.
Before asking, I looked at some references, but I could not get the pipeline order right. So, for example, if I have documents like these:
{_id : 1, "key1" : "value1", "key2": "value3"}
{_id : 2, "key1" : "value2", "key2": "value3"}
I would like to return something like:
{"key1" : ["value1", "value2"]} {"key2" : ["value3"]}
Or better, with counts:
{"key1" : {"value1" : 1, "value2" : 1}} {"key2" : {"value3" : 2}}
I recognize that one problem will be keys whose values span a wide range of distinct values, such as text fields or continuous variables. Ideally, if a key had more than x distinct values, it would be nice to truncate the list, say to no more than 20 unique values. If I find I actually need more, I would query that variable directly.
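To make the target concrete, here is a minimal in-memory sketch of the tabulation I am after, written in plain JavaScript rather than as a Mongo query. The helper name tabulate, its maxValues parameter, and the counts/truncated fields are all hypothetical names of my own; values are compared as strings, so nested documents are not handled.

```javascript
// Hypothetical helper (not a MongoDB API): count the distinct values of every
// key across an array of documents, keeping at most maxValues values per key.
function tabulate(docs, maxValues) {
  var summary = {};
  docs.forEach(function (doc) {
    Object.keys(doc).forEach(function (key) {
      if (key === "_id") return; // _id is unique per document, so skip it
      var entry = summary[key] ||
        (summary[key] = { counts: {}, distinct: 0, truncated: false });
      var value = String(doc[key]); // assumption: compare values as strings
      if (Object.prototype.hasOwnProperty.call(entry.counts, value)) {
        entry.counts[value] += 1;
      } else if (entry.distinct < maxValues) {
        entry.counts[value] = 1;
        entry.distinct += 1;
      } else {
        entry.truncated = true; // more than maxValues distinct values seen
      }
    });
  });
  return summary;
}

var example = tabulate([
  { _id: 1, key1: "value1", key2: "value3" },
  { _id: 2, key1: "value2", key2: "value3" }
], 20);
// example.key1.counts is { value1: 1, value2: 1 }
// example.key2.counts is { value3: 2 }
```

A key flagged as truncated would be one I then query directly.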
I imagine something like:
db.collection.aggregate( {$limit: 20, $group: { _id: "$??varname", count: {$sum: 1} }})
First, how can I refer to varname, i.e. the name of each key?
I saw this link, which gets 95% of the way there: Combination and tabulation (unique / quantity) in Mongo
With this input data:
{ "_id" : 1, "age" : 22.34, "gender" : "f" }
{ "_id" : 2, "age" : 23.9, "gender" : "f" }
{ "_id" : 3, "age" : 27.4, "gender" : "f" }
{ "_id" : 4, "age" : 26.9, "gender" : "m" }
{ "_id" : 5, "age" : 26, "gender" : "m" }
This script:
db.collection.aggregate( {$project: {gender:1}}, {$group: { _id: "$gender", count: {$sum: 1} }})
It produces:
{"result" : [ {"_id" : "m", "count" : 2}, {"_id" : "f", "count" : 3} ], "ok" : 1 }
But I don't understand how to do this in the general case, for an unknown number of keys with unknown names, and with a potentially large number of returned values. This example already knows that the key name is gender and that the set of answers will be small (2 values).
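One direction I have sketched, without being sure it is the right one, is a map-reduce that emits every (key, value) pair with a count of 1 and sums the counts. The names mapFn, reduceFn and runLocally are my own; runLocally is only a hypothetical stand-in that mimics the emit/reduce cycle in plain JavaScript so the two functions can be tried without a server.

```javascript
// Provided by the server inside a real mapReduce; assigned by the harness below.
var emit;

// Map: for every key of the current document except _id, emit the
// (key, value) pair with a count of 1. `this` is the current document.
function mapFn() {
  for (var key in this) {
    if (key === "_id") continue;
    emit({ key: key, value: this[key] }, 1);
  }
}

// Reduce: sum the counts emitted for one (key, value) pair.
function reduceFn(pair, counts) {
  var total = 0;
  for (var i = 0; i < counts.length; i++) total += counts[i];
  return total;
}

// Hypothetical local harness (not part of MongoDB): groups emitted values by
// pair and reduces each group, returning rows shaped like { _id, value }.
function runLocally(docs) {
  var groups = {};
  emit = function (pair, count) {
    var id = JSON.stringify(pair);
    if (!groups[id]) groups[id] = { _id: pair, values: [] };
    groups[id].values.push(count);
  };
  docs.forEach(function (doc) { mapFn.call(doc); });
  return Object.keys(groups).map(function (id) {
    var g = groups[id];
    return { _id: g._id, value: reduceFn(g._id, g.values) };
  });
}
```

In the shell this would presumably be run as db.collection.mapReduce(mapFn, reduceFn, { out: { inline: 1 } }), giving one row per (key, value) pair that still has to be regrouped by key and truncated to 20 values afterwards. I have not verified how it behaves on keys with very many distinct values.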