Listing and counting the unique MongoDB values for every key

I am preparing a descriptive "schema" (quelle horreur) for a MongoDB database I have been working with.

I used the excellent variety.js to build a list of all the keys and show the coverage of each key. However, where the values for a key come from a small set, I would like to list the entire set as the "available values". In R, I would treat them as the "factors" of a categorical variable, e.g. Gender: ["M", "F"].

I know I could just use R + RMongo, query each variable, and follow essentially the same procedure I would use to build a histogram, but I would like to know the proper Mongo query / JavaScript / map-reduce way to approach this. I understand that the db.collection.aggregate() functions are meant for exactly this.

Before asking this, I looked at several references, but I cannot get the pipeline order right. For example, if I have documents like these:

    {_id : 1, "key1" : "value1", "key2": "value3"}
    {_id : 2, "key1" : "value2", "key2": "value3"}

I would like to return something like:

    {"key1" : ["value1", "value2"]}
    {"key2" : ["value3"]}

Or better, with counts:

    {"key1" : {"value1" : 1, "value2" : 1}}
    {"key2" : {"value3" : 2}}

I realize that one problem is any key whose values span a wide range, such as text fields or continuous variables. Ideally, if a key had more than x distinct values, the list would be truncated to, say, no more than 20 unique values. If I found there were actually more, I would query that variable directly.

Something like this:

 db.collection.aggregate( {$limit: 20, $group: { _id: "$??varname", count: {$sum: 1} }}) 

First, how can I reference ??varname so that it stands for the name of each key?

I saw this question, which gets 95% of the way there: Combination and tabulation (unique / quantity) in Mongo

With this input data:

    { "_id" : 1, "age" : 22.34, "gender" : "f" }
    { "_id" : 2, "age" : 23.9, "gender" : "f" }
    { "_id" : 3, "age" : 27.4, "gender" : "f" }
    { "_id" : 4, "age" : 26.9, "gender" : "m" }
    { "_id" : 5, "age" : 26, "gender" : "m" }

This script:

    db.collection.aggregate(
        {$project: {gender: 1}},
        {$group: {_id: "$gender", count: {$sum: 1}}}
    )

It produces:

    {
        "result" : [
            {"_id" : "m", "count" : 2},
            {"_id" : "f", "count" : 3}
        ],
        "ok" : 1
    }

But I don't understand how to do this in the general case, for an unknown number of keys with unknown names and a potentially large number of returned values. This example already knows that the key name is gender and that the set of answers is small (2 values).

1 answer

If you have already run a script that outputs the names of all the keys in the collection, you can generate the aggregation pipeline dynamically. That means either extending that .js script or writing your own.

Here is what it might look like in JS if it were passed an array called "keys" holding several non-"_id" field names (I am assuming top-level fields only, and that you don't care about arrays, embedded documents, etc.):

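    // Field names gathered beforehand (top-level, non-"_id" keys)
    keys = ["key1", "key2"];
    // A single $group over the whole collection (_id: null)
    group = { "$group" : { "_id" : null } };
    // Add one $addToSet accumulator per key, named e.g. "key1List"
    keys.forEach( function(f) {
        group["$group"][f+"List"] = { "$addToSet" : "$" + f };
    } );
    db.collection.aggregate(group);

    { "result" : [ { "_id" : null, "key1List" : [ "value2", "value1" ], "key2List" : [ "value3" ] } ], "ok" : 1 }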
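The question also asked for per-value counts and a cap of about 20 values per key, which $addToSet alone does not give. One possible way to get both is sketched below, under some assumptions: the server supports the $facet stage (MongoDB 3.4 or later, so newer than the shell output shown above), the keys are top-level fields, and buildValueCountStage is a hypothetical helper name, not part of MongoDB. It builds one sub-pipeline per key that groups by value, sorts by frequency, and keeps at most maxValues entries.

```javascript
// Hypothetical helper (not a MongoDB API): builds a single $facet stage with
// one sub-pipeline per key. Each sub-pipeline counts the distinct values of
// that key, orders them most frequent first, and keeps at most maxValues.
function buildValueCountStage(keys, maxValues) {
  var facets = {};
  keys.forEach(function (key) {
    facets[key] = [
      // One output document per distinct value of this key
      { $group: { _id: "$" + key, count: { $sum: 1 } } },
      // Most frequent first; ties broken by value for a stable order
      { $sort: { count: -1, _id: 1 } },
      // Truncate wide-ranging keys (free text, continuous variables)
      { $limit: maxValues }
    ];
  });
  return { $facet: facets };
}

var keys = ["key1", "key2"];
var stage = buildValueCountStage(keys, 20);

// In the mongo shell, against a real collection (assumes $facet support):
// db.collection.aggregate([stage]);
```

The aggregate call should return a single document with one array per key, each holding entries of the form { "_id" : <value>, "count" : <n> }. Documents that lack a key are grouped under "_id" : null for that key. Note that each sub-pipeline still groups every distinct value before the $limit applies, so a key with very many distinct values may be better queried on its own.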