How to map-reduce on a key from another collection

Let's say I have a collection of users like this:

{ "_id" : "1234", "Name" : "John", "OS" : "5.1", "Groups" : [{ "_id" : "A", "Name" : "Group A" }, { "_id" : "C", "Name" : "Group C" }] } 

And a collection of events like this:

 { "_id" : "15342", "Event" : "VIEW", "UserId" : "1234" } 

I can use map-reduce to count the number of events per user, since I can just emit the "UserId" and count the emits. However, what I want to do is count events by group.
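For reference, here is a minimal sketch of that per-user count in the mongo shell, assuming the collections are named events and users to match the example documents (the output name events_per_user is illustrative):

    // Count events per user: emit each event's UserId with a count of 1.
    var map = function () {
        emit(this.UserId, 1);
    };
    // Sum the per-user counts.
    var reduce = function (key, values) {
        return Array.sum(values);
    };
    db.events.mapReduce(map, reduce, { out: "events_per_user" });
    // => documents like { "_id" : "1234", "value" : 9 }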

If there were an array of "Groups" in my event document this would be easy, but there isn't one. This is just an example; the real application is much more complicated, and I don't want to replicate all of that data into every event document.

I have seen an example at http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/ , but I don't see how it applies here, since it aggregates values from two places... all I really want to do is a lookup.

In SQL, I would simply JOIN my flattened UserGroup table to the event table and GROUP BY UserGroup.GroupName.

I would be happy with multiple passes of map-reduce... the first pass would reduce to per-user counts like { "_id" : "1234", "count" : 9 }, but I'm stuck on the next pass: how do I bring in the group ID?

Some potential approaches I have considered:

  • Include the group information in the event document (not possible)
  • Figure out how to "join" to the user collection, or look up the user's groups, from within the map function so that I can emit the group ID as well (I don't know how to do this)
  • Figure out how to "join" the event and user collections into a third collection that I can run map-reduce over

What is possible and what are the benefits / problems with each approach?

1 answer

Your third approach is the way to go:

Figure out how to "join" the event and user collections into a third collection that I can run map-reduce over.

To do this, you need to build a new collection J containing the "joined" data that the map-reduce needs (a sketch of what J and the final map-reduce over it might look like appears after the list below). There are several strategies for building it:

  • Update the application to insert/update J during normal operation. This is best when you need to run the MR very often and on up-to-date data, but it can significantly increase code complexity. Implementation-wise, you can do it either directly (by writing to J) or indirectly (by writing changes to a log collection L, and then applying the "new" changes to J). If you choose the logging approach, you'll need a strategy for determining what has changed. There are two common ones: a high-watermark (based on _id or a timestamp) and using the log collection as a queue drained with findAndModify (a sketch of the queue drain follows this list).

  • Create/update J in batch mode. This is the way to go for high-throughput systems, where the many small updates of the strategy above would hurt performance. It is also the way to go if you don't need to run the MR very often and/or don't need up-to-the-second accuracy.
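For the log-as-queue variant mentioned above, a minimal mongo shell sketch; the collection name L, the applied flag, and the applyToJ helper are all illustrative assumptions, not from the original post:

    // Drain the log collection L as a queue: atomically claim one unapplied
    // entry at a time and fold it into J.
    var entry;
    while ((entry = db.L.findAndModify({
        query:  { applied: { $ne: true } },
        sort:   { _id: 1 },
        update: { $set: { applied: true } }
    })) !== null) {
        applyToJ(entry);   // hypothetical helper that applies the change to J
    }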
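Either way, to make the payoff concrete: assuming each document in J carries the event plus the user's group IDs (the field GroupIds and the output name events_by_group are illustrative), the final map-reduce that counts events per group could look like this:

    // A "joined" document in J might look like:
    // { "_id" : "15342", "Event" : "VIEW", "UserId" : "1234", "GroupIds" : [ "A", "C" ] }
    var map = function () {
        this.GroupIds.forEach(function (g) {
            emit(g, 1);    // one count for each group the user belongs to
        });
    };
    var reduce = function (key, values) {
        return Array.sum(values);
    };
    db.J.mapReduce(map, reduce, { out: "events_by_group" });
    // => documents like { "_id" : "A", "value" : 42 }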

If you go with (2), you will have to iterate over the documents in the collections you need to join; as you found out, Mongo's map-reduce will not help you here. There are several ways to do this:

  • If you don't have a large number of documents, and they are small, you can iterate outside the database over a direct connection to it (see the first sketch after this list).

  • If you can't do (1), you can iterate inside the database using db.eval() (see the second sketch after this list). If the number of documents is not small, be sure to use nolock: true, since db.eval() otherwise holds the global lock for the whole run. This is usually the strategy I choose, because I deal with very large sets of documents and can't afford to move them across the network.

  • If you can't do (1) and don't want to do (2), you can clone the collections to another node with a temporary database. Mongo has a convenient cloneCollection command for this. Note that it does not work if the database requires authentication (don't ask why; it's a strange 10gen design choice). In that case you can use mongodump and mongorestore instead. Once you have the data local to the new database, you can join it at your leisure. When the MR is done, you can update the results collection in your production database. I use this strategy for one-off map-reduces with heavy preprocessing, so as not to load the production replicas.
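A sketch of the first option, iterating over a direct connection (here simply the mongo shell) and rebuilding J from scratch; the collection and field names are the assumed ones from above:

    // Rebuild J in batch: for every event, look up the user and embed the group IDs.
    db.J.drop();
    db.events.find().forEach(function (ev) {
        var user = db.users.findOne({ _id: ev.UserId });
        var groupIds = user ? user.Groups.map(function (g) { return g._id; }) : [];
        db.J.insert({
            _id:      ev._id,
            Event:    ev.Event,
            UserId:   ev.UserId,
            GroupIds: groupIds
        });
    });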
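And a sketch of the db.eval() route: the same loop is shipped to the server via the eval command, with nolock: true passed through runCommand (the db.eval() shell helper itself does not take options). Again, all names are assumptions:

    // Run the join loop server-side; nolock: true avoids holding the global
    // write lock for the duration of the batch (the inner operations still
    // take their own locks as usual).
    db.runCommand({
        eval: function () {
            db.events.find().forEach(function (ev) {
                var user = db.users.findOne({ _id: ev.UserId });
                var groupIds = user ? user.Groups.map(function (g) { return g._id; }) : [];
                db.J.insert({ _id: ev._id, Event: ev.Event, UserId: ev.UserId, GroupIds: groupIds });
            });
        },
        nolock: true
    });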

Good luck
