MapReduce and SQL GROUP BY

Question

I'm trying to understand the basics of MapReduce in MongoDB, and even after implementing it I'm not sure how it differs from SQL GROUP BY, or even from Mongo's own GROUP BY. In SQL Server, a GROUP BY can be executed by either a stream aggregate or a hash aggregate. Is MapReduce like a hash aggregate, just spread over a lot of servers?

I have read in several places that MR for MongoDB should be run as a background process, because it is a "heavy operation". Given that the data is sharded, wouldn't a GROUP BY be equally "heavy"? Note that I am only trying to compare the types of operations that can be implemented both as an MR job and as a GROUP BY query.

Is there something that GROUP BY cannot do, and only MR can do?

Also, Hadoop seems to be very good at MR (this is just what I have read; I have never worked with Hadoop). How does Hadoop MR differ from the Mongo model?

I'm confused. Please help, or point me to a good tutorial that explains the need for MapReduce.

group-by mongodb mapreduce hadoop

Aafreen sheikh Jul 6 '12 at 8:15

2 answers

Many people use MongoDB as the data store and Hadoop for the processing, since there is a connector between the two. Each MongoDB node can serve several Hadoop nodes reading from it. As a note, I would recommend keeping the Mongo and Hadoop nodes separate because of memory.

In case you don't have them, here are some docs for you.

Another thing worth looking at is the new aggregation framework coming out in 2.2. Here's a chart that maps SQL operations to the corresponding operators in the MongoDB aggregation framework.
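To make the SQL comparison concrete, here is a minimal sketch of how a `SELECT cust_id, SUM(amount) AS total FROM orders GROUP BY cust_id ORDER BY total DESC` might be written as an aggregation pipeline from Python. The field names `cust_id` and `amount`, the `orders` collection, and the helper `build_group_pipeline` are assumptions made up for illustration; only the `$group`, `$sum` and `$sort` operators come from the aggregation framework itself.

```python
def build_group_pipeline(key_field, sum_field):
    """Build a pipeline that groups by key_field and sums sum_field.

    Hypothetical helper: the field names are whatever your documents use.
    """
    return [
        # Equivalent of GROUP BY key_field with SUM(sum_field) AS total
        {"$group": {"_id": "$" + key_field, "total": {"$sum": "$" + sum_field}}},
        # Equivalent of ORDER BY total DESC
        {"$sort": {"total": -1}},
    ]


pipeline = build_group_pipeline("cust_id", "amount")
print(pipeline)
```

With pymongo and a running server, a pipeline like this would typically be passed to `collection.aggregate(pipeline)`; that call is left out here so the snippet runs without a database.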

Mark hillick Jul 6 '12 at 9:35

Ms01 · Accepted Answer · 2012-07-06T09:00:01+0000

What you gain with MR is speed when reading the results. GROUP BY is a slow operation in SQL, and MR is even slower in MongoDB. But what you do is create new collections and iterate over those in real time instead. This works very well when you have a large amount of data and want to be able to iterate over it in real time.

In the project I'm working on, there is a Python script running in the background (a cron job) that runs different map/reduce jobs once a day. Instead of iterating over large tables with an SQL GROUP BY, we iterate over them once with MR and then quickly iterate over the newly created collections.

I have no experience with Hadoop, so I'm sorry that I can't fill you in there.
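As a rough illustration of that pattern (not our actual script), here is a pure-Python sketch of the map, group and reduce steps, with the result kept around for fast reads. The sample documents, the field names and the `map_reduce` helper are all made up for the example; in MongoDB the map and reduce functions would be JavaScript and the result would be written to an output collection.

```python
from collections import defaultdict

# Hypothetical sample documents; field names are assumptions for illustration.
orders = [
    {"cust_id": "a", "amount": 10},
    {"cust_id": "b", "amount": 5},
    {"cust_id": "a", "amount": 7},
]


def map_fn(doc):
    """Emit one (key, value) pair per document, like emit() in a Mongo map function."""
    yield doc["cust_id"], doc["amount"]


def reduce_fn(key, values):
    """Combine all values emitted for one key into a single value."""
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    """Group the emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Run the expensive pass once (for example from a daily cron job) and keep the result.
totals_by_customer = map_reduce(orders, map_fn, reduce_fn)

# Later reads use the small precomputed result instead of rescanning all orders.
print(totals_by_customer["a"])  # 17
```

For a simple sum like this, the output is the same as what `SELECT cust_id, SUM(amount) ... GROUP BY cust_id` would return; the difference in the setup described above is that the grouping cost is paid once, ahead of time.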

Tutorial: http://www.mongovue.com/2010/11/03/yet-another-mongodb-map-reduce-tutorial/

EDIT:

Here you can see a complete translation of an SQL query to MongoDB map/reduce. This is taken from: http://rickosborne.org/download/SQL-to-MongoDB.pdf
