How does mapreduce sort and shuffle?

I use the MRJob yelps library to achieve map reduction functionality. I know that map reduce has a built-in sort and shuffle algorithm that sorts values ​​based on their keys. Therefore, if I have the following results after the map phase

(1, 24) (4, 25) (3, 26) 

I know that the sort and shuffle phase will produce the following output

 (1, 24) (3, 26) (4, 25) 

As was expected

But if I have two similar keys and different values, why does the sort and shuffle phase sort the data based on the first value that appears?

For example, if I have the following list of values ​​from mapper

 (2, <25, 26>) (1, <24, 23>) (1, <23, 24>) 

Expected Result:

 (1, <24, 23>) (1, <23, 24>) (2, <25, 26>) 

But the conclusion that I get is

 (1, <23, 24>) (1, <24, 23>) (2, <25, 26>) 

Is this MRjob library specific? Is it worth it to stop this sorting based on values?

CODE

 from mrjob.job import MRJob import math class SortMR(MRJob): def steps(self): return [ self.mr(mapper=self.rangemr, reducer=self.rangesort)] def rangemr(self, key, line): for a in line.split(): yield 1,a def rangesort(self,numid,line): for a in line: yield(1, a) if __name__ == '__main__': SortMR.run() 
+4
source share
4 answers

The local MRjob simply uses the "sort" of the operating system at the output of the mapper.

The cartographer writes in the format:

key <-tab-> value \ n

This way you get keys sorted mainly by key, but secondly by value.

As already noted, this does not happen in the real version of hadoop, but only in local modeling.

+3
source

The only way to sort the values ​​is to use a composite key that contains some information from the value itself. Then your compareTo method can ensure that the keys are first sorted by the actual key component, and then by the value component. Finally, you will need a group separator to make sure that in the reducer all keys with the same β€œkey” component (the actual key) are considered equal, and the associated values ​​are repeated in a single call to the reduction method.

This is called a "secondary grade", a similar question that contains some links to examples.

+4
source

The sorting and shuffling phase does not depend on the order of the values ​​that the reducer receives for the given key.

0
source

Sorting in hasoop is based on keys and therefore does not guarantee the order of values.

0
source

All Articles