I am using Yelp's MRJob library to run MapReduce jobs. I know that MapReduce has a built-in sort and shuffle phase that sorts values based on their keys. So, if I have the following results after the map phase:
(1, 24) (4, 25) (3, 26)
I know that the sort and shuffle phase will produce the following output:
(1, 24) (3, 26) (4, 25)
This is what I expect.
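To make the expected behavior concrete, here is a minimal sketch in plain Python (not MRJob itself) of a sort and shuffle step that orders mapper output by key only and then groups the values per key. The function name `sort_and_shuffle` is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_shuffle(pairs):
    """Sort (key, value) pairs by key only, then group the values per key."""
    # sorted() is stable, so values that share a key keep their input order.
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


mapper_output = [(1, 24), (4, 25), (3, 26)]
grouped = sort_and_shuffle(mapper_output)
print(grouped)  # [(1, [24]), (3, [26]), (4, [25])]
```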
But if I have two identical keys with different values, why does the sort and shuffle phase also sort the data based on the first value that appears?
For example, if I have the following list of values from the mapper:
(2, <25, 26>) (1, <24, 23>) (1, <23, 24>)
Expected Result:
(1, <24, 23>) (1, <23, 24>) (2, <25, 26>)
But the result I actually get is:
(1, <23, 24>) (1, <24, 23>) (2, <25, 26>)
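The two orderings above can be reproduced in plain Python. This is only a sketch of the difference between sorting by key alone and sorting whole records; it is an assumption, not something taken from MRJob's source, that the observed output comes from comparing the entire (key, value) record. The values are modeled here as tuples.

```python
from operator import itemgetter

# Mapper output from the example, with each value modeled as a tuple.
mapper_output = [(2, (25, 26)), (1, (24, 23)), (1, (23, 24))]

# Sorting on the key only: the sort is stable, so records with the
# same key stay in the order the mapper emitted them.
key_only = sorted(mapper_output, key=itemgetter(0))
print(key_only)      # [(1, (24, 23)), (1, (23, 24)), (2, (25, 26))]

# Sorting on the whole record: ties on the key are broken by the value.
whole_record = sorted(mapper_output)
print(whole_record)  # [(1, (23, 24)), (1, (24, 23)), (2, (25, 26))]
```

`key_only` matches the expected result and `whole_record` matches the result actually observed, which would be consistent with the sort comparing complete records rather than keys alone.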
Is this specific to the MRJob library? Is there a way to stop this sorting based on values?
Code:
from mrjob.job import MRJob
import math


class SortMR(MRJob):

    def steps(self):
        return [self.mr(mapper=self.rangemr, reducer=self.rangesort)]

    def rangemr(self, key, line):
        for a in line.split():
            yield 1, a

    def rangesort(self, numid, line):
        for a in line:
            yield (1, a)


if __name__ == '__main__':
    SortMR.run()