Find the median of a large integer set using MapReduce

Question

Find the median of a large integer set using MapReduce

Is there a quick algorithm to work on the MapReduce framework to find the median from the huge Integer set?

+3

mapreduce

user882107 Aug 6 '11 at 16:40

source share

1 answer

Louis Wasserman · Answer 1 · 2011-10-06T18:59:49+0000

This is how I do it. This is a kind of parallel version of serial quickselect. (Some display / reduction tools may not allow you to do everything as easily ...)

Select a small, arbitrary piece of input data set. Sort it sequentially. We will use them as a whole bunch of reference points, in parallel. Call this pivots array and let its size be k .

Perform a mapping / abbreviation as follows: for each value of x in the input dataset, a binary search to find the position of x relative to pivots ; name this position bucket(x) . This is an integer between 0 and k . Reduction step - counting the number of elements in each bucket; define bucket[b] as the number x with bucket(x) = b .

The median should be in the "middle bucket". Select all the values in this median bucket and use the traditional sequential selection algorithm to find the item with the correct index.

Find the median of a large integer set using MapReduce

More articles: