Find the median of a large integer set using MapReduce

Is there a quick algorithm to work on the MapReduce framework to find the median from the huge Integer set?

+3
source share
1 answer

This is how I do it. This is a kind of parallel version of serial quickselect. (Some display / reduction tools may not allow you to do everything as easily ...)

Select a small, arbitrary piece of input data set. Sort it sequentially. We will use them as a whole bunch of reference points, in parallel. Call this pivots array and let its size be k .

Perform a mapping / abbreviation as follows: for each value of x in the input dataset, a binary search to find the position of x relative to pivots ; name this position bucket(x) . This is an integer between 0 and k . Reduction step - counting the number of elements in each bucket; define bucket[b] as the number x with bucket(x) = b .

The median should be in the "middle bucket". Select all the values ​​in this median bucket and use the traditional sequential selection algorithm to find the item with the correct index.

+4
source

All Articles