Calculating the median in MapReduce

Can someone give an example of calculating the median / quantiles in MapReduce?

My understanding of the DataFu median is that the "n" mappers sort the data and send it to a single ("1") reducer, which is responsible for sorting all the data from the n mappers and finding the median (middle value). Is my understanding correct?

If so, does this approach scale to a huge amount of data? I can clearly see that single reducer struggling to complete the final task. Thanks.

+13
4 answers

Trying to find the median (middle value) in a series will require that a single reducer be passed the entire range of numbers in order to determine which value is the "middle" one.

Depending on the range and uniqueness of the values in your input data set, you could introduce a combiner to output the frequency of each value, reducing the number of map outputs sent to your single reducer. Your reducer can then consume the value / frequency pairs to identify the median.
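
A rough sketch of that idea, assuming a Hadoop Streaming job where the combiner is itself a Python script fed each mapper's sorted "value<TAB>count" output (the script name and tab-delimited format are assumptions, not part of the original answer):

    #!/usr/bin/env python3
    # combine_counts.py - hypothetical streaming combiner: collapses sorted
    # "value<TAB>count" lines into one "value<TAB>total" line per distinct value,
    # shrinking what the single median-computing reducer has to receive.
    import sys

    current_value, current_count = None, 0
    for line in sys.stdin:
        value, count = line.rstrip("\n").split("\t", 1)
        if value == current_value:
            current_count += int(count)
        else:
            if current_value is not None:
                print(f"{current_value}\t{current_count}")
            current_value, current_count = value, int(count)
    if current_value is not None:
        print(f"{current_value}\t{current_count}")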

Another way you could scale this (again, if you know the range and rough distribution of the values) is to use a custom partitioner that distributes the keys by range buckets (0-99 go to reducer 0, 100-199 to reducer 1, and so on). This will, however, require some secondary job to examine the reducer outputs and perform the final median calculation (knowing, for example, the number of keys in each reducer, you can calculate which reducer's output will contain the median, and at which offset).
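
Conceptually the range partitioner is just a bucketing function; in a real Hadoop job it would be implemented as a Java Partitioner, and the bucket width of 100 below is purely an illustrative assumption:

    # Sketch of range partitioning: contiguous value ranges map to reducer indexes,
    # so each reducer receives a disjoint, ordered slice of the key space.
    def range_partition(value: int, num_reducers: int, bucket_width: int = 100) -> int:
        """0-99 -> reducer 0, 100-199 -> reducer 1, and so on (clamped at the end)."""
        return min(value // bucket_width, num_reducers - 1)

    # With 4 reducers: 250 lands on reducer 2, 950 is clamped onto reducer 3.
    assert range_partition(250, 4) == 2
    assert range_partition(950, 4) == 3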

+12

Do you really need the exact median and quantiles?

In many cases, you are better off obtaining approximate values and working with them, in particular if you use this for, e.g., data partitioning.

In fact, you can use the approximate quantiles to speed up finding the exact quantiles (actually in O(n/p) time). Here is a rough outline of the strategy:

  • Have the mapper for each partition compute the desired quantiles and output them to a new data set. This data set should be several times smaller (unless you ask for too many quantiles!)
  • Within this data set, compute the quantiles again, similar to a "median of medians". These are your initial estimates.
  • Repartition the data according to these quantiles (or even additional partitions obtained this way). The goal is that, in the end, the true quantile is guaranteed to be in one partition, and each partition should contain at most one of the desired quantiles.
  • Within each partition, run a QuickSelect (in O(n)) to find the true quantile (see the sketch right after this list).
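
A minimal QuickSelect sketch for the last step (random pivots, expected linear time; this is a generic illustration, not code from the answer):

    import random

    def quickselect(values, k):
        """Return the k-th smallest element (0-based) in expected O(n) time."""
        values = list(values)
        while True:
            pivot = random.choice(values)
            lows = [v for v in values if v < pivot]
            pivots = [v for v in values if v == pivot]
            if k < len(lows):
                values = lows
            elif k < len(lows) + len(pivots):
                return pivot
            else:
                k -= len(lows) + len(pivots)
                values = [v for v in values if v > pivot]

    # Example: the median of shuffled 0..100 is the element at index 50, i.e. 50.
    data = list(range(101))
    random.shuffle(data)
    assert quickselect(data, len(data) // 2) == 50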

Each of the steps is linear time. The most costly step is step 3, as it requires the whole data set to be redistributed, so it generates O(n) network traffic. You can probably optimize the process by choosing "alternate" quantiles for the first iteration. Say you want to find the global median. You cannot find it easily in a linear process, but you can probably narrow it down to 1/k-th of the data set when it is split into k partitions. So instead of having each node report its median, have each node additionally report the objects at ranks (k-1)/(2k) and (k+1)/(2k). This should allow you to narrow down significantly the range of values where the true median must lie. In the next step, each node can send the objects that fall within the given range to a single master node, which then selects the median only within this range.
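
A single-process sketch of that narrowing trick (roughly equal partition sizes assumed; in a real job the per-partition reports and the final selection would of course be separate MapReduce steps):

    # Each of the k partitions reports its local elements at fractional ranks
    # (k-1)/(2k) and (k+1)/(2k). The global median should then lie between the
    # smallest "low" report and the largest "high" report, so only values inside
    # that range need to be shipped to the master node.
    def local_report(partition, k):
        s = sorted(partition)
        n = len(s)
        return s[int(n * (k - 1) / (2 * k))], s[int(n * (k + 1) / (2 * k))]

    def narrowed_median(partitions):
        k = len(partitions)
        reports = [local_report(p, k) for p in partitions]
        lo = min(low for low, _ in reports)
        hi = max(high for _, high in reports)
        below = sum(1 for p in partitions for v in p if v < lo)  # global rank of lo
        candidates = sorted(v for p in partitions for v in p if lo <= v <= hi)
        total = sum(len(p) for p in partitions)
        # Element at global rank total // 2 (the upper median for even counts).
        return candidates[total // 2 - below]

    # Example with k = 3 partitions of shuffled 0..30: the true median is 15.
    import random
    data = list(range(31))
    random.shuffle(data)
    assert narrowed_median([data[0:10], data[10:20], data[20:31]]) == 15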

+7

O((n log n) / p) to sort it, and then O(1) to get the median.

Yes... you can get O(n/p), but you cannot use the out-of-the-box sort functionality in Hadoop. I would just sort and grab the center element, if you cannot justify the 2-20 hours of development time to code the parallel k-th smallest algorithm.
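
A minimal sketch of "just sort and take the center element", assuming the sorted output has already been collected into a single file with one number per line (the helper name and file layout are assumptions):

    # Two passes over an already-sorted file: count the records, then stop at the
    # middle index(es) and average them (one index for odd counts, two for even).
    def median_of_sorted_file(path):
        with open(path) as f:
            total = sum(1 for _ in f)
        middle = {(total - 1) // 2, total // 2}
        picked = []
        with open(path) as f:
            for i, line in enumerate(f):
                if i in middle:
                    picked.append(float(line))
                if i >= max(middle):
                    break
        return sum(picked) / len(picked)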

+2

In many real-world scenarios, the cardinality of the values in a data set will be relatively small. In such cases, the problem can be efficiently solved using two MapReduce jobs:

  • Calculate the frequencies of the values in your dataset (basically, a Word Count job)
  • An identity mapper + a reducer which calculates the median based on the <value - frequency> pairs

Job 1 drastically reduces the amount of data and can be executed fully in parallel. The reducer of job 2 only has to process n items ( n = cardinality of your value set ) instead of all the values, as with the naive approach.
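
For completeness, a minimal Hadoop Streaming sketch of job 1 (the script names and tab-delimited format are assumptions): the mapper emits "value<TAB>1" and the reducer sums the counts per value.

    #!/usr/bin/env python3
    # freq_mapper.py - emit "value<TAB>1" for every non-empty input value.
    import sys
    for line in sys.stdin:
        value = line.strip()
        if value:
            print(f"{value}\t1")

    #!/usr/bin/env python3
    # freq_reducer.py - sum the counts per value (input arrives grouped by value).
    import sys
    from itertools import groupby
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for value, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{value}\t{sum(int(n) for _, n in group)}")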

Below is an example reducer for job 2. It is a Python script that can be used directly in Hadoop Streaming. It assumes the values in your dataset are ints, but it can easily be adapted for doubles.

    import sys

    # Reducer for job 2: input lines are "value<TAB>frequency" pairs and must
    # arrive sorted numerically by value for the index arithmetic below to hold.
    item_to_index_range = []
    total_count = 0

    # Store in memory a mapping of each value to the range of (1-based) indexes
    # it would occupy in a sorted list of all values.
    for line in sys.stdin:
        item, count = line.strip().split("\t", 1)
        new_total_count = total_count + int(count)
        item_to_index_range.append((int(item), (total_count + 1, new_total_count + 1)))
        total_count = new_total_count

    # Calculate the index(es) of the middle item(s)
    middle_items_indexes = [(total_count // 2) + 1]
    if total_count % 2 == 0:
        middle_items_indexes += [total_count // 2]

    # Retrieve the middle item(s)
    middle_items = []
    for i in middle_items_indexes:
        for item, (start, end) in item_to_index_range:
            if start <= i < end:
                middle_items.append(item)
                break

    print(sum(middle_items) / len(middle_items))

This answer builds on a suggestion from Chris White's answer. That answer suggests using a combiner as a means to calculate the frequencies of the values. However, in MapReduce, combiners are not guaranteed to always be executed. This has some side effects:

  • the reducer will first have to compute the final <value - frequency> pairs and then calculate the median.
  • in the worst case, combiners will never be executed, and the reducer will still have to struggle with processing all the individual values.
0
