Hadoop: send a record to all reducers

How can I send a specific record to all my reducers?

I know about the Partitioner class and what it does, but I don't see an easy way to use it to make sure that a record reaches every reducer.

Basically, Partitioner has this method:

int getPartition(K2 key, V2 value, int numPartitions) 

My first idea was to have the Mapper and the Partitioner cooperate: the Mapper emits the record as many times as there are reduce tasks, and the Partitioner returns each int from 0 to numPartitions-1 in turn, so that the record reaches every partition.

Are there any other, smarter ways to do this? For example, could I return -1 for the records that every partition needs, and have the framework broadcast them for me when it sees the -1?

1 answer

The partitioner does not work that way. Its job is to look at the key (usually) and the value (rarely) and decide which reducer the pair should be sent to. It returns a single partition index, so one emitted pair goes to exactly one reducer. This happens after the mapper and before the reducer.

Instead, you (the mapper) should be able to get hold of a configuration context that can tell you the total number of reducers (partitions). Your mapper can then output a composite key containing the actual key you want plus the partition number. You know how many times to write the record out, because the mapper can find out the number of reducers (see above). All the partitioner then has to do is break the composite key apart, retrieve the index of the intended reducer, and return that index.

By the way, this means that if you use this technique to send counters (if you are sorting) or other metadata that will be used later in processing, your real data keys have to follow the same composite format. In fact, you will probably have to include an indicator in the composite key that describes the type of key/value pair (for example, 1 = real data, 0 = processing metadata).
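
The composite-key approach described above can be sketched roughly as follows. This is plain Java with no Hadoop dependency, so it only models the logic; the names BroadcastSketch, CompositeKey, broadcast and the METADATA/DATA constants are made up for illustration. In a real job the key would have to be a WritableComparable, the loop would live in the Mapper (which can read the reducer count with getNumReduceTasks() on the JobConf or on the mapper's context), and getPartition would live in a Partitioner implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: models the technique without any Hadoop classes.
final class BroadcastSketch {

    /** Composite key: target partition, record type, and the real key. */
    static final class CompositeKey {
        static final int METADATA = 0; // e.g. counters that every reducer needs
        static final int DATA = 1;     // ordinary records

        final int targetPartition;     // only used for METADATA keys
        final int type;
        final String realKey;

        CompositeKey(int targetPartition, int type, String realKey) {
            this.targetPartition = targetPartition;
            this.type = type;
            this.realKey = realKey;
        }
    }

    /** Mapper side: emit one copy of a broadcast record per reducer. */
    static List<CompositeKey> broadcast(String realKey, int numReducers) {
        List<CompositeKey> copies = new ArrayList<>();
        for (int p = 0; p < numReducers; p++) {
            copies.add(new CompositeKey(p, CompositeKey.METADATA, realKey));
        }
        return copies;
    }

    /** Partitioner side: same role as getPartition(key, value, numPartitions). */
    static int getPartition(CompositeKey key, int numPartitions) {
        if (key.type == CompositeKey.METADATA) {
            // Broadcast copies carry their destination in the key itself.
            return key.targetPartition;
        }
        // Real data: hash the real key and clear the sign bit before the modulo.
        return (key.realKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With 4 reducers, broadcast("total", 4) yields four keys whose partitions are 0, 1, 2 and 3, while a DATA key is routed by the hash of its real key alone, regardless of its targetPartition field.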


Source: https://habr.com/ru/post/923522/

