MapReduce Jobs Sort Order

I can see in my MapReduce jobs that the output of the reduce phase is sorted by key.

So if I set the number of reducers to 10, the output directory will contain 10 files, and each of these output files has sorted data.

The reason I ask is that although the data inside each file is sorted, the files themselves are not sorted relative to one another. For example, there are scenarios in which every part-000* file starts at 0 and ends at zzzz, assuming I use Text as the key.

I had assumed the output would be sorted across files as well, i.e. the first file would hold the smallest keys and the last file, part-00009, would hold records with keys near zzzz, or at least > a.

Assume my keys are alphanumeric and evenly distributed.

Can someone shed some light on why this happens?

+6
4 answers

You can get globally sorted output (which is basically what you want) using one of these methods:

  • Use only one reducer (bad idea! This puts too much work on one machine).
  • Write a custom partitioner. The Partitioner is the class that divides the key space in MapReduce. The default partitioner (HashPartitioner) spreads the keys evenly across the reducers by hash, so key order is not preserved from one reducer to the next. Check out this example to write a custom partitioner.
  • Use Pig or Hive on Hadoop to do the sort.
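The custom partitioner option can be sketched in plain Java, without the Hadoop classes. This is only an illustration under stated assumptions: `RangePartitionSketch`, the split points `"h"` and `"q"`, and the sample keys are all hypothetical, and the `TreeSet` stands in for the per-reducer sort that the framework does. In a real job the same `getPartition` logic would go in a subclass of Hadoop's `Partitioner`, and the split points would have to be chosen from the actual key distribution so that the reducers stay balanced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Stdlib-only sketch of range partitioning; not the Hadoop Partitioner API. */
final class RangePartitionSketch {

    /**
     * Returns the index of the reducer that should receive the key.
     * splitPoints must be sorted ascending. Reducer 0 gets keys below
     * splitPoints[0], reducer i gets keys in [splitPoints[i-1], splitPoints[i]),
     * and the last reducer gets everything else.
     */
    static int getPartition(String key, String[] splitPoints) {
        int partition = 0;
        while (partition < splitPoints.length
                && key.compareTo(splitPoints[partition]) >= 0) {
            partition++;
        }
        return partition;
    }

    /** Simulates the shuffle: each reducer sees its own keys in sorted order. */
    static List<TreeSet<String>> shuffle(String[] keys, String[] splitPoints) {
        List<TreeSet<String>> reducers = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            reducers.add(new TreeSet<>());
        }
        for (String key : keys) {
            reducers.get(getPartition(key, splitPoints)).add(key);
        }
        return reducers;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"h", "q"}; // hypothetical boundaries for 3 reducers
        String[] keys = {"zebra", "apple", "mango", "0abc", "kiwi", "grape", "tomato"};
        List<TreeSet<String>> reducers = shuffle(keys, splitPoints);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("part-0000" + i + " " + reducers.get(i));
        }
        // part-00000 [0abc, apple, grape]
        // part-00001 [kiwi, mango]
        // part-00002 [tomato, zebra]
    }
}
```

Because every key in part-00000 is smaller than every key in part-00001, and so on, concatenating the files in order gives one globally sorted result.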

+8
Q: All the files contain sorted data, but the files themselves are not sorted relative to each other.

Ans: HashPartitioner is used by default to split the intermediate output (from the mappers) among the reducers.

Example:

 If the intermediate keys are 3,4,5,6,7,8,9,10,11 then the data will be partitioned into (let's say) Reducer: R1{7,4,10} R2{5,11,8} R3{9,6,3}

So now the output files will have

 Part-00000 {4,7,10} Part-00001 {5,8,11} Part-00002 {3,6,9}

If you are looking to sort by value instead, see this answer.
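The effect can be reproduced with a small stdlib-only simulation. The `getPartition` formula below is the one Hadoop's HashPartitioner uses; everything else (`HashPartitionSketch`, the `shuffle` helper, the `TreeSet` standing in for the per-reducer sort) is a hypothetical illustration. It assumes the key's `hashCode()` is the integer itself, as with `java.lang.Integer`, so the exact reducer assignment differs from the "let's say" grouping above, but the pattern is the same.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Stdlib-only sketch of hash partitioning; not the Hadoop classes themselves. */
final class HashPartitionSketch {

    /** Same formula as HashPartitioner: hash of the key modulo the reducer count. */
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Simulates the shuffle: keys are scattered by hash, then sorted per reducer. */
    static List<TreeSet<Integer>> shuffle(int[] keys, int numReduceTasks) {
        List<TreeSet<Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            reducers.add(new TreeSet<>());
        }
        for (int key : keys) {
            reducers.get(getPartition(key, numReduceTasks)).add(key);
        }
        return reducers;
    }

    public static void main(String[] args) {
        int[] keys = {3, 4, 5, 6, 7, 8, 9, 10, 11};
        List<TreeSet<Integer>> reducers = shuffle(keys, 3);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("part-0000" + i + " " + reducers.get(i));
        }
        // part-00000 [3, 6, 9]
        // part-00001 [4, 7, 10]
        // part-00002 [5, 8, 11]
    }
}
```

Each file is sorted, yet every file spans almost the whole key range, which is exactly the behavior described in the question.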

0

In Hive, ORDER BY uses a single reducer, so you can use DISTRIBUTE BY / SORT BY instead, and then run INSERT OVERWRITE LOCAL DIRECTORY from the sorted table to write the data out to a file.

0

General sorting

All key-value pairs for a particular key go to one specific reducer. This is decided by the Partitioner on the map side. Map-side combiners act as semi-reducers, pre-aggregating the values for a key before they are sent to the reducer. HashPartitioner is the default partitioner; it picks the reducer for a key from the key's hash and the number of reducers.

Each reducer writes a single output file, with all of its output sorted by key.

Secondary sorting

Secondary sorting is used to define how the map output keys are sorted. It is applied to the map output. With it we can control the order of the values as well as the keys. The sort can be performed on two or more fields.
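A minimal stdlib-only sketch of the idea, assuming a hypothetical composite key made of the natural key plus the field the values should be ordered by. `SecondarySortSketch`, `CompositeKey`, and `SORT_ORDER` are illustrative names, not Hadoop classes. In a real job the comparator would be registered with `Job.setSortComparatorClass`, with a grouping comparator and a partitioner that look only at the natural key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Stdlib-only sketch of a secondary sort on a composite key. */
final class SecondarySortSketch {

    /** Hypothetical composite key: the natural key plus the field to order values by. */
    static final class CompositeKey {
        final String naturalKey;
        final int value;

        CompositeKey(String naturalKey, int value) {
            this.naturalKey = naturalKey;
            this.value = value;
        }

        @Override
        public String toString() {
            return naturalKey + ":" + value;
        }
    }

    /** Sort order: natural key first, then the value field within each key. */
    static final Comparator<CompositeKey> SORT_ORDER =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                      .thenComparingInt(k -> k.value);

    /** Returns a sorted copy of the map output, leaving the input untouched. */
    static List<CompositeKey> sort(List<CompositeKey> mapOutput) {
        List<CompositeKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<CompositeKey> mapOutput = Arrays.asList(
                new CompositeKey("b", 2), new CompositeKey("a", 9),
                new CompositeKey("a", 1), new CompositeKey("b", 1));
        System.out.println(sort(mapOutput)); // [a:1, a:9, b:1, b:2]
    }
}
```

The reducer then receives the values for each natural key already ordered, instead of having to buffer and sort them itself.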

See General Sort Order and Secondary Sort

0