Can I get individually sorted Mapper outputs from Hadoop when using zero reducers?

I have a job in Hadoop 0.20 that needs to operate on large files, one at a time. (It is a preprocessing step to get file-oriented data into a cleaner, line-based format that is more suitable for MapReduce.)

I do not mind how many output files I end up with, but the output of each map task must land in at most one output file, and each output file must be sorted.

  • If I run with numReducers = 0, the job runs quickly and each Mapper writes its own output file, which is fine, but the files are not sorted.
  • If I add one reducer (the plain Reducer.class), this adds an unnecessary global sort into a single file, which takes many hours (much longer than the Map tasks take).
  • If I add several reducers, the results of the individual map tasks get mixed together, so one map's output ends up in several files.
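
To illustrate the third case, here is a minimal sketch of why one map's output gets spread across files once there is more than one reducer. It assumes the default HashPartitioner, whose formula is reproduced in `partitionFor`, and uses plain String keys for simplicity (a real job would use Writable keys such as Text, whose hash codes differ). The class and method names are hypothetical and not part of my job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionDemo {
    // Same formula as Hadoop's default HashPartitioner.getPartition,
    // applied here to String keys (an assumption for illustration).
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups the keys emitted by ONE map task by the reducer
    // (and therefore the output file) that would receive them.
    static Map<Integer, List<String>> spread(List<String> keys, int numReducers) {
        Map<Integer, List<String>> byReducer = new TreeMap<Integer, List<String>>();
        for (String key : keys) {
            int partition = partitionFor(key, numReducers);
            List<String> bucket = byReducer.get(partition);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                byReducer.put(partition, bucket);
            }
            bucket.add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        // Hypothetical keys produced by a single map task.
        List<String> oneMapOutput = Arrays.asList("a", "b", "c", "d");
        // One reducer: everything from this map goes to one partition.
        System.out.println(spread(oneMapOutput, 1));
        // Two reducers: the same map's keys are split across both partitions.
        System.out.println(spread(oneMapOutput, 2));
    }
}
```

With two reducers, the keys from the single map task are split between partitions 0 and 1, which is exactly the mixing I want to avoid.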

Is there any way to persuade Hadoop to perform a map-side sort on the output of each map task without using reducers, or any other way to skip the slow global merge?

+5
4 answers

- . mapper. n , n - . , , .

. - . , , , 5 .

, , - , . , , . , .

, .

+2

- , .

, , mapper , , , ? , id , .

, ?

: Hadoop ( "shuffle" ), , , , , .

+2

, , , . / /.

+1

. - . , , .

, , Combiner . , http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html ( /):

Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

My reading of this is that if you specify an identity reducer as the combiner, then each map's output should be sorted.

0
