I have a Hadoop 0.20 job that needs to process large files one at a time. (It is a preprocessing step that turns file-oriented data into a cleaner, line-based format better suited to MapReduce.)
I don't mind how many output files I end up with, but each map's output must go into at most one output file, and each output file must be sorted.
- If I run with numReducers = 0, the job starts quickly and each mapper writes its own output file, which is fine, but the files are not sorted.
- If I add a single reducer (the plain Reducer.class), that adds an unnecessary global sort into one file, which takes many hours (far longer than the map tasks themselves).
- If I add several reducers, the outputs of individual map tasks are mixed together, so a single map's output ends up spread across several files.
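For reference, the driver I have been varying looks roughly like this (a minimal sketch against the 0.20 API; MyMapper, the types, and the paths are placeholders for my actual job):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PreprocessJob {
        public static void main(String[] args) throws Exception {
            Job job = new Job();
            job.setJarByClass(PreprocessJob.class);
            job.setMapperClass(MyMapper.class); // placeholder mapper
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Case 1: one file per mapper, fast, but the files are unsorted
            job.setNumReduceTasks(0);
            // Case 2: identity reducer forces a slow global sort into one file
            // job.setReducerClass(Reducer.class);
            // job.setNumReduceTasks(1);
            // Case 3: several reducers interleave records from different maps
            // job.setNumReduceTasks(4);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }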
Is it possible to convince Hadoop to sort each map task's output on the map side without using reducers, or is there some other way to skip the slow global merge?
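To illustrate what I mean by sorting on the map side, here is the kind of hand-rolled workaround I would like to avoid writing: a hypothetical mapper (names and parsing are mine) that buffers its own records and emits them sorted from cleanup(), which only works if one split's output fits in memory:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SortingMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private final List<String> lines = new ArrayList<String>();

        @Override
        protected void map(LongWritable offset, Text raw, Context context) {
            // cleanRecord() stands in for the real preprocessing logic
            lines.add(cleanRecord(raw.toString()));
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // emit the buffered records in order; with numReducers = 0,
            // each map task writes its own sorted output file
            Collections.sort(lines);
            for (String line : lines) {
                context.write(new Text(line), NullWritable.get());
            }
        }

        private String cleanRecord(String raw) {
            return raw.trim(); // placeholder for the actual cleanup
        }
    }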