Merge output files after reduction phase

In MapReduce, each reduce task writes its output to a file named part-r-nnnnn, where nnnnn is the partition identifier associated with the reduce task. Does MapReduce merge these files? If so, how?

+69
mapreduce hadoop
Apr 18 '11 at 8:01
10 answers

Instead of doing the file merging yourself, you can delegate the merging of the reduce output files to Hadoop by calling:

hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt 
+108
Apr 20 '11 at 23:50

No, Hadoop does not merge these files. The number of files you get is the same as the number of reduce tasks.

If you need this as input for your next job, don't worry about having separate files. Just specify the entire directory as input for the next job.
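For illustration, a minimal driver fragment (the paths and the nextJob setup are placeholders, not taken from the answer):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 Job nextJob = Job.getInstance(new Configuration(), "next-step");
 // Point the next job at the previous job's whole output directory;
 // every part-r-* file inside it is picked up as input.
 FileInputFormat.addInputPath(nextJob, new Path("/some/where/on/hdfs/job-output"));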

If you need the data outside the cluster, I usually merge it on the receiving end while pulling the data out of the cluster, i.e. something like this:

 hadoop fs -cat /some/where/on/hdfs/job-output/part-r-* > TheCombinedResultOfTheJob.txt 
+24
Apr 18 '11

Here is a function you can use to merge files in HDFS:

 public boolean getMergeInHdfs(String src, String dest) throws IllegalArgumentException, IOException {
     FileSystem fs = FileSystem.get(config);
     Path srcPath = new Path(src);
     Path dstPath = new Path(dest);

     // Check that both paths exist
     if (!(fs.exists(srcPath))) {
         logger.info("Path " + src + " does not exist!");
         return false;
     }
     if (!(fs.exists(dstPath))) {
         logger.info("Path " + dest + " does not exist!");
         return false;
     }

     return FileUtil.copyMerge(fs, srcPath, fs, dstPath, false, config, null);
 }
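A hedged usage sketch: the paths are placeholders, and config and logger are assumed to be fields of the enclosing class, as the method implies. In the copyMerge call, false means the source files are not deleted and null means no separator string is inserted between them; note that FileUtil.copyMerge was deprecated and later removed (Hadoop 3.x), so this targets Hadoop 2.x and earlier.

 // Merge every file under the source directory into the destination path;
 // the method returns false if either path is missing.
 boolean merged = getMergeInHdfs("/output/dir/on/hdfs", "/merged/output/file.txt");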
+8
Jul 02 '15 at 15:29

For text files only, with HDFS as both the source and the destination, use the following command:

hadoop fs -cat /input_hdfs_dir/* | hadoop fs -put - /output_hdfs_file

This will concatenate all the files in input_hdfs_dir and write the output back to HDFS at output_hdfs_file . Keep in mind that all the data is brought back to the local system and then uploaded to HDFS again; no temporary files are created, though, and this happens on the fly using a UNIX pipe.

In addition, this will not work with non-text files such as Avro, ORC, etc.

For binary files, you can do something like this (if you have Hive tables mapped onto the directories):

insert overwrite table tbl select * from tbl

Depending on your configuration, this may also create more than one file. To create a single file, either set the number of reducers to 1 explicitly with mapreduce.job.reduces=1, or set the Hive property hive.merge.mapredfiles=true.

+6
Sep 16 '15 at 15:46

You can run an additional map/reduce job in which the map and the reduce do not change the data, and the partitioner assigns all of the data to a single reducer.
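A minimal sketch of such a pass-through job, assuming the new mapreduce API (class name, paths, and the input format are placeholders). With the number of reduce tasks set to 1, the custom partitioner the answer mentions becomes unnecessary, since every record lands in the single reducer anyway:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class MergeJob {
     public static void main(String[] args) throws Exception {
         Job job = Job.getInstance(new Configuration(), "merge-only");
         job.setJarByClass(MergeJob.class);
         // The base Mapper and Reducer classes pass records through unchanged.
         job.setMapperClass(Mapper.class);
         job.setReducerClass(Reducer.class);
         // One reduce task: everything ends up in a single part-r-00000 file.
         job.setNumReduceTasks(1);
         job.setInputFormatClass(KeyValueTextInputFormat.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(Text.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }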

+3
Apr 18 '11 at 9:19 a.m.

The part-r-nnnnn files are generated after the reduce phase, which is what the "r" in the middle indicates. Now the facts: if you have a single reducer, you will have one output file, for example part-r-00000. If the number of reducers is 2, then you will have part-r-00000 and part-r-00001, and so on. If the output file is too large to fit into a machine's memory, the file splits, since the Hadoop framework was designed to run on commodity machines. According to MRv1, you have a limit of 20 reducers for your logic. You may have more, but that needs to be configured in the mapred-site.xml configuration file.

Coming to your question: you can either use getmerge, or you can set the number of reducers to 1 by putting the following statement in the driver code

 job.setNumReduceTasks(1); 

Hope this answers your question.

+3
Oct 27 '15 at 5:47

In addition to my previous answer, I have one more answer for you, which I tried out a few minutes ago. You can use a custom OutputFormat that looks like the code below

 public class VictorOutputFormat extends FileOutputFormat<StudentKey,PassValue> {

     @Override
     public RecordWriter<StudentKey,PassValue> getRecordWriter(TaskAttemptContext tac)
             throws IOException, InterruptedException {
         //step 1: GET THE CURRENT PATH
         Path currPath = FileOutputFormat.getOutputPath(tac);

         //Create the full path
         Path fullPath = new Path(currPath, "Aniruddha.txt");

         //create the file in the file system
         FileSystem fs = currPath.getFileSystem(tac.getConfiguration());
         FSDataOutputStream fileOut = fs.create(fullPath, tac);
         return new VictorRecordWriter(fileOut);
     }
 }

Just take a look at the line that builds fullPath: I used my own name as the output file name, and I tested the program with 15 reducers. Still, the file stays the same. So getting a single output file instead of two or more is possible, but to be very clear: the size of the output file must not exceed the size of the primary memory, i.e. the output file must fit into the memory of a commodity machine, otherwise there may be a problem with splitting the output file. Thanks!!
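The VictorRecordWriter returned above is not shown in the answer. A minimal sketch of what such a RecordWriter might look like, assuming StudentKey and PassValue are the answer's own Writable types and that one key<TAB>value text line per record is wanted:

 import java.io.IOException;

 import org.apache.hadoop.fs.FSDataOutputStream;
 import org.apache.hadoop.mapreduce.RecordWriter;
 import org.apache.hadoop.mapreduce.TaskAttemptContext;

 public class VictorRecordWriter extends RecordWriter<StudentKey, PassValue> {

     private final FSDataOutputStream out;

     public VictorRecordWriter(FSDataOutputStream out) {
         this.out = out;
     }

     @Override
     public void write(StudentKey key, PassValue value) throws IOException {
         // One tab-separated line per record.
         out.writeBytes(key.toString() + "\t" + value.toString() + "\n");
     }

     @Override
     public void close(TaskAttemptContext context) throws IOException {
         out.close();
     }
 }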

+1
Oct 27 '15 at 10:20

Why not use a Pig script like this one to merge the partition files:

 stuff = load '/path/to/dir/*';
 store stuff into '/path/to/mergedir';
0
Dec 21 '13 at 3:40

If the files have headers, you can get rid of them like this:

 hadoop fs -cat /path/to/hdfs/job-output/part-* | grep -v "header" > output.csv 

then add the header to output.csv manually.

0
Jan 18 '17 at 18:12

Does MapReduce merge these files?

No. It does not merge them.

You can use IdentityReducer to achieve your goal.

Performs no reduction, writing all input values directly to the output.

 public void reduce(K key, Iterator<V> values, OutputCollector<K,V> output, Reporter reporter) throws IOException 

Writes all keys and values directly to output.
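IdentityReducer lives in the old mapred API, so a hedged configuration sketch would look roughly like this (the driver class is a placeholder); combined with a single reduce task it yields one part file:

 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.lib.IdentityMapper;
 import org.apache.hadoop.mapred.lib.IdentityReducer;

 // Old-API job configuration: both mapper and reducer pass records
 // through unchanged; a single reduce task means a single output file.
 JobConf conf = new JobConf(MyDriver.class);   // MyDriver is a placeholder
 conf.setMapperClass(IdentityMapper.class);
 conf.setReducerClass(IdentityReducer.class);
 conf.setNumReduceTasks(1);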

Take a look at the related posts:

hadoop: difference between 0 reducer and identity reducer?

0
Jan 19 '17 at 2:42


