How to use MultipleTextOutputFormat using the new Hadoop API?

I would like to write multiple output files. How do I do this using Job instead of JobConf?

+4
2 answers

The docs say to use org.apache.hadoop.mapreduce.lib.output.MultipleOutputs instead.

Below is a snippet of code that uses MultipleOutputs. Unfortunately, I didn't write it and haven't spent much time with it, so I can't explain exactly why everything works. I'm sharing it in the hope that it helps. :)

Job setup

    job.setJobName("Job Name");
    job.setJarByClass(ETLManager.class);
    job.setMapOutputKeyClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(MyThing.class);
    job.setMapperClass(MyThingMapper.class);
    job.setReducerClass(MyThingReducer.class);
    MultipleOutputs.addNamedOutput(job, Constants.MyThing_NAMED_OUTPUT,
        TextOutputFormat.class, NullWritable.class, Text.class);
    job.setInputFormatClass(MyInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(conf.get("input")));
    FileOutputFormat.setOutputPath(job, new Path(String.format("%s/%s",
        conf.get("output"), Constants.MyThing_NAMED_OUTPUT)));

Reducer setup

    public class MyThingReducer extends
            Reducer<Text, MyThing, NullWritable, NullWritable> {

        private MultipleOutputs m_multipleOutputs;

        @Override
        public void setup(Context context) {
            m_multipleOutputs = new MultipleOutputs(context);
        }

        @Override
        public void cleanup(Context context)
                throws IOException, InterruptedException {
            if (m_multipleOutputs != null) {
                m_multipleOutputs.close();
            }
        }

        @Override
        public void reduce(Text key, Iterable<MyThing> values, Context context)
                throws IOException, InterruptedException {
            for (MyThing myThing : values) {
                m_multipleOutputs.write(Constants.MyThing_NAMED_OUTPUT,
                    EMPTY_KEY,
                    generateData(context, myThing),
                    generateFileName(context, myThing));
                context.progress();
            }
        }
    }
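The reducer above relies on helpers (EMPTY_KEY, generateData, generateFileName) that the snippet doesn't show. Purely as an illustration of what they might do, here are hypothetical stand-ins; the names are taken from the snippet, but the signatures are guesses (the real ones presumably take the Context and a MyThing), and plain Strings are used so the logic is easy to see:

```java
// Hypothetical sketches of the helpers the reducer calls.
public class MyThingHelpers {

    // Format a record's fields as a tab-separated line of output text.
    static String generateData(String field1, String field2) {
        return field1 + "\t" + field2;
    }

    // Build a per-record base output path; MultipleOutputs creates the
    // file under the job's output directory, e.g. "cupertino/part-r-00000".
    static String generateFileName(String category) {
        return category + "/part";
    }
}
```

The key idea is that the base path returned here becomes a subdirectory of the job output directory, which is how records get routed to different files.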

EDIT: Added link to MultipleOutputs.

+1

An easy way to create key-based output file names:

    // input data
    // key        value
    cupertino     apple
    sunnyvale     banana
    cupertino     pear

MultipleTextOutputFormat subclass

    static class KeyBasedMultipleTextOutputFormat
            extends MultipleTextOutputFormat<Text, Text> {

        @Override
        protected String generateFileNameForKeyValue(Text key, Text value,
                String name) {
            return key.toString() + "/" + name;
        }
    }

Job configuration

    job.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);

Run this code and you will see the following files in HDFS, where /output is the job's output directory:

    $ hadoop fs -ls /output
    /output/cupertino/part-00000
    /output/sunnyvale/part-00000
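Note that MultipleTextOutputFormat only exists in the old org.apache.hadoop.mapred API, so the above requires JobConf. To get the same key-named directories with the new Job-based API (what the question asks for), you can pass a base output path to MultipleOutputs.write() in the reducer. A sketch, assuming Text keys and values (adapt the types to your job, and remember to call MultipleOutputs.addNamedOutput or use LazyOutputFormat as appropriate for your setup):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class KeyBasedReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> out;

    // "cupertino" -> "cupertino/part"; the path is interpreted
    // relative to the job's output directory.
    static String basePath(String key) {
        return key + "/part";
    }

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // write(key, value, baseOutputPath): records land in
            // <output>/<key>/part-r-NNNNN instead of the default file.
            out.write(key, value, basePath(key.toString()));
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        out.close();
    }
}
```

This produces the same /output/cupertino/... and /output/sunnyvale/... layout as the old-API answer, but with Job instead of JobConf.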

Hope this helps.

+1
