How to use MultipleTextOutputFormat using the new Hadoop API?

I would like to write multiple output files. How do I do this using Job instead of JobConf?

+4
2 answers

The docs say to use org.apache.hadoop.mapreduce.lib.output.MultipleOutputs instead.

Below is a snippet of code that uses MultipleOutputs. Unfortunately, I didn't write it and haven't spent much time with it, so I can't explain exactly why everything works. I'm sharing it in the hope that it helps. :)

Job setup

    job.setJobName("Job Name");
    job.setJarByClass(ETLManager.class);
    job.setMapOutputKeyClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(MyThing.class);
    job.setMapperClass(MyThingMapper.class);
    job.setReducerClass(MyThingReducer.class);
    MultipleOutputs.addNamedOutput(job, Constants.MyThing_NAMED_OUTPUT,
        TextOutputFormat.class, NullWritable.class, Text.class);
    job.setInputFormatClass(MyInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(conf.get("input")));
    FileOutputFormat.setOutputPath(job, new Path(String.format("%s/%s",
        conf.get("output"), Constants.MyThing_NAMED_OUTPUT)));

Reducer setup

    public class MyThingReducer extends
            Reducer<Text, MyThing, NullWritable, NullWritable> {

        private MultipleOutputs m_multipleOutputs;

        @Override
        public void setup(Context context) {
            m_multipleOutputs = new MultipleOutputs(context);
        }

        @Override
        public void cleanup(Context context)
                throws IOException, InterruptedException {
            if (m_multipleOutputs != null) {
                m_multipleOutputs.close();
            }
        }

        @Override
        public void reduce(Text key, Iterable<MyThing> values, Context context)
                throws IOException, InterruptedException {
            for (MyThing myThing : values) {
                m_multipleOutputs.write(Constants.MyThing_NAMED_OUTPUT,
                    EMPTY_KEY,
                    generateData(context, myThing),
                    generateFileName(context, myThing));
                context.progress();
            }
        }
    }
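The reducer above relies on helpers (EMPTY_KEY, generateData, generateFileName) that the snippet doesn't show. Purely as an illustration of what they might do, here are hypothetical stand-ins; the names are taken from the snippet, but the signatures are guesses (the real ones presumably take the Context and a MyThing), and plain Strings are used so the logic is easy to see:

```java
// Hypothetical sketches of the helpers the reducer calls.
public class MyThingHelpers {

    // Format a record's fields as a tab-separated line of output text.
    static String generateData(String field1, String field2) {
        return field1 + "\t" + field2;
    }

    // Build a per-record base output path; MultipleOutputs creates the
    // file under the job's output directory, e.g. "cupertino/part-r-00000".
    static String generateFileName(String category) {
        return category + "/part";
    }
}
```

The key idea is that the base path returned here becomes a subdirectory of the job output directory, which is how records get routed to different files.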

EDIT: Added link to MultipleOutputs.

+1

An easy way to create key-based output file names:

    // input data
    // key        value
    cupertino     apple
    sunnyvale     banana
    cupertino     pear

MultipleTextOutputFormat subclass

    static class KeyBasedMultipleTextOutputFormat
            extends MultipleTextOutputFormat<Text, Text> {

        @Override
        protected String generateFileNameForKeyValue(Text key, Text value,
                String name) {
            return key.toString() + "/" + name;
        }
    }

Job configuration

    job.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);

Run this code and you will see the following files in HDFS, where /output is the job's output directory:

    $ hadoop fs -ls /output
    /output/cupertino/part-00000
    /output/sunnyvale/part-00000
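Note that MultipleTextOutputFormat only exists in the old org.apache.hadoop.mapred API, so the above requires JobConf. To get the same key-named directories with the new Job-based API (what the question asks for), you can pass a base output path to MultipleOutputs.write() in the reducer. A sketch, assuming Text keys and values (adapt the types to your job, and remember to call MultipleOutputs.addNamedOutput or use LazyOutputFormat as appropriate for your setup):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class KeyBasedReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> out;

    // "cupertino" -> "cupertino/part"; the path is interpreted
    // relative to the job's output directory.
    static String basePath(String key) {
        return key + "/part";
    }

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // write(key, value, baseOutputPath): records land in
            // <output>/<key>/part-r-NNNNN instead of the default file.
            out.write(key, value, basePath(key.toString()));
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        out.close();
    }
}
```

This produces the same /output/cupertino/... and /output/sunnyvale/... layout as the old-API answer, but with Job instead of JobConf.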

Hope this helps.

+1
