Here is an implementation you can use:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

@SuppressWarnings("deprecation")
public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf,
            Reporter reporter) throws IOException {
        // CombineFileRecordReader creates one myCombineFileRecordReader per file
        // packed into the CombineFileSplit.
        return new CombineFileRecordReader(conf, (CombineFileSplit) split, reporter,
                (Class) myCombineFileRecordReader.class);
    }

    public static class myCombineFileRecordReader implements RecordReader<LongWritable, Text> {

        private final LineRecordReader linerecord;

        public myCombineFileRecordReader(CombineFileSplit split, Configuration conf,
                Reporter reporter, Integer index) throws IOException {
            // Rebuild a plain FileSplit for the file at position 'index' inside the
            // combined split and let a standard LineRecordReader read it.
            FileSplit filesplit = new FileSplit(split.getPath(index), split.getOffset(index),
                    split.getLength(index), split.getLocations());
            linerecord = new LineRecordReader(conf, filesplit);
        }

        @Override
        public void close() throws IOException {
            linerecord.close();
        }

        // The remaining RecordReader methods simply delegate to the wrapped LineRecordReader.

        @Override
        public LongWritable createKey() {
            return linerecord.createKey();
        }

        @Override
        public Text createValue() {
            return linerecord.createValue();
        }

        @Override
        public long getPos() throws IOException {
            return linerecord.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return linerecord.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            return linerecord.next(key, value);
        }
    }
}
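Each CombineFileSplit packs chunks of several small files into a single split, and CombineFileRecordReader constructs one myCombineFileRecordReader per chunk (the index argument selects the chunk), so your mapper still receives ordinary <LongWritable, Text> records. To actually use the format, register it on the JobConf in your driver, e.g.:

    conf.setInputFormat(CombinedInputFormat.class);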
For your task, first set the mapred.max.split.size parameter to the combined size you want the small input files merged into. Do something like the following in your run() method:
...
if (argument != null) {
    conf.set("mapred.max.split.size", argument);
} else {
    conf.set("mapred.max.split.size", "134217728"); // 134217728 bytes = 128 MB
}
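For context, here is a rough sketch of how the whole driver could look with the old mapred API. The class name SmallFilesDriver, the argument layout (input path, output path, optional split size in bytes), and the ToolRunner wiring are illustrative assumptions on my part, not something taken from your code:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class SmallFilesDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            JobConf conf = new JobConf(getConf(), SmallFilesDriver.class);

            // Optional third argument: maximum combined split size in bytes.
            String argument = (args.length > 2) ? args[2] : null;
            if (argument != null) {
                conf.set("mapred.max.split.size", argument);
            } else {
                conf.set("mapred.max.split.size", "134217728"); // 128 MB default
            }

            // Combine many small files into fewer, larger splits.
            conf.setInputFormat(CombinedInputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new SmallFilesDriver(), args));
        }
    }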
Amar