Hadoop, MapReduce - Multiple I/O Paths

After building the jar for my MapReduce job, I run it with the hadoop-local command. I would like to know whether, instead of specifying the path of every file in my input folder individually, I can simply point the job at the folder and have all the files inside it passed to the job. The contents and the number of files change because of the nature of the MapReduce jobs I am trying to configure, so I do not know the number of files in advance, only their contents. Is there a way to pass all the files in the input folder to my MapReduce program, and then iterate over each file to compute a specific function whose results are sent to the reducer? I use a single Map/Reduce program, and I code in Java. I can use the hadoop-moonshot command, but for now I am working with hadoop-local.

Thanks.

1 answer

You do not need to pass each file individually as input to the MapReduce job.

The FileInputFormat class already provides an API to accept a list of multiple files and directories as input to the map phase.

 public static void setInputPaths(Job job, Path... inputPaths) throws IOException 

Set the given Paths as the list of inputs for the map-reduce job. Parameters:

job - the Job to modify

inputPaths - the Paths of the input directories or files for the map-reduce job

Sample code from the Apache tutorial:

 Job job = Job.getInstance(conf, "word count");
 FileInputFormat.addInputPath(job, new Path(args[0]));
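Because FileInputFormat expands a directory Path to every file inside it, you can point the job at the whole input folder and never enumerate files yourself. A minimal driver sketch under that assumption (the mapper and reducer classes are placeholders for your own; args[0] and args[1] are the input directory and output directory):

```java
// Sketch: a driver that takes a whole input directory as the job input.
// Any file that appears in the directory becomes part of the input splits,
// so the number of files never needs to be known in advance.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DirectoryInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "whole-directory job");
        job.setJarByClass(DirectoryInputDriver.class);
        // Hypothetical mapper/reducer classes; substitute your own:
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // A directory Path expands to all files it contains.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each mapper still receives one input split at a time; the framework, not your code, handles iterating over the files.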

MultipleInputs provides the API below.

 public static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass) 

Adds a Path with a custom InputFormat and Mapper to the list of inputs for the map-reduce job.
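A short sketch of how this is typically wired in a driver, assuming two hypothetical mapper classes (TextMapper and SequenceMapper) and placeholder directory names:

```java
// Sketch: two input paths, each with its own input format and mapper.
// Both mappers must emit the same intermediate key/value types,
// since their output feeds the same reducer.
MultipleInputs.addInputPath(job, new Path("plain-text-dir"),
        TextInputFormat.class, TextMapper.class);
MultipleInputs.addInputPath(job, new Path("sequence-file-dir"),
        SequenceFileInputFormat.class, SequenceMapper.class);
```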

Related SE Question:

Can hadoop enter data from multiple directories and files

Refer to the MultipleOutputs API for your second question, about multiple output paths.

 FileOutputFormat.setOutputPath(job, outDir);
 // Defines additional text based output 'text' for the job
 MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, LongWritable.class, Text.class);
 // Defines additional sequence-file based output 'seq' for the job
 MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class, LongWritable.class, Text.class);
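On the reducer side, MultipleOutputs is instantiated in setup() and written to by the names registered with addNamedOutput(). A hedged sketch, assuming the named outputs "text" and "seq" declared above and Text/LongWritable key/value types:

```java
// Sketch: a reducer writing to the named outputs declared in the driver.
// The names passed to mos.write() must match those given to addNamedOutput().
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultiOutReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private MultipleOutputs<Text, LongWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        mos.write("text", key, new LongWritable(sum)); // goes to the 'text' output
        mos.write("seq", key, new LongWritable(sum));  // goes to the 'seq' output
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closing MultipleOutputs is required, or output files may be incomplete.
        mos.close();
    }
}
```

Note that each named output produces its own set of part files under the job's output directory.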

Take a look at these related SE questions about multiple output files:

Writing to multiple folders in hadoop?

hadoop method to send output to multiple directories

