Write output to different Hadoop folders

  • I want to write two different types of output from one reducer to two different directories.

I can use the MultipleOutputs class in Hadoop to write to different files, but they both end up in the same output folder.

I want each file written from the same code to go to a different folder.

Is there any way to do this?

If I try to pass, for example, "hello/testfile" as the second argument, it reports an invalid argument. Therefore, I cannot write to different folders.

  • If the above is not possible, is it possible to read only certain files from the input folder?

Please help me.

Thanks in advance!


Thanks for the answer. I can read the file successfully using the method above, but not in distributed mode. In the reducer, I have set:

mos.getCollector("data", reporter).collect(new Text(str_key), new Text(str_val));

(using MultipleOutputs). In the JobConf, I tried using

FileInputFormat.setInputPaths(conf2, "/home/users/mlakshm/opchk285/data-r-00000*");

and

FileInputFormat.setInputPaths(conf2, "/home/users/mlakshm/opchk285/data*");

But this leads to the following error:

 cause:org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://mentat.cluster:54310/home/users/mlakshm/opchk295/data-r-00000* matches 0 files 
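As a side note, the error says the glob matched zero files, and the directory in the pattern above (`opchk285`) differs from the one in the error (`opchk295`), which alone could explain the empty match. The glob semantics themselves can be sanity-checked locally with `java.nio`'s glob matcher (the file names below are hypothetical reducer output names, not taken from the cluster):

```java
import java.nio.file.FileSystems;
import java.nio.file.Paths;

public class GlobCheck {
    // True if the file name matches the glob pattern; java.nio's glob
    // matcher has the same basic "*" semantics as an HDFS input path glob.
    static boolean matchesGlob(String glob, String fileName) {
        return FileSystems.getDefault()
                .getPathMatcher("glob:" + glob)
                .matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        // Hypothetical MultipleOutputs reducer output names.
        System.out.println(matchesGlob("data*", "data-r-00000")); // true
        System.out.println(matchesGlob("data*", "part-r-00000")); // false
    }
}
```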
4 answers

Copy the MultipleOutputs code into your code base and relax the restriction on valid characters. I don't see any reason for the restriction anyway.
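For reference, the validation that rejects a name like "hello/testfile" is essentially a letters-and-digits-only check on the named output. A standalone sketch of that check (an approximation of the behavior, not Hadoop's actual source):

```java
public class NamedOutputCheck {
    // Approximation of MultipleOutputs' named-output validation: only
    // letters and digits are accepted, which is why "hello/testfile"
    // is rejected as an invalid name. Relaxing this check (e.g. also
    // allowing '/') is what the answer above suggests.
    static boolean isValidName(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (char c : name.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidName("data"));           // true
        System.out.println(isValidName("hello/testfile")); // false
    }
}
```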


Question 1: Writing output files to different directories. You can do this using one of the following approaches:

1. Using the MultipleOutputs class:

You can create multiple named output files using MultipleOutputs. Each named output needs to be registered in the driver code:

 MultipleOutputs.addNamedOutput(job, "OutputFileName", OutputFormatClass, keyClass, valueClass); 

The API provides two overloaded write methods to achieve this.

 multipleOutputs.write("OutputFileName", new Text(key), new Text(value)); 

Now, to write the output file to separate output directories, you need to use the overloaded write method with an additional parameter for the base output path.

 multipleOutputs.write("OutputFileName", new Text(key), new Text(value), baseOutputPath); 

Remember to use a different baseOutputPath for each named output.

2. Rename / move the file in the driver class:

This is probably the easiest hack to write output to multiple directories. Use MultipleOutputs and write all output files to one output directory, but make sure the file names are different for each category.

Suppose you want to create 3 different sets of output files, the first step is to register the named output files in the driver:

 MultipleOutputs.addNamedOutput(job, "set1", OutputFormatClass, keyClass, valueClass);
 MultipleOutputs.addNamedOutput(job, "set2", OutputFormatClass, keyClass, valueClass);
 MultipleOutputs.addNamedOutput(job, "set3", OutputFormatClass, keyClass, valueClass);

In addition, create the various output directories or directory structure that you want in the driver code, along with the actual output directory:

 Path set1Path = new Path("/hdfsRoot/outputs/set1");
 Path set2Path = new Path("/hdfsRoot/outputs/set2");
 Path set3Path = new Path("/hdfsRoot/outputs/set3");

The final important step is to move the output files to the right directories, based on their names, once the job has completed successfully:

 FileSystem fileSystem = FileSystem.get(new Configuration());
 if (jobStatus == 0) {
     // Get the output files from the actual output path
     FileStatus outputfs[] = fileSystem.listStatus(outputPath);
     // Iterate over all the files in the output path
     for (int fileCounter = 0; fileCounter < outputfs.length; fileCounter++) {
         // Based on each file name, move the file to the matching directory.
         if (outputfs[fileCounter].getPath().getName().contains("set1")) {
             fileSystem.rename(outputfs[fileCounter].getPath(),
                               new Path(set1Path + "/" + anyNewFileName));
         } else if (outputfs[fileCounter].getPath().getName().contains("set2")) {
             fileSystem.rename(outputfs[fileCounter].getPath(),
                               new Path(set2Path + "/" + anyNewFileName));
         } else if (outputfs[fileCounter].getPath().getName().contains("set3")) {
             fileSystem.rename(outputfs[fileCounter].getPath(),
                               new Path(set3Path + "/" + anyNewFileName));
         }
     }
 }

Note: this will not add significant overhead to the job, because we only MOVE files from one directory to another. Which approach to choose depends on the nature of your implementation.

Thus, this approach basically writes all the output files under different names to the same output directory, and when the job completes successfully, we move them from the base output path into the different output directories.
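The same move-by-name step can be sketched with plain `java.nio` on a local filesystem, so the flow is easy to test outside HDFS (directory and file names here are hypothetical, mirroring the HDFS rename loop):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class MoveByName {
    // Move every regular file in outDir whose name contains the given tag
    // (e.g. "set1") into the matching target directory, creating it first.
    static void moveByTag(Path outDir, String tag, Path target) throws IOException {
        Files.createDirectories(target);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(outDir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f) && f.getFileName().toString().contains(tag)) {
                    Files.move(f, target.resolve(f.getFileName()));
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("out");
        Files.createFile(out.resolve("set1-r-00000"));
        Files.createFile(out.resolve("set2-r-00000"));
        moveByTag(out, "set1", out.resolve("set1dir"));
        System.out.println(Files.exists(out.resolve("set1dir").resolve("set1-r-00000"))); // true
        System.out.println(Files.exists(out.resolve("set2-r-00000")));                    // true
    }
}
```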

Question 2: Reading specific files from input folders:

You can definitely read specific input files from a directory using the MultipleInputs class.

Based on your input paths / file names, you can route each input file to the appropriate Mapper implementation.

Case 1: If all the input files are in the same directory:

 FileStatus inputfs[] = fileSystem.listStatus(inputPath);
 for (int fileCounter = 0; fileCounter < inputfs.length; fileCounter++) {
     if (inputfs[fileCounter].getPath().getName().contains("set1")) {
         MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(),
                                     TextInputFormat.class, Set1Mapper.class);
     } else if (inputfs[fileCounter].getPath().getName().contains("set2")) {
         MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(),
                                     TextInputFormat.class, Set2Mapper.class);
     } else if (inputfs[fileCounter].getPath().getName().contains("set3")) {
         MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(),
                                     TextInputFormat.class, Set3Mapper.class);
     }
 }

Case 2: If the input files are not all in the same directory:

In principle, we can use the same approach even when the input files are in different directories: walk the base input path and check each file path name against your criteria.

Or, if the files are in entirely different places, the easiest way is to add each input path individually:

 MultipleInputs.addInputPath(job, Set1_Path, TextInputFormat.class, Set1Mapper.class);
 MultipleInputs.addInputPath(job, Set2_Path, TextInputFormat.class, Set2Mapper.class);
 MultipleInputs.addInputPath(job, Set3_Path, TextInputFormat.class, Set3Mapper.class);

Hope this helps! Thanks.


Yes, you can specify that the input format only processes certain files:

 FileInputFormat.setInputPaths(job, "/path/to/folder/testfile*"); 

If you make changes to the code, remember that the _SUCCESS file should be written to both folders upon successful completion of the job. While this is not a requirement, it is the mechanism by which someone can determine whether the output in that folder is complete, rather than truncated due to an error.
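The _SUCCESS marker is just an empty file in the output folder, so recreating it after a move is a one-liner per folder. A minimal local sketch (folder names hypothetical; on a cluster you would use the HDFS FileSystem API instead of `java.nio`):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SuccessMarker {
    // Create an empty _SUCCESS file in the given folder so downstream
    // readers can tell the output there is complete, not truncated.
    static void markSuccess(Path folder) throws IOException {
        Files.createDirectories(folder);
        Path marker = folder.resolve("_SUCCESS");
        if (!Files.exists(marker)) {
            Files.createFile(marker);
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("out");
        // After splitting output into two folders, mark both.
        markSuccess(base.resolve("set1"));
        markSuccess(base.resolve("set2"));
        System.out.println(Files.exists(base.resolve("set1").resolve("_SUCCESS"))); // true
    }
}
```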


Yes, you can do it. All you have to do is generate a file name for each key/value pair coming out of the reducer.

If you override MultipleTextOutputFormat's generateFileNameForKeyValue method, you can return a file name that depends on the key/value pair you get. Here is a link that shows you how to do this:

https://sites.google.com/site/hadoopandhive/home/how-to-write-output-to-multiple-named-files-in-hadoop-using-multipletextoutputformat
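A standalone sketch of the naming rule such an override might implement (key prefixes and folder names here are hypothetical). In Hadoop this logic would live in a MultipleTextOutputFormat subclass; returning a name containing "/" makes the output land in a subdirectory of the job's output path:

```java
public class KeyBasedNaming {
    // Same shape as MultipleTextOutputFormat.generateFileNameForKeyValue:
    // given the record's key, value, and the default leaf name (e.g.
    // "part-00000"), return the relative path to write the record to.
    static String generateFileNameForKeyValue(String key, String value, String leafName) {
        // Hypothetical routing rule: keys starting with "err" go to an
        // "errors" subfolder, everything else to "data".
        String folder = key.startsWith("err") ? "errors" : "data";
        return folder + "/" + leafName;
    }

    public static void main(String[] args) {
        System.out.println(generateFileNameForKeyValue("err42", "v", "part-00000")); // errors/part-00000
        System.out.println(generateFileNameForKeyValue("ok7", "v", "part-00000"));   // data/part-00000
    }
}
```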

