Hadoop working with input files from several directories


I have a situation where I have several hundred (100+, each 2-3 MB) gzip-compressed files spread across several directories. For example:
A1/B1/C1/part-0000.gz
A2/B2/C2/part-0000.gz
A1/B1/C1/part-0001.gz

I need to feed all of these files into a single map job. From what I can see, to use MultiFileInputFormat all input files need to be in the same directory. Is it possible to pass multiple directories directly to the job?
If not, is it possible to efficiently move these files into one directory without naming conflicts, or to merge them into a single compressed gz file?
Note: I am using plain Java to implement the Mapper, and am not using Pig or Hadoop Streaming.

Any help on the above issue would be greatly appreciated. Thanks,
Ankit

1 answer

FileInputFormat.addInputPaths() accepts a comma-separated list of files and/or directories, for example:

FileInputFormat.addInputPaths(job, "foo/file1.gz,bar/file2.gz")
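Since the method takes a single comma-separated string, in a driver you would typically build that string from your directory list and hand it to the job. A minimal sketch, assuming the directory names from the question; the actual Hadoop call is shown in a comment because it requires a configured `Job` instance:

```java
import java.util.Arrays;
import java.util.List;

public class InputPathsDemo {

    // Join a list of input directories into the comma-separated
    // form that FileInputFormat.addInputPaths(job, paths) expects.
    static String commaSeparated(List<String> dirs) {
        return String.join(",", dirs);
    }

    public static void main(String[] args) {
        // Hypothetical directories matching the layout in the question.
        List<String> dirs = Arrays.asList("A1/B1/C1", "A2/B2/C2");
        String paths = commaSeparated(dirs);

        // In the real driver this string would be passed to Hadoop:
        // FileInputFormat.addInputPaths(job, paths);
        System.out.println(paths); // A1/B1/C1,A2/B2/C2
    }
}
```

Note also that `FileInputFormat` expands glob patterns in input paths, so a single pattern such as `A*/B*/C*` can cover all the directories, and passing a directory includes the files inside it.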

Source: https://habr.com/ru/post/650664/
