Hadoop working with input files from several directories


I have a situation where I have several hundred (100+, each 2-3 MB) gzip-compressed files spread across several directories. For example:
A1/B1/C1/part-0000.gz
A2/B2/C2/part-0000.gz
A1/B1/C1/part-0001.gz

I need to feed all of these files into a single map job. From what I can see, to use MultiFileInputFormat all input files need to be in the same directory. Is it possible to pass multiple directories directly to the job?
If not, is it possible to efficiently move these files into one directory without naming conflicts, or to merge them into a single compressed gz file?
Note: I am using plain Java to implement the Mapper, and am not using Pig or Hadoop Streaming.

Any help on the above issue would be greatly appreciated. Thanks,
Ankit

1 answer

FileInputFormat.addInputPaths() accepts a comma-separated list of files and/or directories, for example:

FileInputFormat.addInputPaths(job, "foo/file1.gz,bar/file2.gz")
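Since the method takes a single comma-separated string, in a driver you would typically build that string from your directory list and hand it to the job. A minimal sketch, assuming the directory names from the question; the actual Hadoop call is shown in a comment because it requires a configured `Job` instance:

```java
import java.util.Arrays;
import java.util.List;

public class InputPathsDemo {

    // Join a list of input directories into the comma-separated
    // form that FileInputFormat.addInputPaths(job, paths) expects.
    static String commaSeparated(List<String> dirs) {
        return String.join(",", dirs);
    }

    public static void main(String[] args) {
        // Hypothetical directories matching the layout in the question.
        List<String> dirs = Arrays.asList("A1/B1/C1", "A2/B2/C2");
        String paths = commaSeparated(dirs);

        // In the real driver this string would be passed to Hadoop:
        // FileInputFormat.addInputPaths(job, paths);
        System.out.println(paths); // A1/B1/C1,A2/B2/C2
    }
}
```

Note also that `FileInputFormat` expands glob patterns in input paths, so a single pattern such as `A*/B*/C*` can cover all the directories, and passing a directory includes the files inside it.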

Source: https://habr.com/ru/post/650664/
