Hadoop: Does CombineFileInputFormat give a performance improvement for small files?

I am new to Hadoop and running some tests on my local machine.

There are many suggested solutions for dealing with a large number of small files. I am using CombinedInputFormat, which extends CombineFileInputFormat.
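For context, a minimal CombineFileInputFormat subclass of this kind typically looks something like the sketch below (this mirrors how Hadoop's built-in CombineTextInputFormat wraps a per-file record reader; the exact class in my code may differ):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Packs many small files into each input split; every file inside a split is
// still read line by line through the wrapped TextInputFormat record reader.
public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, TextRecordReaderWrapper.class);
    }

    // CombineFileRecordReader instantiates this wrapper once per file in the combined split.
    private static class TextRecordReaderWrapper
            extends CombineFileRecordReaderWrapper<LongWritable, Text> {
        public TextRecordReaderWrapper(CombineFileSplit split, TaskAttemptContext context, Integer idx)
                throws IOException, InterruptedException {
            super(new TextInputFormat(), split, context, idx);
        }
    }
}
```

(Since Hadoop 2.x, the framework ships CombineTextInputFormat, which already does exactly this for plain text input.)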

With CombinedInputFormat, I see that the number of mappers dropped from 100 to 25. Should I expect any performance improvement since the number of mappers has decreased?

When I ran the map-reduce job over the many small files without CombinedInputFormat: 100 mappers took 10 minutes.

But when the same map-reduce job was run with CombinedInputFormat: 25 mappers took 33 minutes.

Any help would be appreciated.

mapreduce hadoop
1 answer

Hadoop works better with a small number of large files, as opposed to a huge number of small files. ("Small" here means significantly smaller than the Hadoop Distributed File System (HDFS) block size; "huge number" means ranging into the thousands.)

This means that if you have 1,000 small files, a map-reduce job based on the normal TextInputFormat will create 1,000 map tasks, and each of these map tasks needs a certain amount of time to start up and finish. This per-task startup overhead is what reduces performance.
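As an illustration (not the asker's actual driver), a minimal map-only job that packs the small files with the built-in CombineTextInputFormat could look roughly like this; the 128 MB cap on the combined split size is only an example value to tune:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Pack many small files into each input split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Upper bound on the bytes packed into one combined split (example value: 128 MB).
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // Identity map, no reducers: enough to compare how many map tasks are launched.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Fewer, larger splits mean fewer task start-ups, but each map task then processes more data, so the split-size cap is worth tuning rather than letting every small file become its own task.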

In a multi-tenant cluster with limited resources, obtaining a large number of map slots will also be difficult.

Please refer to the link for more details and benchmark results.

