Hadoop works better with a small number of large files than with a huge number of small files. ("Small" here means significantly smaller than the Hadoop Distributed File System (HDFS) block size; "number" means ranging into the thousands.)
This means that if you have 1,000 files of 1 MB each, a MapReduce job based on the standard TextInputFormat will create 1,000 map tasks (one per file), and each map task takes a certain amount of time to start and finish. This per-task startup latency can reduce job performance.
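As a minimal sketch (not part of the original answer), the driver below shows the default behaviour being described: with TextInputFormat every file yields at least one input split, so a directory of many small files produces that many map tasks. The class name, job name, and input/output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SmallFilesJob {

    // Identity-style mapper: passes each input line through unchanged.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-files-demo");
        job.setJarByClass(SmallFilesJob.class);

        // Default input format: one split (and therefore one map task) per
        // file, since TextInputFormat never combines files into a single split.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only job, enough to show task counts
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Hypothetical input directory containing many ~1 MB files.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Running this against a directory of 1,000 small files and checking the job counters would show roughly 1,000 launched map tasks, which is where the startup overhead described above accumulates.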
In a multi-tenant cluster with limited resources, it will also be difficult to obtain a large number of map slots.
See the link for more information and test results.