What is the ideal number of reducers in Hadoop?

According to the Hadoop wiki, the formula for the ideal number of reducers is 0.95 or 1.75 * (nodes * mapred.tasktracker.reduce.tasks.maximum),

but when should one choose 0.95 and when 1.75? What considerations went into picking these two factors?

1 answer

Say you have 100 reduce slots available in your cluster.

With a load factor of 0.95, all 95 reduce tasks will start at the same time, since there are enough reduce slots for all of them. This means no task waits in the queue for another to finish. I would recommend this option when the reduce tasks are "small", i.e. they finish relatively quickly, or when they all take roughly the same amount of time.

On the other hand, with a load factor of 1.75, 100 reduce tasks will start at the same time (as many as there are slots available), and the remaining 75 will wait in the queue until a reduce slot frees up. This gives better load balancing: if some tasks are "heavier" than others, i.e. they take more time, they will not become a bottleneck, because the other reduce slots, instead of finishing their tasks and sitting idle, will pick up the tasks waiting in the queue. It also lightens each reduce task, since the map output is spread across more tasks.
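The arithmetic above can be sketched as a small helper; the cluster sizes here (10 nodes with 10 reduce slots each, giving the 100 slots from the example) are assumptions for illustration:

```python
import math

def ideal_reducers(nodes, reduce_slots_per_node, factor):
    """Hadoop wiki rule of thumb:
    factor * (nodes * mapred.tasktracker.reduce.tasks.maximum)."""
    return math.floor(factor * nodes * reduce_slots_per_node)

total_slots = 10 * 10  # hypothetical cluster: 100 reduce slots

print(ideal_reducers(10, 10, 0.95))  # 95 -> one wave, all tasks run at once
print(ideal_reducers(10, 10, 1.75))  # 175 -> 100 run now, 75 wait in the queue
```

With 0.95 the whole job fits in a single wave of reducers; with 1.75 the second, smaller wave absorbs stragglers, which is where the load-balancing benefit comes from.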

If I may offer my own opinion, I am not sure these factors are always ideal. I often use a factor greater than 1.75 (sometimes even 4 or 5), because I work with large data volumes and the data for each reduce task does not fit on a single machine unless I set the factor higher; the load balancing is also better.

