How are partitions broken down into tasks in Spark?

If I split an RDD into, say, 60 partitions and I have 20 cores spread across 20 machines, i.e. 20 single-core instances, then the number of tasks is 60 (equal to the number of partitions). Why is it beneficial to have one partition per core and thus only 20 tasks?
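For illustration, a minimal sketch (assuming an already-created `SparkContext` named `sc`; the data and numbers are placeholders): an RDD with 60 partitions yields 60 tasks for the stage that scans it, regardless of how many cores are available; 20 cores simply work through them roughly 20 at a time.

```scala
// 60 partitions -> 60 tasks per stage over this RDD
val rdd = sc.parallelize(1 to 6000000, numSlices = 60)
println(rdd.partitions.length)  // 60
rdd.count()                     // this action runs one stage with 60 tasks
```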

In addition, I performed an experiment in which I set the number of sections to 2, checking that 2 tasks are simultaneously displayed in the user interface; however, I was surprised that he switches instances when performing tasks, for example. node1 and node2 perform the first 2 tasks, then node6 and node8 perform the following set of two tasks, etc. I thought, setting the number of partitions to smaller than the kernels (and instances) in the cluster, then the program will use the minimum number of required instances. Can anyone explain this behavior?
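A rough way to reproduce this experiment outside the UI (a sketch only, assuming a `SparkContext` named `sc`; looking up the local hostname inside each task is just an illustrative trick, not a Spark API for executor placement):

```scala
import java.net.InetAddress

// Two partitions -> two tasks per stage; each task reports the host it ran on,
// so the switching between nodes described above becomes visible in the output.
val twoPartitions = sc.parallelize(1 to 1000, numSlices = 2)
val placements = twoPartitions
  .mapPartitionsWithIndex { (idx, records) =>
    Iterator((idx, InetAddress.getLocalHost.getHostName, records.size))
  }
  .collect()
placements.foreach { case (idx, host, n) =>
  println(s"partition $idx ran on $host ($n records)")
}
```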

1 answer

For the first question: you may want more, smaller tasks than strictly necessary so that less data has to be held in memory at the same time. It can also help with fault tolerance, since less work has to be redone in the event of a failure. It is nevertheless just a parameter. In general, the answer depends on the type of workload (IO-bound, memory-bound, CPU-bound).
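To make "it is just a parameter" concrete, a sketch (the value 60 is simply ~3x the 20 cores from the question, in line with the common guideline of 2-3 tasks per core; the app name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.default.parallelism sets how many partitions Spark uses when no
// explicit count is given (e.g. for shuffles and sc.parallelize defaults).
val conf = new SparkConf()
  .setAppName("partition-count-as-a-knob")
  .set("spark.default.parallelism", "60")
val sc = new SparkContext(conf)
```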

Regarding the second question, I believe version 1.3 has some code for dynamically requesting resources. I'm not sure exactly which version the cutoff is in, but older versions just request the exact resources you configure your driver with. As for how a partition moves from one node to another: AFAIK, Spark will pick the data for a task from a node that has a local copy of that data on HDFS. Since HDFS keeps several copies (3 by default) of each data block, there are several candidate nodes for running any given piece of work.
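A hedged sketch of both points (the config keys are standard Spark dynamic-allocation settings, though their exact behavior depends on version and cluster manager; the HDFS path is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dynamic-allocation-and-locality")
  .set("spark.dynamicAllocation.enabled", "true")     // grow/shrink executors on demand
  .set("spark.shuffle.service.enabled", "true")       // required by dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
val sc = new SparkContext(conf)

// For an HDFS-backed RDD, each partition's preferred locations are the nodes
// holding a replica of the underlying block, which is why a task for the same
// partition can legitimately run on several different nodes.
val rdd = sc.textFile("hdfs:///some/placeholder/path")
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} prefers: ${rdd.preferredLocations(p).mkString(", ")}")
}
```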
