Why are there always 200 tasks in groupByKey?

Whenever I do groupByKey on an RDD, it splits into 200 tasks, even when the source table is quite large, for example 2k partitions and tens of millions of rows.

In addition, the operation seems to get stuck on the last two tasks, which take a very long time to compute.

Why 200? How can I increase it, and will it help?

1 answer

This number comes from spark.sql.shuffle.partitions, which sets how many partitions are used for the shuffle in the grouping. It defaults to 200, but it can be increased. Whether that helps depends on your cluster and your data.
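
As a minimal sketch of both ways to change it (the app name, partition count, file path and key extraction below are illustrative placeholders, not taken from your job):

    import org.apache.spark.sql.SparkSession

    // Sketch only: names, the partition count and the path are placeholders.
    val spark = SparkSession.builder()
      .appName("shuffle-partitions-demo")
      .master("local[*]")
      .config("spark.sql.shuffle.partitions", "2000")  // raise the default of 200
      .getOrCreate()

    // The setting can also be changed at runtime, before the shuffle runs:
    spark.conf.set("spark.sql.shuffle.partitions", "2000")

    // For RDD operations such as groupByKey, a partition count can also be
    // passed explicitly instead of relying on the default:
    val grouped = spark.sparkContext
      .textFile("hdfs:///path/to/data")            // placeholder path
      .map(line => (line.split(",")(0), line))     // illustrative keying
      .groupByKey(2000)                            // explicit shuffle partition count

Passing the count directly to groupByKey keeps the choice next to the operation that actually triggers the shuffle, which is often easier to reason about than a global setting.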

Could you use reduceByKey/combineByKey instead of groupByKey here?
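
For context, a minimal sketch of the difference, assuming a simple per-key count (the data and aggregation are illustrative): groupByKey ships every value for a key across the shuffle, while reduceByKey combines values map-side first, which usually shrinks the shuffle considerably.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("group-vs-reduce").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Illustrative data: (word, 1) pairs.
    val pairs = sc.parallelize(Seq("a", "b", "a", "c", "a", "b")).map(w => (w, 1))

    // groupByKey moves every value for a key across the network, then sums:
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey sums map-side before the shuffle, so far less data is moved:
    val viaReduce = pairs.reduceByKey(_ + _)

    viaGroup.collect().foreach(println)
    viaReduce.collect().foreach(println)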

