How to dynamically select spark.sql.shuffle.partitions

I am currently processing data with Spark, using foreachPartition to open a connection to MySQL and insert records into the database in batches of 1000. As mentioned in the Spark documentation, the default value of spark.sql.shuffle.partitions is 200, but I want to set it dynamically. How do I calculate it? Choosing a very high value degrades performance, while choosing a very small value causes OOM errors.
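For context, here is a minimal sketch of the batching pattern described above. The MySQL URL, credentials, table my_table, and the two-column DataFrame df are all illustrative assumptions; adapt the SQL to your schema.

import java.sql.DriverManager

df.rdd.foreachPartition { rows =>
  // One connection per partition (hypothetical endpoint and credentials).
  val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "password")
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("INSERT INTO my_table (id, value) VALUES (?, ?)")
  try {
    var count = 0
    rows.foreach { row =>
      stmt.setLong(1, row.getLong(0))
      stmt.setString(2, row.getString(1))
      stmt.addBatch()
      count += 1
      if (count % 1000 == 0) { stmt.executeBatch(); conn.commit() } // flush every 1000 rows
    }
    stmt.executeBatch() // flush the final partial batch
    conn.commit()
  } finally {
    stmt.close()
    conn.close()
  }
}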

2 answers

Try the option below. It sizes the number of shuffle partitions to the total number of executor cores in the cluster:

// Assumes static allocation: spark.executor.instances and spark.executor.cores
// must be set explicitly, otherwise conf.get throws a NoSuchElementException.
val numExecutors         = spark.conf.get("spark.executor.instances").toInt
val numExecutorsCores    = spark.conf.get("spark.executor.cores").toInt
val numShufflePartitions = numExecutors * numExecutorsCores

spark.conf.set("spark.sql.shuffle.partitions", numShufflePartitions)

This sets the shuffle partition count equal to the total number of cores in the cluster, so each core processes exactly one shuffle task per stage.
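A commonly used variant of this heuristic (my addition, not part of the original answer) is to oversubscribe by a small factor, usually 2 to 4 tasks per core, so a few slow tasks don't leave the remaining cores idle:

// tasksPerCore is an illustrative multiplier; 2-4 is a commonly cited range.
val tasksPerCore = 3
val oversubscribedPartitions = numExecutors * numExecutorsCores * tasksPerCore
spark.conf.set("spark.sql.shuffle.partitions", oversubscribedPartitions.toString)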

If you still see OOM errors after that, try increasing the executor memory overhead:

// Note: spark.executor.memoryOverhead is read when executors launch, so set it
// at submit time (e.g. spark-submit --conf spark.executor.memoryOverhead=3G);
// calling spark.conf.set at runtime will not resize already-running executors.
spark.conf.set("spark.executor.memoryOverhead", "3G")

Another approach: derive spark.sql.shuffle.partitions from the size of the data behind the DataFrame. Divide the input size on HDFS by the HDFS block size (128 MB by default) and use the result as the partition count.
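A minimal sketch of that calculation, assuming the input sits at a hypothetical HDFS path /data/input and the default 128 MB block size:

import org.apache.hadoop.fs.{FileSystem, Path}

// Total bytes under the (hypothetical) input path.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val inputBytes = fs.getContentSummary(new Path("/data/input")).getLength

// One shuffle partition per 128 MB HDFS block, with a floor of 1.
val blockSize = 128L * 1024 * 1024
val sizeBasedPartitions = math.max(1L, inputBytes / blockSize).toInt
spark.conf.set("spark.sql.shuffle.partitions", sizeBasedPartitions.toString)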


Alternatively, set the partition count directly by passing numPartitions to repartition(), instead of relying on spark.sql.shuffle.partitions:

df.repartition(numPartitions)   // or rdd.repartition(numPartitions)
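In the scenario from the question this is also a way to cap MySQL load: foreachPartition opens one connection per partition, so repartitioning right before the write bounds the number of concurrent connections. A sketch, with 16 as an illustrative value:

// At most 16 partitions => at most 16 simultaneous MySQL connections.
df.repartition(16).rdd.foreachPartition { rows =>
  // open a connection, batch-insert in chunks of 1000 as above, close it
}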