How to dynamically select spark.sql.shuffle.partitions

I am currently processing data with Spark, using foreachPartition to open a connection to MySQL and insert records into the database in batches of 1000. As mentioned in the Spark documentation, the default value of spark.sql.shuffle.partitions is 200, but I want to set it dynamically. How do I calculate it? Choosing a very high value degrades performance, while choosing a very small value causes OOM errors.
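For context, here is a minimal sketch of the batching pattern described above. The MySQL URL, credentials, table my_table, and the two-column DataFrame df are all illustrative assumptions; adapt the SQL to your schema.

import java.sql.DriverManager

df.rdd.foreachPartition { rows =>
  // One connection per partition (hypothetical endpoint and credentials).
  val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "password")
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("INSERT INTO my_table (id, value) VALUES (?, ?)")
  try {
    var count = 0
    rows.foreach { row =>
      stmt.setLong(1, row.getLong(0))
      stmt.setString(2, row.getString(1))
      stmt.addBatch()
      count += 1
      if (count % 1000 == 0) { stmt.executeBatch(); conn.commit() } // flush every 1000 rows
    }
    stmt.executeBatch() // flush the final partial batch
    conn.commit()
  } finally {
    stmt.close()
    conn.close()
  }
}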

2 answers

Try the option below. It sizes the number of shuffle partitions to the total number of executor cores in the cluster:

// Assumes static allocation: spark.executor.instances and spark.executor.cores
// must be set explicitly, otherwise conf.get throws a NoSuchElementException.
val numExecutors         = spark.conf.get("spark.executor.instances").toInt
val numExecutorsCores    = spark.conf.get("spark.executor.cores").toInt
val numShufflePartitions = numExecutors * numExecutorsCores

spark.conf.set("spark.sql.shuffle.partitions", numShufflePartitions)

This sets the shuffle partition count equal to the total number of cores in the cluster, so each core processes exactly one shuffle task per stage.
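A commonly used variant of this heuristic (my addition, not part of the original answer) is to oversubscribe by a small factor, usually 2 to 4 tasks per core, so a few slow tasks don't leave the remaining cores idle:

// tasksPerCore is an illustrative multiplier; 2-4 is a commonly cited range.
val tasksPerCore = 3
val oversubscribedPartitions = numExecutors * numExecutorsCores * tasksPerCore
spark.conf.set("spark.sql.shuffle.partitions", oversubscribedPartitions.toString)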

If you still see OOM errors after that, try increasing the executor memory overhead:

// Note: spark.executor.memoryOverhead is read when executors launch, so set it
// at submit time (e.g. spark-submit --conf spark.executor.memoryOverhead=3G);
// calling spark.conf.set at runtime will not resize already-running executors.
spark.conf.set("spark.executor.memoryOverhead", "3G")

Another approach: derive spark.sql.shuffle.partitions from the size of the data behind the DataFrame. Divide the input size on HDFS by the HDFS block size (128 MB by default) and use the result as the partition count.
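A minimal sketch of that calculation, assuming the input sits at a hypothetical HDFS path /data/input and the default 128 MB block size:

import org.apache.hadoop.fs.{FileSystem, Path}

// Total bytes under the (hypothetical) input path.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val inputBytes = fs.getContentSummary(new Path("/data/input")).getLength

// One shuffle partition per 128 MB HDFS block, with a floor of 1.
val blockSize = 128L * 1024 * 1024
val sizeBasedPartitions = math.max(1L, inputBytes / blockSize).toInt
spark.conf.set("spark.sql.shuffle.partitions", sizeBasedPartitions.toString)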


Alternatively, set the partition count directly by passing numPartitions to repartition(), instead of relying on spark.sql.shuffle.partitions:

df.repartition(numPartitions)   // or rdd.repartition(numPartitions)
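In the scenario from the question this is also a way to cap MySQL load: foreachPartition opens one connection per partition, so repartitioning right before the write bounds the number of concurrent connections. A sketch, with 16 as an illustrative value:

// At most 16 partitions => at most 16 simultaneous MySQL connections.
df.repartition(16).rdd.foreachPartition { rows =>
  // open a connection, batch-insert in chunks of 1000 as above, close it
}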