Spark: Is there any rule of thumb about the optimal number of RDD partitions relative to the number of elements?

Is there any relationship between the number of elements contained in an RDD and its ideal number of partitions?

I have an RDD with thousands of partitions (because I load it from a source composed of many small files, which I cannot fix, so I have to deal with it). I would like to repartition it (or use the coalesce method). But I do not know in advance the exact number of elements the RDD will contain, so I would like to do this automatically. Something like:

val numberOfElements = rdd.count()
val magicNumber = 100000
// coalesce expects an Int, and the result should never be zero partitions
rdd.coalesce(math.max(1, (numberOfElements / magicNumber).toInt))

Is there any rule of thumb about the optimal number of RDD partitions relative to the number of elements it contains?

Thanks.

+4
2 answers

No, because it is highly dependent on the application, the resources and the data. There are some hard limits (like various 2 GB limits), but otherwise you have to tune it on a task-by-task basis. Some factors to consider:

  • single row / element size
  • the cost of a typical operation: if partitions are small and operations are cheap, scheduling costs can end up much higher than the cost of actually processing the data (see the sketch after this list).
  • the cost of processing a partition when performing partition-wise operations (a sort, for example).
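For example, a quick way to check whether partitions are so small that scheduling overhead dominates is to count the elements in each one. A minimal diagnostic sketch, assuming an existing RDD named rdd:

val perPartitionCounts = rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()

// Print the size of every partition; many near-empty partitions suggest
// that coalescing (or combining the input files) is worthwhile.
perPartitionCounts.foreach { case (idx, n) => println(s"partition $idx: $n elements") }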

If the core problem is the number of initial files, then some variant of CombineFileInputFormat may be a better option than repartitioning / coalescing. For example:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat

// Combine many small files into fewer, larger input splits at read time.
sc.hadoopFile(
  path,
  classOf[CombineTextInputFormat],
  classOf[LongWritable], classOf[Text]
).map(_._2.toString)
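If you also want control over how large the combined splits get (for example, roughly one 64 MB HDFS block per split), setting the standard Hadoop max-split-size property before the sc.hadoopFile call should do it. A hedged sketch, assuming your Hadoop version honours this key for CombineFileInputFormat:

// Cap each combined split at ~64 MB before reading the files.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (64L * 1024 * 1024).toString
)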
+4

zero323, thanks for the answer. The files I load are avro, and I was planning to aim for roughly 64 MB per partition (totalVolume / 64 MB ≈ number of partitions), which matches the HDFS block size (S3 in my case).
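If you want to size partitions by total bytes rather than by element count, one way is to sum the input file sizes via the Hadoop FileSystem API and divide by the target partition size. A rough sketch along those lines; the input path is a made-up placeholder and the 64 MB target comes from the comment above:

import org.apache.hadoop.fs.Path

// Hypothetical input directory; replace with the real source path.
val inputDir = new Path("hdfs:///data/avro-input")
val fs = inputDir.getFileSystem(sc.hadoopConfiguration)
val totalBytes = fs.getContentSummary(inputDir).getLength
val targetBytesPerPartition = 64L * 1024 * 1024  // roughly one HDFS block
val numPartitions = math.max(1, (totalBytes / targetBytesPerPartition).toInt)
val resized = rdd.coalesce(numPartitions)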


+1
