Remove empty partitions from a Spark RDD

I extract data from HDFS and save it to a Spark RDD. Spark creates one partition per HDFS block, which leads to a large number of empty partitions that are still processed during the pipeline. To remove this overhead, I want to filter out all the empty partitions from the RDD. I know about coalesce and repartition, but there is no guarantee that all the empty partitions will be removed.

Is there any other way of doing this?

hadoop apache-spark pyspark rdd
1 answer

There is no easy way to simply delete the empty partitions from an RDD.

coalesce does not guarantee that the empty partitions will be removed, because it only merges existing partitions without shuffling the data. If you have an RDD with 40 empty partitions and 10 partitions with data, there will still be empty partitions after rdd.coalesce(45), as the sketch below demonstrates.
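A minimal sketch of this behaviour, assuming a local PySpark session (the session setup and the count_empty helper are illustrative, not from the question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# 10 records sliced across 50 partitions: 40 of them end up empty.
rdd = sc.parallelize(range(10), 50)

def count_empty(r):
    # glom() turns each partition into a list so we can measure its size.
    return sum(1 for size in r.glom().map(len).collect() if size == 0)

print(count_empty(rdd))               # 40
print(count_empty(rdd.coalesce(45)))  # typically still well above zero
```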

The repartition method performs a full shuffle and distributes the data evenly across all partitions, so no partitions stay empty (as long as there are at least as many records as partitions). If you have an RDD with 50 empty partitions and 10 partitions with data and run rdd.repartition(20), the data will be evenly divided across 20 partitions; see the sketch after this paragraph.
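A sketch under the same assumptions as above (the filter is just a quick way to manufacture empty partitions for the demo):

```python
# 60 partitions, but only the first few hold data after the filter.
skewed = sc.parallelize(range(1000), 60).filter(lambda x: x < 100)
print(count_empty(skewed))   # roughly 54 empty partitions

balanced = skewed.repartition(20)  # full shuffle: costlier than coalesce
print(balanced.getNumPartitions())         # 20
print(balanced.glom().map(len).collect())  # ~5 records each, in practice none empty
```

Note the trade-off: repartition always shuffles all the data over the network, so the even distribution comes at a cost that coalesce was designed to avoid.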

