I extract data from HDFS and load it into a Spark RDD. Spark creates one partition per HDFS block, which leaves me with a large number of empty partitions that are still processed by the rest of the pipeline. To remove this overhead, I want to filter all empty partitions out of the RDD. I know about coalesce() and repartition(), but there is no guarantee that they will eliminate all of the empty partitions. A minimal sketch of my setup is below.
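For context, here is a rough PySpark sketch of what I mean (the HDFS path and partition counts are placeholders, not my real job):

    from pyspark import SparkContext

    sc = SparkContext(appName="empty-partition-demo")

    # Reading from HDFS: Spark creates roughly one partition per HDFS block,
    # so a sparse dataset can leave many partitions with no records at all.
    rdd = sc.textFile("hdfs:///path/to/input")  # placeholder path

    # Count records per partition to see how many partitions are empty.
    sizes = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
    print("total partitions:", len(sizes), "empty:", sizes.count(0))

    # coalesce() / repartition() only reduce the partition count by merging;
    # they do not specifically target the empty partitions.
    fewer = rdd.coalesce(max(1, len(sizes) - sizes.count(0)))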
Is there any other way of doing this?
hadoop apache-spark pyspark rdd