How to repartition data from compressed files in Apache Spark?

I have thousands of 2 GB compressed files sitting in HDFS and I process them with Spark, loading them with the textFile() method. My question is: how can I repartition the data so that each file is processed in parallel? Currently, each .gz file is handled in a single task, so processing 1000 files yields only 1000 tasks. I understand that gzip-compressed files are not splittable, but is there another approach I could use to speed up the job?
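For illustration, the loading step currently looks roughly like this (the path and app name are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="gzip-processing")

    # Each .gz file is unsplittable, so Spark gives it exactly one partition,
    # which means one task per file regardless of file size.
    rdd = sc.textFile("hdfs:///data/input/*.gz")
    print(rdd.getNumPartitions())  # roughly one partition per file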

1 answer

You can call rdd.repartition(#partitions) after loading the file. This comes at the cost of a shuffle, so you need to evaluate whether the performance gained from the increased parallelism outweighs that initial shuffle cost.
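A minimal PySpark sketch of this approach, assuming a placeholder path, target partition count, and transformation:

    from pyspark import SparkContext

    sc = SparkContext(appName="repartition-gz")

    # Each gzip file loads as a single partition because gzip is not splittable.
    rdd = sc.textFile("hdfs:///data/input/*.gz")

    # Explicitly redistribute the records; this triggers a full shuffle,
    # but all downstream stages then run with the higher parallelism.
    repartitioned = rdd.repartition(4000)  # hypothetical target partition count

    # Placeholder transformation standing in for the real processing logic.
    repartitioned.map(lambda line: line.upper()).saveAsTextFile("hdfs:///data/output")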

Another way is to perform whatever transformations you need (map, filter, ...) on the initial partitioning and rely on a shuffle stage already present in your pipeline to repartition the RDD, e.g.:

    rdd.map(...).filter(...).flatMap(...).sortBy(f, numPartitions=new_num_partitions)
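Spelled out as a runnable PySpark sketch (the word-count-style transformations and the partition count are placeholders, not part of the answer itself):

    from pyspark import SparkContext

    sc = SparkContext(appName="shuffle-in-pipeline")
    rdd = sc.textFile("hdfs:///data/input/*.gz")

    new_num_partitions = 4000  # hypothetical target

    # sortBy already shuffles, so asking it for more output partitions
    # redistributes the data without an extra repartition() step.
    result = (rdd
              .map(lambda line: line.strip())
              .filter(lambda line: line)                   # drop empty lines
              .flatMap(lambda line: line.split())          # split into words
              .sortBy(lambda word: word, numPartitions=new_num_partitions))

    result.saveAsTextFile("hdfs:///data/output")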
