Spark: coalesce is very slow, even though the output is very small

I have the following code in Spark:

    myData.filter(t => t.getMyEnum() == null)
          .map(t => t.toString)
          .saveAsTextFile("myOutput")

There are 2000+ files in the myOutput folder, but only a few records satisfy t.getMyEnum() == null, so there are only very few output records. Since I don't want to search for those few records across 2000+ output files, I tried to combine the output using coalesce, as shown below:

    myData.filter(t => t.getMyEnum() == null)
          .map(t => t.toString)
          .coalesce(1, false)
          .saveAsTextFile("myOutput")

Then the job becomes extremely slow! Why is it so slow, when there are only a few output records scattered across the 2000+ partitions? Is there a better way to solve this problem?

1 answer

If you're doing a drastic coalesce, e.g. to numPartitions = 1, this can result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This adds a shuffle step, but it means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
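To make the difference concrete, here is a minimal sketch using the myData RDD from the question (the variable names are just for illustration). Note that in the RDD API, repartition(1) is simply shorthand for coalesce(1, shuffle = true):

    // Without shuffle: coalesce(1) collapses the whole upstream pipeline
    // (filter + map) into a single task, so everything runs on one node.
    val singleTask = myData
      .filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .coalesce(1)

    // With shuffle: filter and map still run as one task per upstream
    // partition; only the (tiny) filtered result is shuffled into one
    // partition. Equivalent to .repartition(1).
    val parallelThenMerge = myData
      .filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .coalesce(1, shuffle = true)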

Note: with shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions, with the data distributed by a hash partitioner.
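As a quick illustration of that note (skewed here is a hypothetical RDD with 100 unevenly sized partitions, not something from the question):

    // Coalescing *up* with shuffle = true redistributes the data
    // across 1000 partitions using a hash partitioner.
    val rebalanced = skewed.coalesce(1000, shuffle = true)
    println(rebalanced.getNumPartitions) // 1000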

So try passing true to the coalesce function: the filter and map then still run in parallel across all 2000+ upstream partitions, and only the small filtered result is shuffled into the single output partition. I.e.:

    myData.filter(_.getMyEnum == null)
          .map(_.toString)
          .coalesce(1, shuffle = true)
          .saveAsTextFile("myOutput")
