I have the following code in Spark:
    myData.filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .saveAsTextFile("myOutput")
There are 2000+ files in the myOutput folder, but only a few records satisfy t.getMyEnum() == null, so there are only very few output records. Since I don't want to hunt for those few outputs across the 2000+ output files, I tried to combine the output using coalesce, as shown below:
    myData.filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .coalesce(1, false)
      .saveAsTextFile("myOutput")
Then the job becomes extremely slow! I wonder why, since there are only a few output records spread across the 2000+ files. Is there a better way to solve this problem?
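One variant I have been wondering about (just a sketch, I have not benchmarked it): RDD.coalesce takes a second shuffle parameter, and passing shuffle = true should let the filter and map still run in parallel across all upstream partitions, with only the final write going to a single partition:

    myData.filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .coalesce(1, shuffle = true)  // with shuffle = true this behaves like repartition(1)
      .saveAsTextFile("myOutput")

Would that avoid the slowdown, or is there a more idiomatic approach?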