If you are willing to write your own output format, you can get the desired behavior with RDDs as well.
Look at the following classes: FileOutputFormat and FileOutputCommitter.
FileOutputFormat has a method called checkOutputSpecs, which checks whether the output directory already exists. FileOutputCommitter has a commitJob method, which usually moves the data from a temporary directory to its final location.
I have not been able to verify this yet (I will do so as soon as I have a few free minutes), but in theory: if I extend FileOutputFormat and override checkOutputSpecs with a method that does not throw an exception when the directory already exists, and adjust the commitJob method of my custom output committer to perform whatever logic I want (for example, overwrite some of the files and add others), then I should be able to achieve the desired behavior with RDDs too.
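To make the idea concrete, here is a minimal sketch of the two pieces of logic described above, modelled with plain java.nio.file on a local filesystem rather than with the Hadoop classes. The class name MergingCommitSketch and the methods checkOutputDir and commitMerge are hypothetical and only for illustration. In a real job, the first would correspond to an override of checkOutputSpecs in a FileOutputFormat subclass and the second to an override of commitJob in a FileOutputCommitter subclass, both working through the Hadoop FileSystem API. The sketch also assumes a flat directory of part files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical, Hadoop-free model of the idea; not the Hadoop API itself.
class MergingCommitSketch {

    // Lenient stand-in for checkOutputSpecs: an existing directory is accepted,
    // only a non-directory at the output path is rejected.
    static void checkOutputDir(Path finalDir) throws IOException {
        if (Files.exists(finalDir) && !Files.isDirectory(finalDir)) {
            throw new IOException("Output path exists but is not a directory: " + finalDir);
        }
    }

    // Stand-in for commitJob: move each regular file from the temporary directory
    // into the final one. Same-named files are overwritten, new files are added,
    // and files that exist only in the final directory are left untouched.
    // Assumption: a flat layout, so subdirectories are ignored.
    static void commitMerge(Path tempDir, Path finalDir) throws IOException {
        Files.createDirectories(finalDir);
        List<Path> produced;
        try (Stream<Path> entries = Files.list(tempDir)) {
            produced = entries.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path src : produced) {
            Path dst = finalDir.resolve(src.getFileName());
            Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tempDir = Files.createTempDirectory("job-temp");
        Path finalDir = Files.createTempDirectory("job-final");
        Files.writeString(finalDir.resolve("part-00000"), "old");   // will be overwritten
        Files.writeString(finalDir.resolve("part-00001"), "kept");  // will be left alone
        Files.writeString(tempDir.resolve("part-00000"), "new");
        Files.writeString(tempDir.resolve("part-00002"), "added");  // will be added

        checkOutputDir(finalDir);        // does not throw although the directory exists
        commitMerge(tempDir, finalDir);
        System.out.println(Files.readString(finalDir.resolve("part-00000"))); // prints: new
    }
}
```

Which files get overwritten and which get added is decided entirely inside commitMerge, which is why the custom committer is the natural place for that policy.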
The output format is passed to saveAsNewAPIHadoopFile (saveAsTextFile relies on the analogous saveAsHadoopFile call to actually write the files), and the output committer is configured at the application level.
Michael Kopaniov, Apr 6 '16 at 18:13