Custom output file name and appending with saveAsTextFile

I know this question has been asked before, but I cannot get a clear working answer.

result.saveAsTextFile(path); 
  • When using Spark's saveAsTextFile, the output is saved under the names "part-00000", "part-00001", etc. Can I change this to a custom name?

  • Can saveAsTextFile append to an existing file rather than overwrite it?

I am coding in Java 7, and the output file system will be in the cloud (Azure, AWS).

1 answer

1) The saveAsTextFile method has no direct support for controlling the output file name. You can try using saveAsHadoopDataset to control the base name of the output file.

For example: instead of part-00000 you can get yourCustomName-00000.

Keep in mind that you cannot control the 00000 suffix with this method. Spark assigns it automatically to each partition during the write, so that each partition writes to a unique file.
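As a rough illustration of the naming scheme described above, the sketch below builds file names from a base name and a partition index. It is plain Java and not Spark's own code; the class PartNames and the method partFileName are hypothetical names, and the five-digit zero padding is taken from the part-00000 example.

```java
import java.util.Locale;

class PartNames {
    // Hypothetical helper: joins the base name and a zero-padded,
    // five-digit partition index, e.g. "customName-00000".
    static String partFileName(String baseName, int partitionIndex) {
        return String.format(Locale.ROOT, "%s-%05d", baseName, partitionIndex);
    }

    public static void main(String[] args) {
        // One file name per partition; only the base name is configurable.
        for (int i = 0; i < 3; i++) {
            System.out.println(partFileName("customName", i));
        }
    }
}
```

Running main prints customName-00000, customName-00001 and customName-00002, one per line, which is why every partition ends up in its own file.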

To control the suffix as well, as mentioned in the comments above, you would have to write your own custom OutputFormat.

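    SparkConf conf = new SparkConf();
    conf.setMaster("local").setAppName("yello");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Hadoop job configuration: base name of the output files and output directory
    JobConf jobConf = new JobConf();
    jobConf.set("mapreduce.output.basename", "customName");
    jobConf.set("mapred.output.dir", "outputPath");

    JavaRDD<String> input = sc.textFile("inputDir");
    // Note: saveAsHadoopDataset is defined on JavaPairRDD, so a JavaRDD<String>
    // has to be converted with mapToPair before this call compiles.
    input.saveAsHadoopDataset(jobConf);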

2) A workaround would be to write the output as is to your output location, and then use Hadoop's FileUtil.copyMerge to form the merged file.
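The merge step can be sketched for a local file system with java.nio only, which may help to see what the workaround amounts to. This is a stand-in for the idea, not FileUtil.copyMerge itself; the class PartMerger, the method appendParts and its parameters are hypothetical, and the sketch assumes the part files sit in one local directory and that the target file lives outside it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PartMerger {
    // Hypothetical helper: appends every "<baseName>-*" file found in partDir
    // to target, in file-name order. The target is created if it is missing
    // and is never truncated, so existing content is kept.
    static void appendParts(Path partDir, String baseName, Path target) throws IOException {
        List<Path> parts = new ArrayList<Path>();
        // The glob skips marker files such as _SUCCESS and hidden checksum files.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(partDir, baseName + "-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        // Zero-padded suffixes sort in partition order: part-00000, part-00001, ...
        Collections.sort(parts);
        try (OutputStream out = Files.newOutputStream(target,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (Path p : parts) {
                Files.copy(p, out);
            }
        }
    }
}
```

A call such as PartMerger.appendParts(Paths.get("outputPath"), "part", Paths.get("merged.txt")) would then append the newly written parts to merged.txt. On a cloud store the same idea would have to go through the Hadoop FileSystem API rather than java.nio.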
