Specifying the output file name in Apache Spark

I have a MapReduce job that I'm trying to migrate to PySpark. Is there a way to specify the name of the output file, rather than getting part-xxxxx?

In MR, I used the org.apache.hadoop.mapred.lib.MultipleTextOutputFormat class for this.

PS: I tried the saveAsTextFile() method. For instance:

import re

lines = sc.textFile(filesToProcessStr)
counts = lines.flatMap(lambda x: re.split(r'[\s&]', x.strip())) \
              .saveAsTextFile("/user/itsjeevs/mymr-output")

This still creates the usual part-0000x files:

[13:46:25] [spark] $ hadoop fs -ls /user/itsjeevs/mymr-output/
Found 3 items
-rw-r-----   2 itsjeevs itsjeevs          0 2014-08-13 13:46 /user/itsjeevs/mymr-output/_SUCCESS
-rw-r--r--   2 itsjeevs itsjeevs  101819636 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00000
-rw-r--r--   2 itsjeevs itsjeevs   17682682 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00001

EDIT

I recently read an article that will make life easier for Spark users.

2 answers

Spark also uses Hadoop under the hood, so you can probably get what you want. Here's how saveAsTextFile is implemented:

def saveAsTextFile(path: String) {
  // Wrap each line as a (NullWritable, Text) pair and write it with Hadoop's TextOutputFormat
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}

So you could pass a custom OutputFormat to saveAsHadoopFile. I'm not sure whether that is possible from Python, though.
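For example, here is a minimal Scala sketch of that idea, assuming a pair RDD of (fileName, line) records; the class name RDDKeyBasedOutput, the output path, and the sample data are made up for illustration:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Hypothetical OutputFormat: name each output file after the record's key
// and drop the key from the written value.
class RDDKeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

object KeyBasedSave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KeyBasedSave"))
    // Made-up example data: each distinct key becomes a file under the output directory
    val rdd = sc.parallelize(Seq(("fileA", "line 1"), ("fileB", "line 2")))
    rdd.saveAsHadoopFile("/tmp/key-based-output",
      classOf[String], classOf[String], classOf[RDDKeyBasedOutput])
    sc.stop()
  }
}

Since this relies on a JVM class, the caveat above still applies: from PySpark you would have to ship such a class on the classpath rather than define it in Python.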


You can control the base name used for the output files, so that instead of part-00000 you get names like:

MyFileName--00000, MyFileName--00001

SparkConf sparkConf = new SparkConf().setAppName("WCSYNC-FileCompressor-ClusterSaver");
SparkContext sc = new SparkContext(sparkConf);
JavaSparkContext context = new JavaSparkContext(sc);

// Replace the default "part" prefix used for output file names
context.hadoopConfiguration().set("mapreduce.output.basename", "myfilename");

// pairRDD stands for the JavaPairRDD<Text, Text> being saved
pairRDD.saveAsNewAPIHadoopFile(outputpath,
        Text.class,
        Text.class,
        TextOutputFormat.class,
        context.hadoopConfiguration());
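FileOutputFormat uses mapreduce.output.basename in place of the default "part" prefix, so the saved files should come out as myfilename-r-00000, myfilename-r-00001, and so on, alongside the usual _SUCCESS marker.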
