I have a MapReduce job that I'm trying to migrate to PySpark. Is there a way to specify the names of the output files, rather than getting part-xxxxx? In MR, I used org.apache.hadoop.mapred.lib.MultipleTextOutputFormat for this.
PS: I tried the saveAsTextFile() method. For instance:

import re

lines = sc.textFile(filesToProcessStr)
lines.flatMap(lambda x: re.split(r'[\s&]', x.strip())) \
     .saveAsTextFile("/user/itsjeevs/mymr-output")

(Note: saveAsTextFile() returns None, so there is no point assigning its result.) This still creates the usual part-xxxxx files:
[13:46:25] [spark] $ hadoop fs -ls /user/itsjeevs/mymr-output/
Found 3 items
-rw-r----- 2 itsjeevs itsjeevs 0 2014-08-13 13:46 /user/itsjeevs/mymr-output/_SUCCESS
-rw-r--r-- 2 itsjeevs itsjeevs 101819636 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00000
-rw-r--r-- 2 itsjeevs itsjeevs 17682682 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00001
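For reference, one common workaround is to keep saveAsTextFile() and rename the part files after the job completes. Below is a minimal sketch, assuming the output directory is reachable as an ordinary filesystem path; the rename_part_files helper and the "wordcounts" prefix are hypothetical names, and on real HDFS the equivalent renames would go through the Hadoop FileSystem API or hadoop fs -mv rather than shutil.

```python
import glob
import os
import shutil
import tempfile

def rename_part_files(output_dir, new_prefix):
    """Rename Spark/MR part-xxxxx files to <new_prefix>-<nnnnn>.txt.

    Hypothetical helper: works on a local path; on HDFS you would
    issue the equivalent renames via the Hadoop FileSystem API.
    """
    renamed = []
    for i, path in enumerate(sorted(glob.glob(os.path.join(output_dir, "part-*")))):
        new_path = os.path.join(output_dir, "%s-%05d.txt" % (new_prefix, i))
        shutil.move(path, new_path)
        renamed.append(new_path)
    return renamed

# Demo with a throwaway directory standing in for the HDFS output path.
demo = tempfile.mkdtemp()
for name in ["part-00000", "part-00001", "_SUCCESS"]:
    open(os.path.join(demo, name), "w").close()

new_names = rename_part_files(demo, "wordcounts")
print([os.path.basename(p) for p in new_names])
# -> ['wordcounts-00000.txt', 'wordcounts-00001.txt']
```

The _SUCCESS marker is left untouched because the glob only matches part-* files; downstream tools that check for it keep working.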
EDIT
I recently read an article that will make life easier for Spark users.
Jeevs