I use PySpark to clean data. A very common operation is to take a small subset of a file and export it for verification:
(self.spark_context.textFile(old_filepath+filename)
.takeOrdered(100)
.saveAsTextFile(new_filepath+filename))
My problem is that takeOrdered returns a plain Python list rather than an RDD, so saveAsTextFile fails with:
AttributeError: 'list' object has no attribute 'saveAsTextFile'
Of course, I could implement my own file writer, or I could convert the list back into a parallelized RDD. But I'm trying to stay a Spark purist.
Is there no way to return an RDD from a takeOrdered or equivalent function?