I use PySpark to clean data. A very common operation is to take a small subset of a file and export it for verification:
(self.spark_context.textFile(old_filepath+filename)
.takeOrdered(100)
.saveAsTextFile(new_filepath+filename))
My problem is that takeOrdered returns a plain Python list rather than an RDD, so saveAsTextFile fails with:
AttributeError: 'list' object has no attribute 'saveAsTextFile'
Of course, I could implement my own file writer, or I could convert the list back into a parallelized RDD. But I'm trying to stay a Spark purist.
Is there no way to return an RDD from a takeOrdered or equivalent function?