When using Spark Streaming, is it possible to get the first n elements of each RDD in a DStream? In my real use case, the stream consists of events with geotags, and I want to take the 100 (or some other number of) events closest to a given point for further processing. As a simple example of what I'm trying to do:
```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object take {
  def main(args: Array[String]) {
    val data = 1 to 10
    val sparkConf = new SparkConf().setAppName("Take")
    val streamingContext = new StreamingContext(sparkConf, Seconds(1))
    val rdd = streamingContext.sparkContext.makeRDD(data)
    val stream = new ConstantInputDStream(streamingContext, rdd)
    // ... how do I take the first n elements of each RDD in `stream` here?
  }
}
```
I understand that I could return the top n results to the driver quite easily, but that is not what I want in this case, since I need to continue processing the RDD after it has been filtered.
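One approach that keeps the data on the executors (rather than collecting to the driver) is to use `DStream.transform` together with `sortBy`, `zipWithIndex`, and `filter`. This is a hedged sketch, not necessarily the only or best way; the sort key here is `identity` for the integer example, where the real use case would sort by distance to the reference point, and `n` is a hypothetical parameter:

```scala
import org.apache.spark.streaming.dstream.DStream

// Assumed parameter: how many elements to keep per RDD (100 in the real use case).
val n = 3

// For each RDD in the stream: sort (by distance in the real case), attach a
// global index to each element, keep only the first n, and drop the index.
// The result is still a DStream[Int], so processing can continue on it.
def firstN(stream: DStream[Int]): DStream[Int] =
  stream.transform { rdd =>
    rdd.sortBy(identity)
       .zipWithIndex()
       .filter { case (_, index) => index < n }
       .map { case (value, _) => value }
  }
```

Note that `sortBy` triggers a shuffle and `zipWithIndex` launches a small job per batch to compute partition offsets, so this is heavier than a driver-side `take(n)`, but the filtered RDD stays distributed for downstream processing.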