Can I get the first n elements of each RDD in Spark Streaming?

When using Spark Streaming, is it possible to get the first n elements of each RDD in a DStream? In the real world, my stream consists of several events with geotags, and I want to take 100 (or something else) that are closest to this point for further processing, but a simple example that shows that I'm trying to do something like :

 import org.apache.spark.SparkConf import org.apache.spark.streaming.dstream.ConstantInputDStream import org.apache.spark.streaming.{Seconds, StreamingContext} object take { def main(args: Array[String]) { val data = 1 to 10 val sparkConf = new SparkConf().setAppName("Take"); val streamingContext = new StreamingContext(sparkConf, Seconds(1)) val rdd = streamingContext.sparkContext.makeRDD(data) val stream = new ConstantInputDStream(streamingContext, rdd) // In the real world, do a bunch of stuff which results in an ordered RDD // This obviously doesn't work // val filtered = stream.transform { _.take(5) } // In the real world, do some more processing on the DStream stream.print() streamingContext.start() streamingContext.awaitTermination() } } 

I understand that I could return the top n results back to the driver quite easily, but this is not what I want to do in this case, since I need to continue processing on the RDD after it has filtered it.

+5
source share
1 answer

Why doesn't it work? I think your example is fine.

  • You must calculate the distance for each event.
  • Sort events by distance with multiple sections tailored to your amount of data.
  • Take the first 100 events from each section (so you shuffle a small portion of the source data), make the returned collection a new RDD with sparkContext.parallelize (data)
  • Sorting again with only one partition, so all data is shuffled into the same data set
  • Take the first 100 events, this is your top 100

The code for sorting is the same as in steps 2 and 4, you just change the number of sections.

Step 1 is performed on the DStream, steps 2-5 are executed on the RDD in the conversion operation.

+7
source

All Articles