Spark Streaming: how to periodically update cached RDDs?

In a Spark Streaming application, I want to enrich values against a dictionary that is fetched from a backend (Elasticsearch). I also want to refresh the dictionary periodically whenever it changes in the backend, similar to Logstash's periodic filter-refresh feature. How can I achieve this with Spark (for example, by somehow unpersisting and reloading the RDD every 30 seconds)?

1 answer

The best way I've found to do this is to recreate the RDD and maintain a mutable reference to it. Spark Streaming is essentially a scheduling framework on top of Spark, so we can piggyback on its scheduler to refresh the RDD periodically. To do that, we use an empty DStream that we schedule only for the refresh operation:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.ConstantInputDStream

def getData(): RDD[Data] = ??? // function to create the RDD we want to use as reference data
val dstream = ???              // our data stream

// a dstream of empty data, used only to trigger the refresh on a schedule
val refreshDstream = new ConstantInputDStream(ssc, sparkContext.parallelize(Seq()))
  .window(Seconds(refreshInterval), Seconds(refreshInterval))

var referenceData = getData()
referenceData.cache()

refreshDstream.foreachRDD { _ =>
  // evict the old RDD from memory and recreate it
  referenceData.unpersist(true)
  referenceData = getData()
  referenceData.cache()
}

val myBusinessData = dstream.transform(rdd => rdd.join(referenceData))
... etc ...
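
The answer leaves getData() unimplemented. Since the question loads the dictionary from Elasticsearch, one possible implementation (a sketch only, not part of the original answer) uses the elasticsearch-hadoop connector; the index name "dictionary/entry", the field names, and the Data case class are all hypothetical:

import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._ // elasticsearch-hadoop connector

case class Data(key: String, value: String) // hypothetical shape of a dictionary entry

// Assumes es.nodes / es.port are set on the SparkConf.
// esRDD returns an RDD of (documentId, fieldMap) pairs.
def getData(): RDD[Data] =
  sparkContext.esRDD("dictionary/entry").map { case (_, doc) =>
    Data(doc("key").toString, doc("value").toString)
  }

Each call re-reads the index, so the foreachRDD refresh above always picks up the latest dictionary. Note also that for rdd.join(referenceData) to compile, referenceData must be a pair RDD, e.g. keyed with referenceData.map(d => d.key -> d.value).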

In the past I also tried just alternating cache() and unpersist(), to no avail (the data refreshes only once). Recreating the RDD removes all lineage and provides a clean load of the new data.
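
For context, a minimal sketch of how the pieces above might be wired into a job; the application name, batch interval, and the 30-second refresh interval (taken from the question) are assumptions, not part of the original answer:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical driver setup; adjust intervals to your workload.
val conf = new SparkConf().setAppName("dictionary-enrichment")
val ssc = new StreamingContext(conf, Seconds(5)) // batch interval (assumption)
val sparkContext = ssc.sparkContext
val refreshInterval = 30 // seconds, matching the 30s mentioned in the question

// ... define dstream, refreshDstream, referenceData and myBusinessData as above ...

ssc.start()
ssc.awaitTermination()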

