Does Spark caching improve performance?

So, I am performing several operations on the same RDD inside a Kafka dstream. Will caching that RDD improve performance?

+5
2 answers

When performing multiple operations on the same dstream, cache will significantly improve performance. This can be seen in the Spark UI:

Without cache, each iteration over the dstream takes the same amount of time, so the total time spent processing the data in each batch interval grows linearly with the number of iterations over the data: [screenshot: Spark Streaming without cache]

When cache is used, the first time the transformation pipeline runs on the RDD, the RDD is cached, and every subsequent iteration over that RDD takes only a fraction of the time.

(In this screenshot, the execution time of the same job was further reduced from 3 s to 0.4 s by reducing the number of partitions.) [screenshot: Spark Streaming with cache]
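For reference, dstream-level caching of the kind compared in these screenshots could look roughly like this; a minimal sketch, where the Kafka-backed dstream of elements with an id field and the local collection ids are assumptions, not taken from the question:

 // Minimal sketch: `dstream` and `ids` are assumed, not part of the original question.
 dstream.cache()                                  // each generated RDD is persisted on its first use
 ids.foreach { id =>
   dstream.filter(elem => elem.id == id)          // one pass over the cached data per id
          .foreachRDD(rdd => println(s"$id: ${rdd.count()} elements"))
 }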

Instead of dstream.cache, I would recommend using dstream.foreachRDD or dstream.transform to get direct access to the underlying RDD and apply persist/unpersist around the iterative code, so that memory is freed as soon as possible:

 dstream.foreachRDD { rdd =>
   rdd.cache()                                             // persist the RDD before iterating over it
   col.foreach { id =>
     rdd.filter(elem => elem.id == id).map(...).saveAs...  // one pass over the cached RDD per id
   }
   rdd.unpersist(true)                                     // free the memory as soon as the iterations are done
 }

Otherwise, you have to wait for the interval configured in spark.cleaner.ttl before the memory is cleared.

Note that the default value of spark.cleaner.ttl is infinite, which is not recommended for a 24x7 Spark Streaming job.
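If you do want that safety net, the TTL can be set on the SparkConf used to build the streaming context; a minimal sketch, where the application name, batch interval, and TTL value are only illustrative:

 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 // Illustrative values only; spark.cleaner.ttl is expressed in seconds.
 val conf = new SparkConf()
   .setAppName("kafka-streaming-job")
   .set("spark.cleaner.ttl", "3600")       // clean up old metadata/RDDs after one hour
 val ssc = new StreamingContext(conf, Seconds(10))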

+4

Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, for example when querying a small “hot” dataset or when running an iterative algorithm like PageRank.

https://spark.apache.org/docs/latest/quick-start.html#caching
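The caching section of that quick start boils down to something like the following; a minimal sketch in spark-shell style, where the input file is just an illustrative placeholder:

 // spark-shell sketch; "README.md" is an illustrative input file, not taken from the linked page.
 val linesWithSpark = spark.read.textFile("README.md").filter(line => line.contains("Spark"))
 linesWithSpark.cache()      // mark the dataset for in-memory caching
 linesWithSpark.count()      // the first action reads the file and populates the cache
 linesWithSpark.count()      // subsequent actions reuse the cached data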

+1
