When performing several operations on the same dstream, cache will significantly improve performance. This can be seen in the Spark UI:
Without cache, each iteration over the dstream takes the same amount of time, so the total data processing time in each batch interval is linear in the number of iterations over the data:
When cache is used, the first execution of the transformation pipeline over the RDD caches that RDD, and every subsequent iteration over it takes only a fraction of the time.
(In this screenshot, the execution time of the same job dropped from 3 s to 0.4 s due to a reduction in the number of partitions.)
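For illustration, a rough sketch of that dstream.cache route. The Event element type, the ids list, and the count() action are hypothetical stand-ins, not from the measured job:

    // Hypothetical element type and key list, for illustration only.
    case class Event(id: Int, payload: String)
    val ids = Seq(1, 2, 3)

    // dstream: DStream[Event], built elsewhere.
    // cache() marks every RDD generated by this dstream for caching:
    // the first action per batch materializes it, later passes reuse it.
    dstream.cache()
    dstream.foreachRDD { rdd =>
      ids.foreach { id =>
        rdd.filter(_.id == id).count() // cheap after the first pass
      }
    }
    // Downside: the cached blocks linger until the cleaner drops them.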
Instead of using dstream.cache, I would recommend using dstream.foreachRDD or dstream.transform to get direct access to the underlying RDD and apply the persist operation there. We use persist and unpersist around the iterative code to free the memory as soon as possible:
    dstream.foreachRDD { rdd =>
      rdd.cache()
      col.foreach { id =>
        rdd.filter(elem => elem.id == id).map(...).saveAs...
      }
      rdd.unpersist(true)
    }
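For context, here is a self-contained, runnable sketch of the same pattern. The socket source, the Event type, and the /tmp output path are hypothetical placeholders standing in for the elided map(...) and saveAs... calls:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CachedIterationDemo {
      // Hypothetical record type; the snippet above leaves the element type open.
      case class Event(id: Int, payload: String)

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("CachedIterationDemo").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(10))

        val col = Seq(1, 2, 3) // the ids iterated over, as in the snippet above

        // Hypothetical source: "id,payload" lines arriving on a socket.
        val dstream = ssc.socketTextStream("localhost", 9999)
          .map(_.split(",", 2))
          .filter(_.length == 2)
          .map(a => Event(a(0).trim.toInt, a(1))) // assumes well-formed ids

        dstream.foreachRDD { (rdd, time) =>
          rdd.cache() // materialize once; the passes below read cached blocks
          col.foreach { id =>
            rdd.filter(_.id == id)
              .map(_.payload)
              .saveAsTextFile(s"/tmp/out/$id-${time.milliseconds}") // placeholder sink
          }
          rdd.unpersist(blocking = true) // release the memory immediately
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

Calling unpersist(blocking = true) at the end of the batch frees the cached blocks right away instead of leaving them for the cleaner.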
Otherwise, you have to wait for the interval configured in spark.cleaner.ttl before the memory is cleared.
Note that the default value of spark.cleaner.ttl is infinite, which is not recommended for a 24x7 Spark Streaming job.
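If you do rely on it, spark.cleaner.ttl is an ordinary Spark property, set in seconds on the SparkConf before the context is created. A sketch, with one hour as an arbitrary example value (later Spark releases removed this TTL-based cleaner in favor of reference-tracking cleanup):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("StreamingJob")
      // Clean up persisted RDDs and metadata older than 3600 s;
      // the default (infinite) lets state accumulate in a long-running job.
      .set("spark.cleaner.ttl", "3600")
    val ssc = new StreamingContext(conf, Seconds(10))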