Spark cache stores only part of an RDD

When I explicitly call rdd.cache(), the Storage tab of the Spark UI shows that only part of the RDD is actually cached. My question is: where are the rest of the partitions? How does Spark decide which ones to keep in the cache?

The same question applies to the source data read by sc.textFile(). My understanding is that these RDDs are cached automatically, even though the Storage tab shows no cache status for them. Is there a way to know how much of them is cached versus missing?

1 answer

cache() is the same as persist(StorageLevel.MEMORY_ONLY), and your data volume probably exceeds the available memory. Spark then evicts cached partitions on a least-recently-used (LRU) basis. The evicted partitions are not stored anywhere; they are simply recomputed from the lineage the next time an action needs them.
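As a minimal sketch of the two options (the RDD here is a hypothetical example; any RDD behaves the same way), if recomputation is expensive you can ask Spark to spill overflow partitions to local disk instead of dropping them:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical RDD for illustration.
val rdd = sc.parallelize(1 to 1000000)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
// partitions that don't fit in memory are dropped (LRU) and
// recomputed from the lineage when needed again.
rdd.cache()

// A storage level cannot be changed while the RDD is persisted,
// so unpersist first, then spill overflow to local disk instead:
rdd.unpersist()
rdd.persist(StorageLevel.MEMORY_AND_DISK)
```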

You can configure how much memory is reserved for caching through configuration parameters. For more information see the Spark documentation, in particular: spark.driver.memory, spark.executor.memory, spark.storage.memoryFraction
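A hedged example of setting these programmatically; the values are illustrative, not recommendations. Note that spark.storage.memoryFraction is the legacy setting from Spark 1.5 and earlier; from Spark 1.6 on, the unified memory manager uses spark.memory.fraction and spark.memory.storageFraction instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only -- tune for your cluster and Spark version.
val conf = new SparkConf()
  .setAppName("cache-demo")                    // hypothetical app name
  .set("spark.executor.memory", "8g")
  .set("spark.storage.memoryFraction", "0.6")  // legacy setting, pre-1.6
// spark.driver.memory must be set before the driver JVM starts,
// e.g. via spark-submit --driver-memory, not here at runtime.
val sc = new SparkContext(conf)
```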

Not an expert, but I don't think textFile() caches anything automatically; the Spark Quick Start explicitly caches the RDD read from a text file: sc.textFile(logFile, 2).cache()
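In context, that Quick Start pattern looks like the following sketch ("README.md" is a hypothetical input path):

```scala
// Caching is explicit, not automatic: without cache(), every
// action would re-read the file from storage.
val logData = sc.textFile("README.md", 2).cache()
logData.count()  // first action computes the RDD and fills the cache
logData.count()  // subsequent actions read the cached partitions
```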
