First, note that Spark does not cache all of your RDDs automatically, simply because applications can create many RDDs and not all of them need to be reused. You have to call .persist() or .cache() on the ones you want to keep around.
You can set the storage level with which an RDD is saved, e.g. myRDD.persist(StorageLevel.MEMORY_AND_DISK). .cache() is shorthand for .persist(StorageLevel.MEMORY_ONLY).
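A minimal Scala sketch (assuming an existing SparkContext named sc and a placeholder input path):

```scala
import org.apache.spark.storage.StorageLevel

// Placeholder input path; any RDD works the same way.
val myRDD = sc.textFile("hdfs:///data/input.txt")

// Explicit storage level: keep partitions in memory, spill to disk when they don't fit.
myRDD.persist(StorageLevel.MEMORY_AND_DISK)

// For another RDD, .cache() is equivalent to .persist(StorageLevel.MEMORY_ONLY).
val otherRDD = sc.parallelize(1 to 100)
otherRDD.cache()
```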
The default storage level for persist() really is StorageLevel.MEMORY_ONLY for an RDD in Java or Scala, but it is usually different if you are creating a DStream (check the API doc of your DStream's constructor). If you use Python, it is StorageLevel.MEMORY_ONLY_SER.
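For example, receiver-based DStreams take their storage level as a constructor/factory argument; this is a sketch with placeholder host and port:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("storage-level-demo")
val ssc  = new StreamingContext(conf, Seconds(10))

// socketTextStream defaults to MEMORY_AND_DISK_SER_2 for received data,
// but you can pass a different StorageLevel explicitly.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
```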
The doc details a number of storage levels and what they mean, but fundamentally they are configuration shorthand that points Spark to an object extending the StorageLevel class. You can thus define your own level, with a replication factor of up to 40.
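For instance, a custom level can be built through the StorageLevel factory (a sketch; the value name is arbitrary and myRDD is the RDD from the earlier example):

```scala
import org.apache.spark.storage.StorageLevel

// Arguments: useDisk, useMemory, useOffHeap, deserialized, replication.
// Spark requires the replication factor to stay below 40.
val memDiskReplicated3 = StorageLevel(
  useDisk = true,
  useMemory = true,
  useOffHeap = false,
  deserialized = true,
  replication = 3
)

myRDD.persist(memDiskReplicated3)
```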
Note that several of the predefined storage levels keep a single copy of the RDD. In fact, this is true for every level whose name is not suffixed with _2 (except NONE):
- DISK_ONLY
- MEMORY_ONLY
- MEMORY_ONLY_SER
- MEMORY_AND_DISK
- MEMORY_AND_DISK_SER
- OFF_HEAP
That is one copy per storage medium that the level uses, of course; if you want a single copy overall, you need to choose a storage level that uses a single medium.
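You can check the replication factor of any level directly, e.g. in spark-shell:

```scala
import org.apache.spark.storage.StorageLevel

println(StorageLevel.MEMORY_AND_DISK.replication) // 1
println(StorageLevel.MEMORY_ONLY_2.replication)   // 2
```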