Is there a way to change the RDD replication factor in Spark?

From what I understand, an RDD keeps several copies of its data across the cluster, so that the program can recover in the event of a node failure. However, in cases where the likelihood of a failure is negligible, it would be expensive in terms of memory to keep multiple copies of the RDD's data. So my question is: is there a parameter in Spark that can be used to reduce the RDD replication factor?

2 answers

First, note that Spark does not cache all of your RDDs by default, simply because applications may create many RDDs and not all of them are meant to be reused. You have to call .persist() or .cache() on them.

You can set the storage level with which an RDD is persisted using myRDD.persist(StorageLevel.MEMORY_AND_DISK). .cache() is shorthand for .persist(StorageLevel.MEMORY_ONLY).
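
For illustration, here is a minimal Scala sketch (the RDD contents and variable names are invented for this example):

    import org.apache.spark.storage.StorageLevel

    // assumes an existing SparkContext named sc
    val myRDD = sc.parallelize(1 to 1000000)

    // keep one deserialized copy in memory, spilling to disk if it does not fit
    myRDD.persist(StorageLevel.MEMORY_AND_DISK)

    // .cache() is the same as .persist(StorageLevel.MEMORY_ONLY); note that an
    // RDD's storage level can only be assigned once, hence a second RDD here
    val otherRDD = sc.parallelize(1 to 100).cache()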

The default storage level for persist() is indeed StorageLevel.MEMORY_ONLY for an RDD in Java or Scala, but it usually differs if what you are creating is a DStream (refer to your DStream constructor's API doc). If you are using Python, the default is StorageLevel.MEMORY_ONLY_SER.

The docs detail a number of storage levels and what they mean, but fundamentally they are a configuration shorthand that points Spark to an object extending the StorageLevel class. You can thus define your own storage level with a replication factor of up to 40.
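
As a sketch of what defining your own level could look like in Scala (the values here are illustrative, not a recommendation):

    import org.apache.spark.storage.StorageLevel

    // a custom level: in memory only, deserialized, 3 replicas of each partition
    val memoryOnly3 = StorageLevel(
      useDisk = false,
      useMemory = true,
      deserialized = true,
      replication = 3
    )

    myRDD.persist(memoryOnly3)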

Note that among the various predefined storage levels, some keep only a single copy of the RDD. In fact, that is true of all of those whose names are not suffixed with _2 (except NONE):

  • DISK_ONLY
  • MEMORY_ONLY
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
  • OFF_HEAP

That is one copy per storage medium they employ, of course; if you want a single copy overall, you have to choose a single-medium storage level.
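
A quick way to check how many replicas a given level implies is its replication field, as in this small Scala sketch:

    import org.apache.spark.storage.StorageLevel

    // each predefined level exposes its replication factor
    println(StorageLevel.MEMORY_ONLY.replication)   // 1
    println(StorageLevel.MEMORY_ONLY_2.replication) // 2
    println(StorageLevel.DISK_ONLY.replication)     // 1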


As huitseeker said, unless you specifically ask Spark to persist the RDD and specify a StorageLevel that uses replication, there will not be multiple copies of the RDD's partitions.

What Spark does do is keep the lineage of how a certain piece of data was computed, so that when/if a node fails, it only repeats the processing of the relevant data needed to recover the lost RDD partitions. In my experience this mostly works, although it is sometimes faster to restart the job than to let it recover.
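
You can inspect that lineage yourself with RDD.toDebugString; a small Scala sketch (the data is invented for the example):

    // assumes an existing SparkContext named sc
    val words = sc.parallelize(Seq("a", "b", "a", "c"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // prints the chain of parent RDDs Spark would recompute after a failure
    println(counts.toDebugString)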

