I am trying to have a Spark RDD in which the elements of each partition all share access to one single large object. However, this object seems to be stored in memory several times. Reducing my problem to a toy case of a single partition with only 200 elements:
val nElements = 200
class Elem(val s:Array[Int])
val rdd = sc.parallelize(Seq(1)).mapPartitions( _ => {
val sharedArray = Array.ofDim[Int](10000000) // Should require ~40MB
(1 to nElements).toIterator.map(i => new Elem(sharedArray))
}).cache()
rdd.count() //force computation
This consumes the expected amount of memory, as shown in the logs:
storage.MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.2 MB, free 5.7 GB)
However, 200 is the maximum number of elements for which this holds. Setting nElements=201 gives:
storage.MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 76.7 MB, free 5.7 GB)
What causes this? Where does this magic number 200 come from, and how can it be increased?
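For reference, the estimate can also be cross-checked outside the block cache. Assuming a Spark version in which org.apache.spark.util.SizeEstimator is publicly accessible, the sketch below (reusing the Elem class from above; sharedArray, one and many are just illustrative names) estimates a collection of elements that all share a single array. I would expect both numbers to stay around ~40MB, no matter how many elements there are:

import org.apache.spark.util.SizeEstimator

// Cross-check of the size estimate, reusing the Elem class defined above.
// One element and 200 elements share the same backing array, so both
// estimates should be dominated by the single ~40MB array.
val sharedArray = Array.ofDim[Int](10000000)
val one = Seq(new Elem(sharedArray))
val many = Seq.fill(200)(new Elem(sharedArray))
println(SizeEstimator.estimate(one))  // expectation: roughly 40MB
println(SizeEstimator.estimate(many)) // expectation: roughly the same, since the array is shared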
EDIT FOR CLARIFICATIONS:
In response to the comments: a println of the following check,
rdd.map(_.s.hashCode).min == rdd.map(_.s.hashCode).max
prints true, i.e. all the elements do point to the same array of 10000000 ints, so the data itself is shared as intended. The problem appears when nExamples is larger (e.g. 20000), which gives:
storage.MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 6.1 GB so far)
With nExamples=500, the rdd caches with a reported estimated size of 1907.4 MB, yet the actual increase in memory usage is far smaller than that.
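For completeness: I know I could probably sidestep this by keeping the large array in a broadcast variable, so that it is stored once per executor while the cached elements stay tiny. A rough sketch of that idea is below (bcArray and lightRdd are made-up names, and the map at the end is just an illustrative access pattern); still, I would like to understand why the cache estimate blows up in the first place.

// Possible workaround sketch: store the big array once per executor via a
// broadcast variable and cache only lightweight element ids. Computations
// that need the array read it through the broadcast instead of holding a
// reference inside each cached element.
val bcArray = sc.broadcast(Array.ofDim[Int](10000000)) // ~40MB, shipped once per executor
val lightRdd = sc.parallelize(1 to nElements, 1).cache() // cached block stays small
lightRdd.map(i => bcArray.value(i)).count() // example access to the shared array

With this layout the MemoryStore only has to account for nElements Ints, so whatever is being counted per element stays negligible.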