Spark RDD with shared pointers in each partition (and a magic number of 200?)

I am trying to persist a Spark RDD in which the elements of each partition all share access to one large object. However, this object seems to be stored in memory several times. Reducing my problem to a toy case of a single partition with only 200 elements:

val nElements = 200
class Elem(val s:Array[Int])

val rdd = sc.parallelize(Seq(1)).mapPartitions( _ => {
    val sharedArray = Array.ofDim[Int](10000000) // Should require ~40MB
    (1 to nElements).toIterator.map(i => new Elem(sharedArray))
}).cache()

rdd.count() //force computation    

This consumes the expected amount of memory, as shown in the logs:

storage.MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.2 MB, free 5.7 GB)

However, 200 is the maximum number of elements for which this holds. Setting nElements=201 gives:

storage.MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 76.7 MB, free 5.7 GB)

What causes this? Where does this magic number 200 come from, and how can I increase it?
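(As a side note, the figure in the log is an estimated size rather than measured heap usage. Below is a minimal sketch to reproduce the estimate outside of caching; it assumes a Spark version in which org.apache.spark.util.SizeEstimator is publicly accessible and reuses the Elem class from above. If the jump comes from how the size of the element array is estimated, the second figure should come out noticeably larger than the first, even though both hold the same ~40MB of data.)

import org.apache.spark.util.SizeEstimator

val sharedArray = Array.ofDim[Int](10000000)        // ~40MB of real data
val small = Array.fill(200)(new Elem(sharedArray))  // at the threshold observed above
val large = Array.fill(201)(new Elem(sharedArray))  // one element past it

println(SizeEstimator.estimate(small))
println(SizeEstimator.estimate(large))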


EDIT FOR CLARIFICATIONS:

Adding a println inside the closure confirms that it is in fact executed only once. Furthermore, running:

rdd.map(_.s.hashCode).min == rdd.map(_.s.hashCode).max  // returns true

..shows that all the elements do indeed point to the same 10000000-int array, so the data structure behaves as intended. The problem arises when nExamples is much larger (e.g. 20000), in which case the block can no longer be cached at all:

storage.MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 6.1 GB so far)

With nExamples=500, the RDD does persist, with an estimated size of 1907.4 MB, but my actual memory usage is far lower than that.
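(One rough way to check actual heap usage, assuming local mode so the cached block lives in this JVM, is something like the following; usedHeapMB is just a throwaway helper and the numbers are approximate at best:)

def usedHeapMB(): Long = {
    System.gc()                                  // best effort; not a precise measurement
    val rt = Runtime.getRuntime
    (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
}

val before = usedHeapMB()
rdd.count()                                      // measured around the first count(), which populates the cache
val after = usedHeapMB()
println(s"heap grew by ~${after - before} MB")   // far below the estimated size in the log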

1 Answer

This is not a full answer, but a workaround (you give up some laziness). Instead of rdd.cache(), use:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def cached[T: ClassTag](rdd: RDD[T]) = {
    rdd.mapPartitions(p =>
        Iterator(p.toSeq)       // materialize the whole partition as a single Seq
    ).cache().mapPartitions(p =>
        p.next().toIterator     // unwrap it back into an iterator over the elements
    )
}

cached(rdd) then returns an RDD that is cached "properly".
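For example, with the Elem class and nElements from the question, it slots in like this (just a sketch; cachedRdd is an arbitrary name):

val rdd = sc.parallelize(Seq(1)).mapPartitions( _ => {
    val sharedArray = Array.ofDim[Int](10000000)
    (1 to nElements).toIterator.map(i => new Elem(sharedArray))
})

val cachedRdd = cached(rdd)  // what gets cached is one Seq per partition
cachedRdd.count()            // forces materialization, as in the original example

The price is that each partition must be materialized as an in-memory Seq before it can be cached, rather than being consumed lazily element by element.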

