I am trying to have a Spark RDD in which the elements of each partition all share access to one single large object. However, this object seems to be stored in memory several times. Reducing my problem to a toy case of a single partition with only 200 elements:
val nElements = 200
class Elem(val s:Array[Int])
val rdd = sc.parallelize(Seq(1)).mapPartitions( _ => {
val sharedArray = Array.ofDim[Int](10000000) // Should require ~40MB
(1 to nElements).toIterator.map(i => new Elem(sharedArray))
}).cache()
rdd.count() //force computation
This consumes the expected amount of memory, as shown in the logs:
storage.MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.2 MB, free 5.7 GB)
However, 200 is the maximum number of elements for which this holds. Setting nElements=201 gives:
storage.MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 76.7 MB, free 5.7 GB)
What causes this? Where does this magic number 200 come from, and how can it be increased?
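For reference, the estimate can also be cross-checked outside the block cache. Assuming a Spark version in which org.apache.spark.util.SizeEstimator is publicly accessible, the sketch below (reusing the Elem class from above; sharedArray, one and many are just illustrative names) estimates a collection of elements that all share a single array. I would expect both numbers to stay around ~40MB, no matter how many elements there are:

import org.apache.spark.util.SizeEstimator

// Cross-check of the size estimate, reusing the Elem class defined above.
// One element and 200 elements share the same backing array, so both
// estimates should be dominated by the single ~40MB array.
val sharedArray = Array.ofDim[Int](10000000)
val one = Seq(new Elem(sharedArray))
val many = Seq.fill(200)(new Elem(sharedArray))
println(SizeEstimator.estimate(one))  // expectation: roughly 40MB
println(SizeEstimator.estimate(many)) // expectation: roughly the same, since the array is shared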
EDIT FOR CLARIFICATIONS:
In response to the comments: a println of the following check,
rdd.map(_.s.hashCode).min == rdd.map(_.s.hashCode).max
prints true, i.e. all the elements do point to the same array of 10000000 ints, so the data itself is shared as intended. The problem appears when nExamples is larger (e.g. 20000), which gives:
storage.MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 6.1 GB so far)
With nExamples=500, the rdd caches with a reported estimated size of 1907.4 MB, yet the actual increase in memory usage is far smaller than that.
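For completeness: I know I could probably sidestep this by keeping the large array in a broadcast variable, so that it is stored once per executor while the cached elements stay tiny. A rough sketch of that idea is below (bcArray and lightRdd are made-up names, and the map at the end is just an illustrative access pattern); still, I would like to understand why the cache estimate blows up in the first place.

// Possible workaround sketch: store the big array once per executor via a
// broadcast variable and cache only lightweight element ids. Computations
// that need the array read it through the broadcast instead of holding a
// reference inside each cached element.
val bcArray = sc.broadcast(Array.ofDim[Int](10000000)) // ~40MB, shipped once per executor
val lightRdd = sc.parallelize(1 to nElements, 1).cache() // cached block stays small
lightRdd.map(i => bcArray.value(i)).count() // example access to the shared array

With this layout the MemoryStore only has to account for nElements Ints, so whatever is being counted per element stays negligible.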