SPARK-5063 is about providing better error messages when code tries to nest RDD operations, which Spark does not support.
It is a usability issue, not a functional one. The root cause is the nesting of RDD operations, and the solution is to break that nesting up.
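To make the root cause concrete, here is a minimal sketch of what a nested RDD operation typically looks like, next to a flattened equivalent. The object name, the helper names `nested` and `flattened`, the element types and the sample data are all assumptions made for illustration; they are not taken from the original question. In Spark versions that include the SPARK-5063 change, running an action on the nested version is expected to fail with a SparkException that mentions the ticket.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.util.Try

// Hypothetical names and data, for illustration only.
object NestedRddSketch {
  // Unsupported: mRDD is used inside a transformation of dRDD.
  // RDDs are lazy, so the failure only surfaces once an action runs.
  def nested(dRDD: RDD[String], mRDD: RDD[(String, Int)]): RDD[Seq[Int]] =
    dRDD.map(k => mRDD.lookup(k))

  // Supported: the same lookup expressed as a single join issued from the driver.
  def flattened(dRDD: RDD[String], mRDD: RDD[(String, Int)]): RDD[(String, Int)] =
    dRDD.keyBy(x => x).join(mRDD).values

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("nested-rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val dRDD = sc.parallelize(Seq("one", "two", "three"))
      val mRDD = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3))

      // The nested version compiles, but the job is expected to abort at runtime.
      val attempt = Try(nested(dRDD, mRDD).collect())
      println(s"nested version failed: ${attempt.isFailure}")

      // The flattened version runs as an ordinary join.
      flattened(dRDD, mRDD).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The nested version compiles without complaint, which is why a clearer runtime message matters: nothing in the types warns you that `mRDD` cannot be used on an executor.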
Here we are trying to join dRDD and mRDD. If mRDD is large, rdd.join is the recommended approach. Otherwise, if mRDD is small, that is, it fits in the memory of each executor, we can collect it, broadcast it and do a map-side join.
Join
A simple join would look like this:
    val rdd = sc.parallelize(Seq(Array("one","two","three"), Array("four", "five", "six")))
    val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six"->6))
    val flat = rdd.flatMap(_.toSeq).keyBy(x=>x)
    val res = flat.join(map).map{case (k,v) => v}
If we want to use a broadcast, we first need to collect the lookup table locally so that it can be broadcast to all executors. NOTE: the RDD to be broadcast MUST fit in the memory of the driver as well as of each executor.
Map-Side Join with Broadcast
    val rdd = sc.parallelize(Seq(Array("one","two","three"), Array("four", "five", "six")))
    val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six"->6))
    val bcTable = sc.broadcast(map.collectAsMap)
    val res2 = rdd.flatMap{arr => arr.map(elem => (elem, bcTable.value(elem)))}
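One difference between the two approaches is worth noting: the join silently drops elements that have no match in the table, while `bcTable.value(elem)` throws a NoSuchElementException for a missing key. If you want the broadcast version to behave like the join, a variant along the following lines might work. This is a sketch under assumptions: the object name `BroadcastLookupSketch`, the helper `lookupAll` and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical names and data, for illustration only.
object BroadcastLookupSketch {
  // Pairs each element with its value from a small lookup table and drops
  // elements that have no entry, mirroring inner-join semantics.
  def lookupAll(sc: SparkContext,
                rdd: RDD[Array[String]],
                table: RDD[(String, Int)]): RDD[(String, Int)] = {
    // collectAsMap pulls the whole table onto the driver, so it must be small.
    val bcTable = sc.broadcast(table.collectAsMap())
    rdd.flatMap { arr =>
      for {
        elem <- arr.toSeq
        v    <- bcTable.value.get(elem) // None for a missing key, so it is skipped
      } yield (elem, v)
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-lookup-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(Array("one", "two", "seven")))
      val table = sc.parallelize(Seq("one" -> 1, "two" -> 2))
      // "seven" has no entry in the table and is dropped instead of raising an error.
      lookupAll(sc, rdd, table).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Whether to drop missing keys or fail fast depends on the data. Keeping `bcTable.value(elem)` is reasonable when every element is guaranteed to have an entry and a missing one should be treated as a bug.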