Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list.
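For illustration, here is a minimal sketch of the unsupported pattern (the variable names are hypothetical, not from your code): referencing one RDD from inside a transformation on another RDD ships the closure to the workers, where the inner RDD is not usable, which typically surfaces as a NullPointerException.

val outer = sc.parallelize(Seq(1, 2, 3))
val inner = sc.parallelize(Seq(4, 5, 6))
// Broken: `inner` is captured inside outer's map() closure, so it is
// serialized to the workers, where calling RDD operations on it fails.
val broken = outer.map(x => inner.filter(_ > x).count())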
It looks like your current code is trying to group the elements of d by value; you can do this efficiently with the groupBy() RDD method:
scala> val d = sc.parallelize(Seq("Hello", "World", "Hello"))
d: spark.RDD[java.lang.String] = spark.ParallelCollection@55c0c66a

scala> d.groupBy(x => x).collect()
res6: Array[(java.lang.String, Seq[java.lang.String])] = Array((World,ArrayBuffer(World)), (Hello,ArrayBuffer(Hello, Hello)))
Josh Rosen Jan 02 '13 at 22:52