Scalaz type classes for Apache Spark RDD

The goal is to implement the various type classes (e.g. Semigroup, Monad, Functor, etc.) provided by Scalaz for Spark's RDD (distributed collection). Unfortunately, I cannot get any of the type classes that operate on higher-kinded types (e.g. Monad, Functor, etc.) to work with RDD.

RDDs are defined (simplified) as:

 abstract class RDD[T: ClassTag]() {
   def map[U: ClassTag](f: T => U): RDD[U] = { ... }
 }

The full code for RDD can be found here.

Here is one example that works great:

 import scalaz._, Scalaz._
 import org.apache.spark.rdd.RDD

 implicit def semigroupRDD[A] = new Semigroup[RDD[A]] {
   def append(x: RDD[A], y: => RDD[A]) = x.union(y)
 }
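
With that instance in scope, two RDDs can be combined with Scalaz's |+| operator. A minimal usage sketch (not part of the original post), assuming a running SparkContext named sc:

 val xs: RDD[Int] = sc.parallelize(Seq(1, 2, 3))
 val ys: RDD[Int] = sc.parallelize(Seq(4, 5, 6))
 // |+| resolves to semigroupRDD[Int] and delegates to union
 val combined: RDD[Int] = xs |+| ys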

Here is one example that does not work:

 implicit def functorRDD = new Functor[RDD] {
   override def map[A, B](fa: RDD[A])(f: A => B): RDD[B] = {
     fa.map(f)
   }
 }

This fails:

 Error: No ClassTag available for B
     fa.map(f)

The error is pretty clear: map as implemented on RDD expects a ClassTag (see above). Scalaz functors, monads, etc., do not carry ClassTags. Is it possible to make this work without changing Scalaz and/or Spark?

1 answer

Short answer: no

For type classes like Functor, the requirement is that for any A and B, with no constraints, given a function A => B you get the lifted function RDD[A] => RDD[B]. In Spark you cannot pick arbitrary A and B, because you need a ClassTag for B, as you saw.
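
For reference, this is roughly the shape of Scalaz's Functor (simplified sketch; the real trait has more members). map is parametric in both A and B, so there is nowhere to demand a ClassTag[B]:

 // Simplified sketch of the Functor type class
 trait Functor[F[_]] {
   def map[A, B](fa: F[A])(f: A => B): F[B]
 }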

For other type classes, such as Semigroup, where the type does not change within the operation and therefore no ClassTag is needed, it works.
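
Compare with the shape of Semigroup (again a simplified sketch): the only type involved is fixed when the instance is created, so the RDD[A] instance above never has to produce a tag for a new result type:

 // Simplified sketch of the Semigroup type class
 trait Semigroup[A] {
   def append(a1: A, a2: => A): A
 }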

