Performance Impact of RDD to JavaRDD Conversion

Question

I have code similar to this, and I want to work with JavaRDD instead of RDD, so I am doing the conversion here. I would like to know the performance impact of this conversion, especially when dealing with gigabytes of data.

 RDD<String> textFile = sc.textFile(filePath, 2);
 JavaRDD<String> javaRDD = textFile.toJavaRDD();

Is this a wide or narrow transformation? What is the difference between JavaRDD and RDD?

java scala apache-spark rdd

Balaji reddy May 28 '16 at 9:44

1 answer

Tzach zohar · Accepted Answer · 2016-05-28T09:56:18+0000

There is no significant performance penalty: JavaRDD is a simple wrapper around RDD that makes calls from Java code more convenient. It holds the original RDD as a member and delegates every method call to that member, for example (from JavaRDD.scala):

 def cache(): JavaRDD[T] = wrapRDD(rdd.cache())

wrapRDD boils down to something like new JavaRDD[T](rdd), so the only performance cost is the creation of one thin wrapper object per method call. This is completely insignificant, since it does not happen for every element in the RDD, but once for the RDD object as a whole.
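
To make the "thin wrapper" point concrete, here is a minimal sketch in plain Java with no Spark dependency. FakeRDD and FakeJavaRDD are hypothetical stand-ins invented for illustration, not Spark classes. They only mimic the delegation shape described above: the wrapper stores a reference to the underlying object and forwards calls to it, so converting never touches the elements.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the wrapper pattern. FakeRDD and FakeJavaRDD are
// hypothetical stand-ins for illustration only, NOT Spark classes.
class WrapperSketch {

    // Stand-in for the Scala RDD: the object that actually refers to the data.
    static final class FakeRDD<T> {
        private final List<T> elements;
        private boolean cached = false;

        FakeRDD(List<T> elements) { this.elements = elements; }

        // Marks this object as cached and returns the same instance.
        FakeRDD<T> cache() { cached = true; return this; }

        boolean isCached() { return cached; }

        long count() { return elements.size(); }

        // Conversion allocates one small wrapper; no element is read or copied.
        FakeJavaRDD<T> toJavaRDD() { return new FakeJavaRDD<>(this); }
    }

    // Stand-in for JavaRDD: holds only a reference to the wrapped object.
    static final class FakeJavaRDD<T> {
        private final FakeRDD<T> rdd;

        FakeJavaRDD(FakeRDD<T> rdd) { this.rdd = rdd; }

        FakeRDD<T> rdd() { return rdd; }

        // Same shape as the cache() line quoted above: delegate, then re-wrap.
        FakeJavaRDD<T> cache() { return new FakeJavaRDD<>(rdd.cache()); }

        long count() { return rdd.count(); }
    }

    public static void main(String[] args) {
        FakeRDD<String> rdd = new FakeRDD<>(Arrays.asList("a", "b", "c"));
        FakeJavaRDD<String> javaRDD = rdd.toJavaRDD();

        System.out.println(javaRDD.rdd() == rdd);         // true: same underlying object
        System.out.println(javaRDD.cache().rdd() == rdd); // true: new wrapper, same data
        System.out.println(javaRDD.count());              // 3
    }
}
```

Each call on the wrapper allocates at most one small object, whether the underlying data holds three elements or many gigabytes, which is why the conversion cost does not grow with data size.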
