When to use the Spark DataFrame / Dataset API and when to use a simple RDD?

The Spark SQL DataFrame / Dataset execution engine has several extremely efficient time and space optimizations (e.g. InternalRow and expression codegen). According to much of the documentation, it appears to be a better option than RDDs for most distributed algorithms.

However, I did some source-code research and am still not convinced. I have no doubt that InternalRow is much more compact and can save a lot of memory. But executing an algorithm may not be any faster unless it can be expressed with predefined Catalyst expressions. Namely, the source code of org.apache.spark.sql.catalyst.expressions.ScalaUDF indicates that every user-defined function does three things:

  • convert the Catalyst type (used in InternalRow) to the Scala type (used in GenericRow);
  • apply the function;
  • convert the result back from the Scala type to the Catalyst type.

Apparently, this is even slower than just applying the function directly to an RDD without any conversion. Can anyone confirm or refute this speculation with some real-world profiling and code analysis?
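
To make the comparison concrete, here is a minimal, hypothetical sketch (the object name, row count, and local[*] master are illustrative, not taken from the question) of the two code paths being discussed: the same lambda once wrapped as a Spark SQL UDF, where each row goes through the Catalyst-to-Scala round trip described above, and once applied directly on the underlying RDD:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfVsRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-vs-rdd")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A Dataset[Long] backed by InternalRow (Tungsten binary format).
    val ds = spark.range(0L, 1000000L).as[Long]

    // DataFrame/Dataset path: the lambda is wrapped in a ScalaUDF, so for every row
    // Spark converts the Catalyst type to a Scala type, applies the function, and
    // converts the result back to a Catalyst type.
    val square = udf((x: Long) => x * x)
    val viaUdf = ds.select(square($"id")).as[Long]

    // RDD path: the same lambda runs directly on JVM objects, with no per-row
    // Catalyst conversion (but also without columnar storage or codegen).
    val viaRdd = ds.rdd.map(x => x * x)

    println(viaUdf.reduce(_ + _))
    println(viaRdd.reduce(_ + _))

    spark.stop()
  }
}
```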

Thank you so much for any suggestion or insight.

1 answer

From the Databricks blog article A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets:

When to use RDDs?

Consider these scenarios or common use cases for using RDDs when:

  • you want low-level transformations, actions, and control over your dataset;
  • your data is unstructured, such as media streams or streams of text;
  • you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
  • you don't care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column;
  • you can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
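
For illustration, here is a minimal sketch of the kind of workload the blog has in mind (the file path and object name are placeholders): unstructured text with no schema to impose, processed with low-level functional transformations where the RDD API is a natural fit:

```scala
import org.apache.spark.sql.SparkSession

object RddWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-word-count")
      .master("local[*]")
      .getOrCreate()

    // "data/logs.txt" is a placeholder path; the data has no schema to impose.
    val lines = spark.sparkContext.textFile("data/logs.txt")

    val counts = lines
      .flatMap(_.split("\\s+"))   // low-level, purely functional transformation
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // fine-grained control over the aggregation

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```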

Chapter 3 of High Performance Spark (DataFrames, Datasets, and Spark SQL) likewise shows the DataFrame/Dataset API outperforming plain RDDs:

[image: performance comparison chart from High Performance Spark, Chapter 3]

A Databricks benchmark likewise shows DataFrames outperforming RDDs:

[image: Databricks benchmark chart comparing DataFrame and RDD performance]
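
One practical note, relevant to the question's ScalaUDF concern: the charted DataFrame speedups generally assume the computation is expressed with built-in Catalyst expressions rather than opaque UDFs. Here is a hedged sketch (names and sizes are illustrative) of the same projection written both ways:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfVsExpression {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-vs-expression")
      .master("local[*]")
      .getOrCreate()

    val df = spark.range(0L, 1000000L).toDF("id")

    // Opaque to the optimizer: wrapped in a ScalaUDF, with per-row
    // Catalyst <-> Scala conversions around the call.
    val addOneUdf = udf((x: Long) => x + 1)
    val withUdf = df.select(addOneUdf(col("id")).as("plus_one"))

    // Visible to the optimizer: an ordinary Catalyst expression that
    // Catalyst can analyze and compile with whole-stage codegen.
    val withExpr = df.select((col("id") + 1).as("plus_one"))

    withUdf.explain()   // the UDF appears as an opaque call Catalyst cannot look into
    withExpr.explain()  // a simple projection the optimizer fully understands

    spark.stop()
  }
}
```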

