When to use the Spark DataFrame / Dataset API and when to use a simple RDD?

The Spark SQL DataFrame / Dataset execution engine has several extremely efficient time and space optimizations (e.g. InternalRow and expression codegen). According to much of the documentation, it appears to be a better option than RDDs for most distributed algorithms.

However, I did some source-code research and am still not convinced. I have no doubt that InternalRow is much more compact and can save a lot of memory. But executing an algorithm may not be any faster unless it can be expressed with predefined Catalyst expressions. Namely, the source code of org.apache.spark.sql.catalyst.expressions.ScalaUDF indicates that every user-defined function does three things:

  • convert the Catalyst type (used in InternalRow) to the Scala type (used in GenericRow);
  • apply the function;
  • convert the result back from the Scala type to the Catalyst type.

Apparently, this is even slower than just applying the function directly to an RDD without any conversion. Can anyone confirm or refute this speculation with some real-world profiling and code analysis?
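
To make the comparison concrete, here is a minimal, hypothetical sketch (the object name, row count, and local[*] master are illustrative, not taken from the question) of the two code paths being discussed: the same lambda once wrapped as a Spark SQL UDF, where each row goes through the Catalyst-to-Scala round trip described above, and once applied directly on the underlying RDD:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfVsRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-vs-rdd")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A Dataset[Long] backed by InternalRow (Tungsten binary format).
    val ds = spark.range(0L, 1000000L).as[Long]

    // DataFrame/Dataset path: the lambda is wrapped in a ScalaUDF, so for every row
    // Spark converts the Catalyst type to a Scala type, applies the function, and
    // converts the result back to a Catalyst type.
    val square = udf((x: Long) => x * x)
    val viaUdf = ds.select(square($"id")).as[Long]

    // RDD path: the same lambda runs directly on JVM objects, with no per-row
    // Catalyst conversion (but also without columnar storage or codegen).
    val viaRdd = ds.rdd.map(x => x * x)

    println(viaUdf.reduce(_ + _))
    println(viaRdd.reduce(_ + _))

    spark.stop()
  }
}
```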

Thank you so much for any suggestion or insight.

1 answer

From the Databricks blog article A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets:

When to use RDDs?

Consider these scenarios or common use cases for using RDDs when:

  • you want low-level transformations, actions, and control over your dataset;
  • your data is unstructured, such as media streams or streams of text;
  • you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
  • you don't care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column;
  • you can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
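
For illustration, here is a minimal sketch of the kind of workload the blog has in mind (the file path and object name are placeholders): unstructured text with no schema to impose, processed with low-level functional transformations where the RDD API is a natural fit:

```scala
import org.apache.spark.sql.SparkSession

object RddWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-word-count")
      .master("local[*]")
      .getOrCreate()

    // "data/logs.txt" is a placeholder path; the data has no schema to impose.
    val lines = spark.sparkContext.textFile("data/logs.txt")

    val counts = lines
      .flatMap(_.split("\\s+"))   // low-level, purely functional transformation
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // fine-grained control over the aggregation

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```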

Chapter 3 of High Performance Spark (DataFrames, Datasets, and Spark SQL) likewise shows the DataFrame/Dataset API outperforming plain RDDs:

[image: performance comparison chart from High Performance Spark, Chapter 3]

A Databricks benchmark likewise shows DataFrames outperforming RDDs:

[image: Databricks benchmark chart comparing DataFrame and RDD performance]
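
One practical note, relevant to the question's ScalaUDF concern: the charted DataFrame speedups generally assume the computation is expressed with built-in Catalyst expressions rather than opaque UDFs. Here is a hedged sketch (names and sizes are illustrative) of the same projection written both ways:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfVsExpression {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-vs-expression")
      .master("local[*]")
      .getOrCreate()

    val df = spark.range(0L, 1000000L).toDF("id")

    // Opaque to the optimizer: wrapped in a ScalaUDF, with per-row
    // Catalyst <-> Scala conversions around the call.
    val addOneUdf = udf((x: Long) => x + 1)
    val withUdf = df.select(addOneUdf(col("id")).as("plus_one"))

    // Visible to the optimizer: an ordinary Catalyst expression that
    // Catalyst can analyze and compile with whole-stage codegen.
    val withExpr = df.select((col("id") + 1).as("plus_one"))

    withUdf.explain()   // the UDF appears as an opaque call Catalyst cannot look into
    withExpr.explain()  // a simple projection the optimizer fully understands

    spark.stop()
  }
}
```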

