I have been measuring the performance of various methods for loading Parquet files into Spark, and the differences are striking.
In our Parquet files, we have nested case classes like:
case class C(/* a dozen attributes */)
case class B(/* a dozen attributes */, cs: Seq[C])
case class A(/* a dozen attributes */, bs: Seq[B])
It takes quite a while to load them from the Parquet files, so I benchmarked various ways of loading these case classes from Parquet and summing a field, using Spark 1.6 and 2.0.
Here is a brief description of the test I did:
import scala.util.Try
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum
import play.api.libs.json.Json
import sqlContext.implicits._

val df: DataFrame = sqlContext.read.parquet("path/to/file.gz.parquet").persist()
df.count()

// Spark 1.6

// Play Json
// 63.169s
df.toJSON.flatMap(s => Try(Json.parse(s).as[A]).toOption)
  .map(_.fieldToSum).sum()

// Direct access to the field through Spark Row
// 2.811s
df.map(row => row.getAs[Long]("fieldToSum")).sum()

// A small library we developed that accesses fields through Spark Row
// 10.401s
df.toRDD[A].map(_.fieldToSum).sum()

// DataFrame hybrid SQL API
// 0.239s
df.agg(sum("fieldToSum")).collect().head.getAs[Long](0)

// Dataset with RDD-style code
// 34.223s
df.as[A].map(_.fieldToSum).reduce(_ + _)

// Dataset with column selection
// 0.176s
df.as[A].select($"fieldToSum".as[Long]).reduce(_ + _)

// Spark 2.0
// Performance is similar, except for:

// Direct access to the field through Spark Row
// 23.168s
df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)

// A small library we developed that accesses fields through Spark Row
// 32.898s
df.toRDD[A].map(_.fieldToSum).sum()
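For anyone reproducing this, the plans of the two Dataset variants can be compared with explain(); this is just a sketch using the same df and case class A as above (Spark 2.0), not part of the timings:

// Sketch: compare the physical plans of the slow and the fast Dataset variant.
// The map-based plan contains DeserializeToObject / SerializeFromObject steps,
// i.e. every row is materialized as a full A instance, while the column
// selection stays inside the optimized plan.
df.as[A].map(_.fieldToSum).explain()
df.as[A].select($"fieldToSum".as[Long]).explain()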
I understand why the performance of the methods that go through Spark Row degrades when upgrading to Spark 2.0, since DataFrame is now simply an alias for Dataset[Row]. I assume that is the cost of unifying the interfaces.
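For reference, in the Spark 2.0 sources this is literally a type alias in the org.apache.spark.sql package object:

// Spark 2.0, org.apache.spark.sql package object
type DataFrame = Dataset[Row]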
On the other hand, I am quite disappointed that the promise of Dataset is not kept: performance with RDD-style coding (map and flatMap) is worse than when the Dataset is used like a DataFrame with the SQL-like DSL.
Basically, in order to have good performance, we need to give up type safety.
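To make the trade-off concrete (same df and case class A as above; fieldToSumm is a deliberate typo, not a real field):

// Typed, RDD-style: the field access is checked by the Scala compiler.
df.as[A].map(_.fieldToSum).reduce(_ + _)                // ~34s; a typo would not compile

// Fast variants: the column is named by a string, so a typo only fails at
// runtime, when Spark analyzes the plan.
df.as[A].select($"fieldToSum".as[Long]).reduce(_ + _)   // ~0.2s
// df.as[A].select($"fieldToSumm".as[Long])             // compiles, fails at runtime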
What is the reason for the difference between a Dataset used as an RDD and a Dataset used as a DataFrame?
Is there a way to improve the performance of RDD-style coding on Dataset so that it matches SQL-style coding? For data engineering, RDD-style coding is much cleaner.
In addition, to work with the SQL-like DSL we would need to flatten our data model and not use nested case classes. Am I right that good performance is only achievable with flat data models?
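For clarity, by flattening I mean something like the sketch below (the leaf field names a1, b1 and c1 are placeholders, not our real schema; explode comes from the DataFrame DSL):

import org.apache.spark.sql.functions.explode

// Hypothetical flattening of the nested A -> bs -> cs structure so that the
// SQL-like DSL can address leaf fields directly (placeholder column names).
val flat = df
  .select($"a1", explode($"bs").as("b"))
  .select($"a1", $"b.b1".as("b1"), explode($"b.cs").as("c"))
  .select($"a1", $"b1", $"c.c1".as("c1"))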