Is this a regression bug in Spark 1.3?

Without any deprecation warnings in Spark SQL 1.2.1, the following code stops working in version 1.3.

Works in version 1.2.1 (with no deprecation warnings):

  val sqlContext = new HiveContext(sc)
  import sqlContext._

  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
  jsonRDD.registerTempTable("jsonTable")

  val jsonResult = sql(s"select * from jsonTable")
  val foo = jsonResult.zipWithUniqueId().map {
    case (Row(...), uniqueId) => // do something useful
      ...
  }

  foo.registerTempTable("...")

Stops working in 1.3.0 (it simply doesn't compile, and all I changed was switching to 1.3):

 jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method 

A workaround that does not work:

Although this does give me an RDD[Row]:

 jsonResult.rdd.zipWithUniqueId() 

this does not work, since RDD[Row] of course does not have a registerTempTable method:

  foo.registerTempTable("...") 

Here are my questions:

  • Is there a workaround? (for example, am I just doing it wrong?)
  • Is this a bug? (I think anything that worked in the previous version and stops compiling without a @deprecated warning is clearly a regression bug)
1 answer

This is not a bug, but sorry for the confusion! Prior to Spark 1.3, Spark SQL was marked as an alpha component because the APIs were still in flux. With Spark 1.3 we finalized and stabilized the API. A full description of what you need to do when migrating can be found in the documentation.

I can also answer your specific questions and provide some justification for why we made these changes.

Stops working in 1.3.0 (it simply doesn't compile, and all I changed was switching to 1.3): jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method

DataFrames are now a unified interface across both Scala and Java. However, since we must maintain compatibility with the existing RDD API for the rest of the 1.X line, DataFrames are not RDDs. To get at the RDD representation, you can call df.rdd or df.javaRDD.
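For illustration, a minimal sketch of dropping down from the DataFrame to its RDD view (the variable names here are placeholders; jsonFilePath is the same path as in the question):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Row

  // df is any DataFrame, e.g. one loaded from JSON as in the question
  val df = sqlContext.jsonFile(jsonFilePath)

  // switch to the RDD view to use RDD-only operations again
  val rows: RDD[Row] = df.rdd
  val withIds = rows.zipWithUniqueId()   // RDD[(Row, Long)]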

Also, since we were worried about some of the confusion that can occur with implicit conversions, we made it so that you must explicitly call rdd.toDF to invoke the conversion from an RDD. However, this conversion is only available if your RDD contains objects that inherit from Product (for example, tuples or case classes).
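A minimal sketch of that rule, using a made-up case class (the implicits import is what brings toDF into scope in 1.3):

  import sqlContext.implicits._

  // Person extends Product, so the explicit toDF conversion is available
  case class Person(name: String, zip: Int)

  val peopleDF = sc.parallelize(Seq(Person("Michael", 94709))).toDF()
  peopleDF.registerTempTable("people")

  // an RDD[Row], by contrast, has no toDF; use createDataFrame with an explicit schema instead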

Getting back to the original question: if you want to do transformations on rows with an arbitrary schema, you need to explicitly tell Spark SQL about the structure of the data after your map operation (since the compiler cannot infer it).

  import org.apache.spark.sql.types._

  val jsonData = sqlContext.jsonRDD(
    sc.parallelize("""{"name": "Michael", "zip": 94709}""" :: Nil))

  // zipWithUniqueId produces Long ids, so the new column needs LongType
  val newSchema = StructType(
    StructField("uniqueId", LongType) +: jsonData.schema.fields)

  val augmentedRows = jsonData.rdd.zipWithUniqueId.map {
    case (row, id) => Row.fromSeq(id +: row.toSeq)
  }

  val newDF = sqlContext.createDataFrame(augmentedRows, newSchema)
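From there the original pattern works again, since newDF is an ordinary DataFrame (the table name below is just a placeholder):

  newDF.registerTempTable("augmentedJson")
  sqlContext.sql("select uniqueId, name, zip from augmentedJson").show()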

Source: https://habr.com/ru/post/1216026/

