The problem is that Any is too generic: with null mixed into an Int column, Scala infers the tuple element type as Any, and Spark doesn't know how to derive a schema for it. You need to pin down the type explicitly, in your case Integer. Since null cannot be assigned to a Scala primitive like Int, use java.lang.Integer instead.
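To see the failure, here is a minimal sketch (the variable names are illustrative, and the exact error text may vary by Spark version):

val bad = sc.parallelize(Seq((1, null, 2, "F"), (2, 2, 4, "F")))
// bad: org.apache.spark.rdd.RDD[(Int, Any, Int, String)]
// toDF cannot derive a schema for Any and typically throws
// java.lang.UnsupportedOperationException: Schema for type Any is not supported
val badDf = bad.toDF("ACCT_ID", "M_CD", "C_CD", "IND")

With java.lang.Integer the column type is fixed up front. So try the following instead: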
val rdd = sc.parallelize(Seq(
  (1, null.asInstanceOf[Integer], 2, "F"),
  (2, Integer.valueOf(2), 4, "F"),
  (3, Integer.valueOf(3), 6, "N"),
  (4, null.asInstanceOf[Integer], 8, "F")))
Here is the result:
rdd: org.apache.spark.rdd.RDD[(Int, Integer, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24
And the corresponding DataFrame:
scala> val df = rdd.toDF("ACCT_ID", "M_CD", "C_CD", "IND")
df: org.apache.spark.sql.DataFrame = [ACCT_ID: int, M_CD: int ... 2 more fields]

scala> df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
|      1|null|   2|  F|
|      2|   2|   4|  F|
|      3|   3|   6|  N|
|      4|null|   8|  F|
+-------+----+----+---+
You might also consider a cleaner way to declare a null integer value, for example:
object Constants {
  val NullInteger: java.lang.Integer = null
}
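With that constant in place, the parallelize call from above reads more clearly:

val rdd = sc.parallelize(Seq(
  (1, Constants.NullInteger, 2, "F"),
  (2, Integer.valueOf(2), 4, "F"),
  (3, Integer.valueOf(3), 6, "N"),
  (4, Constants.NullInteger, 8, "F")))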