Create a nullable DataFrame for multiple columns

Question

Create a nullable DataFrame for multiple columns

I am trying to create a DataFrame using RDD .

First I create an RDD using the code below -

 val account = sc.parallelize(Seq( (1, null, 2,"F"), (2, 2, 4, "F"), (3, 3, 6, "N"), (4,null,8,"F")))

It works great -

: org.apache.spark.rdd.RDD [(Int, Any, Int, String)] = ParallelCollectionRDD [0] when parallelized at the address: 27

but when trying to create a DataFrame from RDD using below code

 account.toDF("ACCT_ID", "M_CD", "C_CD","IND")

I get below the error

java.lang.UnsupportedOperationException: schema for type Any is not supported

I analyzed that whenever I add a null value to Seq , then only I got an error.

Is it possible to add a null value?

+7

scala apache-spark apache-spark-dataset spark-dataframe

Avijit Sep 13 '16 at 7:24

source share

2 answers

Alternative way without using RDD:

 import spark.implicits._ val df = spark.createDataFrame(Seq( (1, None, 2, "F"), (2, Some(2), 4, "F"), (3, Some(3), 6, "N"), (4, None, 8, "F") )).toDF("ACCT_ID", "M_CD", "C_CD","IND") df.show +-------+----+----+---+ |ACCT_ID|M_CD|C_CD|IND| +-------+----+----+---+ | 1|null| 2| F| | 2| 2| 4| F| | 3| 3| 6| N| | 4|null| 8| F| +-------+----+----+---+ df.printSchema root |-- ACCT_ID: integer (nullable = false) |-- M_CD: integer (nullable = true) |-- C_CD: integer (nullable = false) |-- IND: string (nullable = true)

+3

Gevorg Jun 13 '17 at 16:09

source share

Zyoma · Accepted Answer · 2016-09-13T07:49:40+0000

The problem is that Any is too generic and Spark just doesn't know how to serialize it. You must explicitly specify a specific type, in your case, Integer . Since null cannot be assigned to primitive types in Scala, you can use java.lang.Integer . So try the following:

 val account = sc.parallelize(Seq( (1, null.asInstanceOf[Integer], 2,"F"), (2, new Integer(2), 4, "F"), (3, new Integer(3), 6, "N"), (4, null.asInstanceOf[Integer],8,"F")))

Here is the result:

 rdd: org.apache.spark.rdd.RDD[(Int, Integer, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24

And the corresponding DataFrame:

 scala> val df = rdd.toDF("ACCT_ID", "M_CD", "C_CD","IND") df: org.apache.spark.sql.DataFrame = [ACCT_ID: int, M_CD: int ... 2 more fields] scala> df.show +-------+----+----+---+ |ACCT_ID|M_CD|C_CD|IND| +-------+----+----+---+ | 1|null| 2| F| | 2| 2| 4| F| | 3| 3| 6| N| | 4|null| 8| F| +-------+----+----+---+

You might also consider a cleaner way to declare a null integer value, for example:

 object Constants { val NullInteger: java.lang.Integer = null }

Create a nullable DataFrame for multiple columns

More articles: