I need to extract a table from Teradata (read-only access) to Parquet using Scala (2.11) / Spark (2.1.0). I am creating a DataFrame, which loads successfully:
val df = spark.read.format("jdbc").options(options).load()
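For reference, the options map is along these lines (the host, database, table and credentials here are placeholders, not my real values):

// Placeholder Teradata JDBC options -- connection details are illustrative only
val options = Map(
  "url"      -> "jdbc:teradata://<host>/DATABASE=<database>",
  "driver"   -> "com.teradata.jdbc.TeraDriver",
  "dbtable"  -> "<table>",
  "user"     -> "<user>",
  "password" -> "<password>"
)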
But df.show gives me a NullPointerException:
java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
I did a df.printSchema and found out that the reason for this NPE is that the dataset contains null values in columns declared (nullable = false) (it looks like Teradata is giving Spark the wrong metadata). Indeed, df.show succeeds if I omit the problematic columns.
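As a quick check, this is roughly how the problematic columns can be listed:

// List the columns whose JDBC metadata claims NOT NULL -- these are the ones
// that trigger the NPE when they actually contain nulls
val nonNullableCols = df.schema.fields.filter(f => !f.nullable).map(_.name)
nonNullableCols.foreach(println)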
So, I tried to specify a new schema with all columns set to (nullable = true):
import org.apache.spark.sql.types.{StructField, StructType}

val new_schema = StructType(df.schema.map { case StructField(n, d, nu, m) => StructField(n, d, true, m) })
val new_df = spark.read.format("jdbc").schema(new_schema).options(options).load()
But then I got:
org.apache.spark.sql.AnalysisException: JDBC does not allow user-specified schemas.;
I also tried to create a new DataFrame from the previous one, specifying the desired schema:
val new_df = df.sqlContext.createDataFrame(df.rdd, new_schema)
But I still get the NPE as soon as an action is performed on the data (presumably because df.rdd still reads lazily through the original JDBC relation with the non-nullable schema).
Any idea on how I can fix this?
scala dataframe teradata apache-spark apache-spark-sql
Raphdg