Spark SQL HiveContext - saveAsTable creates invalid schema

Question


I am trying to save a DataFrame to a persistent Hive table in Spark 1.3.0 (PySpark). This is my code:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="HiveTest")
    hc = HiveContext(sc)
    peopleRDD = sc.parallelize(['{"name":"Yin","age":30}'])
    peopleDF = hc.jsonRDD(peopleRDD)
    peopleDF.printSchema()
    # root
    #  |-- age: long (nullable = true)
    #  |-- name: string (nullable = true)
    peopleDF.saveAsTable("peopleHive")

The Hive output table I expect:

    Column  Data Type  Comments
    age     long       from deserializer
    name    string     from deserializer

But the actual Hive table produced by the code above is:

    Column  Data Type      Comments
    col     array<string>  from deserializer

Why doesn't the Hive table have the same layout as the DataFrame? How can I achieve the expected result?


hive apache-spark apache-spark-sql

Mirko May 14, '15 at 9:54


1 answer

user3931226 · Accepted Answer · 2015-05-15T05:41:25+0000

The schema is not wrong. Hive cannot correctly read the table created by Spark, because it does not yet have the right Parquet SerDe. If you run sqlCtx.sql('desc peopleHive').show() from Spark (with the question's code, the context is hc), it should show the correct schema. Alternatively, use the spark-sql client instead of the hive client. You can also use the CREATE TABLE syntax to create external tables, which works just like in Hive, but Spark has much better Parquet support.
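The two suggestions above can be sketched roughly as follows. This is only an illustration, not a tested recipe for Spark 1.3.0: the helper names (hive_ddl, describe_from_spark, save_through_hive_ddl), the temporary table name people_tmp, the location /tmp/peopleHive, and the mapping of the DataFrame's long type to Hive's bigint are all assumptions, and it presumes the Hive version in use accepts STORED AS PARQUET.

```python
def hive_ddl(table, columns, location):
    """Build a Hive-style CREATE EXTERNAL TABLE statement (hypothetical helper).

    columns is a list of (name, hive_type) pairs; the types are supplied by
    the caller, e.g. "bigint" for a DataFrame column of type long (assumption).
    """
    cols = ", ".join("`{0}` {1}".format(name, hive_type)
                     for name, hive_type in columns)
    return ("CREATE EXTERNAL TABLE IF NOT EXISTS {0} ({1}) "
            "STORED AS PARQUET LOCATION '{2}'".format(table, cols, location))


def describe_from_spark(hc, table):
    """Ask Spark, not the hive client, for the schema of the table."""
    return hc.sql("desc {0}".format(table)).collect()


def save_through_hive_ddl(hc, df, table, columns, location,
                          tmp_name="people_tmp"):
    """Create the table with explicit DDL, then fill it from the DataFrame.

    hc is a HiveContext and df a DataFrame, as in the question; tmp_name is
    a hypothetical temporary table name used only for the INSERT.
    """
    hc.sql(hive_ddl(table, columns, location))
    df.registerTempTable(tmp_name)
    select_list = ", ".join("`{0}`".format(name) for name, _ in columns)
    hc.sql("INSERT INTO TABLE {0} SELECT {1} FROM {2}".format(
        table, select_list, tmp_name))
```

With the question's objects this might be called as save_through_hive_ddl(hc, peopleDF, "peopleHive", [("age", "bigint"), ("name", "string")], "/tmp/peopleHive"). Because the column list is declared in the DDL rather than inferred by saveAsTable, the hive client should then see age and name instead of a single col array<string> column.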
