Spark SQL HiveContext - saveAsTable creates invalid schema

I am trying to save a DataFrame as a persistent Hive table in Spark 1.3.0 (PySpark). This is my code:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="HiveTest")
    hc = HiveContext(sc)
    peopleRDD = sc.parallelize(['{"name":"Yin","age":30}'])
    peopleDF = hc.jsonRDD(peopleRDD)
    peopleDF.printSchema()
    # root
    #  |-- age: long (nullable = true)
    #  |-- name: string (nullable = true)
    peopleDF.saveAsTable("peopleHive")

The expected Hive table is:

    Column  Data Type  Comments
    age     long       from deserializer
    name    string     from deserializer

But the actual Hive table produced by the code above is:

    Column  Data Type      Comments
    col     array<string>  from deserializer

Why doesn't the Hive table have the same layout as the DataFrame? How can I achieve the expected result?

hive apache-spark apache-spark-sql
1 answer

The schema is not wrong. Hive cannot correctly read the table created by Spark, because Hive does not yet have the right Parquet SerDe. If you run sqlCtx.sql('desc peopleHive').show() (where sqlCtx is your HiveContext, hc in the question), it should show the correct schema. Alternatively, you can use the spark-sql client instead of the hive client. You can also use the CREATE TABLE syntax to create external tables, which works just like in Hive, but Spark has much better Parquet support.
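As a rough sketch of the CREATE TABLE route, the snippet below builds a Hive DDL statement from the column names and types that printSchema() reported. The helper build_external_table_ddl, the type map and the path /tmp/peopleHive are hypothetical names chosen for illustration, not part of Spark or Hive, and the commented calls at the end assume a live HiveContext and were not run here.

```python
# Map Spark SQL type names (as printed by printSchema) to Hive DDL types.
# Only a few common types are covered; extend as needed.
SPARK_TO_HIVE = {
    "long": "BIGINT",
    "integer": "INT",
    "string": "STRING",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
}


def build_external_table_ddl(table, columns, location):
    """Return a CREATE EXTERNAL TABLE statement for a Parquet-backed table.

    columns is a list of (name, spark_type) pairs; location is a
    storage path chosen by the caller (hypothetical in this example).
    """
    cols = ", ".join(
        "`{0}` {1}".format(name, SPARK_TO_HIVE[spark_type])
        for name, spark_type in columns
    )
    return (
        "CREATE EXTERNAL TABLE {0} ({1}) "
        "STORED AS PARQUET LOCATION '{2}'"
    ).format(table, cols, location)


ddl = build_external_table_ddl(
    "peopleHive",
    [("age", "long"), ("name", "string")],
    "/tmp/peopleHive",  # hypothetical path, replace with your own
)
print(ddl)

# Assumed usage with hc and peopleDF from the question (not run here):
#   hc.sql(ddl)
#   peopleDF.insertInto("peopleHive")
```

Declaring the columns yourself means the metastore holds an explicit age/name schema rather than whatever saveAsTable registers. Note that insertInto matches columns by position, so the column order in the DDL should follow the DataFrame's.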
