I hit this problem last week and was able to find a workaround.
Here is the story: I can see the table in Hive if I create it without partitionBy:
spark-shell> someDF.write.mode(SaveMode.Overwrite)
               .format("parquet")
               .saveAsTable("TBL_HIVE_IS_HAPPY")

hive> desc TBL_HIVE_IS_HAPPY;
OK
user_id                 string
email                   string
ts                      string
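(In case it helps to reproduce: someDF is just any DataFrame with those three columns. A minimal sketch with made-up rows, assuming a Spark 1.6 shell where sc and sqlContext are already defined:)

spark-shell> import org.apache.spark.sql.SaveMode
spark-shell> import sqlContext.implicits._
spark-shell> // hypothetical sample rows, only to make the example self-contained
spark-shell> val someDF = sc.parallelize(Seq(
               ("u1", "u1@example.com", "20160101"),
               ("u2", "u2@example.com", "20160102")
             )).toDF("user_id", "email", "ts")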
But Hive cannot read the table schema (it comes up empty...) if I do this:
spark-shell> someDF.write.mode(SaveMode.Overwrite)
               .format("parquet")
               .partitionBy("ts")
               .saveAsTable("TBL_HIVE_IS_NOT_HAPPY")

hive> desc TBL_HIVE_IS_NOT_HAPPY;
# col_name              data_type               from deserializer
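(Note that Spark itself can still read TBL_HIVE_IS_NOT_HAPPY without any trouble, since it keeps the real schema in its own table properties on the metastore entry; it is only Hive that sees nothing. A quick way to confirm that from the shell:)

spark-shell> sqlContext.table("TBL_HIVE_IS_NOT_HAPPY").printSchema()
             // prints user_id, email and ts as expected, even though Hive shows no usable columns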
[Solution]:
spark-shell> sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark-shell> someDF.write
               .partitionBy("ts")
               .mode(SaveMode.Overwrite)
               .saveAsTable("Happy_HIVE")   // suppose this table is saved at /apps/hive/warehouse/Happy_HIVE

hive> DROP TABLE IF EXISTS Happy_HIVE;
hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id STRING, email STRING)
      PARTITIONED BY (ts STRING)
      STORED AS PARQUET
      LOCATION '/apps/hive/warehouse/Happy_HIVE';
hive> MSCK REPAIR TABLE Happy_HIVE;
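(Once MSCK REPAIR has run, it is worth checking that Hive really picked up the partitions and the data; something like the following, where '20160101' is just a placeholder partition value:)

hive> SHOW PARTITIONS Happy_HIVE;
hive> SELECT user_id, email FROM Happy_HIVE WHERE ts='20160101' LIMIT 10;   -- placeholder partition value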
The problem is that the datasource table created through the DataFrame API (partitionBy + saveAsTable) is not compatible with Hive (see link). By setting spark.sql.hive.convertMetastoreParquet to false, as suggested in the doc, Spark only puts the data on HDFS but does not create the table in Hive. You can then go into the Hive shell manually and create an external table with the proper schema and partition definition, pointing at the data location. I tested this on Spark 1.6.1 and it worked for me. Hope this helps!
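P.S. If you are curious why MSCK REPAIR TABLE can find the partitions at all, look at the directory layout Spark produced under the table location (path taken from the example above; the actual partition values depend on your data):

$ hadoop fs -ls /apps/hive/warehouse/Happy_HIVE
# expect one subdirectory per partition value, e.g. .../Happy_HIVE/ts=20160101/
# MSCK REPAIR TABLE scans this ts=<value> layout and registers each directory as a partition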