Saving a Spark DataFrame to Hive: the table cannot be read because "parquet ... not a SequenceFile"

I would like to save data in a Spark DataFrame (v 1.3.0) to a Hive table using PySpark.

The documentation says:

"spark.sql.hive.convertMetastoreParquet: If set to false, Spark SQL will use Hive SerDe for parquet tables instead of native support.

Looking at the Spark tutorial, it seems that this property can be set:

    from pyspark.sql import HiveContext

    sqlContext = HiveContext(sc)
    sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")

    # code to create dataframe

    my_dataframe.saveAsTable("my_dataframe")

However, when I query the saved table in Hive, it returns:

    hive> select * from my_dataframe;
    OK
    Failed with exception java.io.IOException:java.io.IOException:
    hdfs://hadoop01.woolford.io:8020/user/hive/warehouse/my_dataframe/part-r-00001.parquet
    not a SequenceFile

How can I save the table so that it is immediately readable in Hive?

hive apache-spark pyspark apache-spark-sql
2 answers

I was there...
The API is misleading. DataFrame.saveAsTable does not create a Hive table, but an internal Spark table source.
It also stores something in the Hive metastore, but not what you intend.
This remark was made on the Spark user mailing list regarding Spark 1.3.

If you want to create a Hive table from Spark, you can use this approach:
1. Use CREATE TABLE ... via SparkSQL for the Hive metastore.
2. Use DataFrame.insertInto(tableName, overwriteMode) for the actual data (Spark 1.3); see the sketch below.
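
A minimal PySpark sketch of that two-step approach, assuming a HiveContext is available; the table and column names are only illustrative:

    from pyspark.sql import HiveContext

    sqlContext = HiveContext(sc)

    # 1. Create the table in the Hive metastore via SparkSQL (plain Hive DDL)
    sqlContext.sql("""
        CREATE TABLE IF NOT EXISTS my_hive_table (user_id STRING, email STRING, ts STRING)
        STORED AS PARQUET
    """)

    # 2. Write the DataFrame rows into that table (Spark 1.3 API)
    my_dataframe.insertInto("my_hive_table", overwrite=True)

Because the table definition now lives in the Hive metastore with a Parquet SerDe, Hive can read the data directly.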


I hit this problem last week and was able to find a workaround.

Here is the story: I can see the table in Hive if I create the table without partitionBy:

    spark-shell> someDF.write.mode(SaveMode.Overwrite)
                       .format("parquet")
                       .saveAsTable("TBL_HIVE_IS_HAPPY")

    hive> desc TBL_HIVE_IS_HAPPY;
    OK
    user_id    string
    email      string
    ts         string

But Hive cannot understand the table schema (the schema is empty ...) if I do this:

    spark-shell> someDF.write.mode(SaveMode.Overwrite)
                       .format("parquet")
                       .partitionBy("ts")
                       .saveAsTable("TBL_HIVE_IS_NOT_HAPPY")

    hive> desc TBL_HIVE_IS_NOT_HAPPY;
    # col_name    data_type    from_deserializer

[Solution]:

    spark-shell> sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
    spark-shell> df.write
                   .partitionBy("day")   // df is assumed to have a "day" column used as the partition key
                   .mode(SaveMode.Overwrite)
                   .saveAsTable("Happy_HIVE")   // Suppose this table is saved at /apps/hive/warehouse/Happy_HIVE

    hive> DROP TABLE IF EXISTS Happy_HIVE;
    hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id string, email string, ts string)
          PARTITIONED BY (day STRING)
          STORED AS PARQUET
          LOCATION '/apps/hive/warehouse/Happy_HIVE';
    hive> MSCK REPAIR TABLE Happy_HIVE;

The problem is that the datasource table created through the DataFrame API (partitionBy + saveAsTable) is not compatible with Hive (see link). By setting spark.sql.hive.convertMetastoreParquet to false, as suggested in the docs, Spark only puts the data onto HDFS but does not create the table in Hive. You can then go into the Hive shell manually and create an external table with the correct schema and partition definition, pointing it at the data location. I tested this in Spark 1.6.1 and it worked for me. Hope this helps!
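
Since the question uses PySpark, here is a rough PySpark equivalent of the Spark side of the same workaround (sketched against Spark 1.6; the table, column and path names simply mirror the Scala example above and are illustrative):

    from pyspark.sql import HiveContext

    sqlContext = HiveContext(sc)
    sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

    # df is assumed to have columns user_id, email, ts plus a "day" column used as the partition key
    df.write \
      .partitionBy("day") \
      .mode("overwrite") \
      .saveAsTable("Happy_HIVE")   # data lands under the Hive warehouse directory for this table

The Hive side is unchanged: drop any existing table, run CREATE EXTERNAL TABLE ... PARTITIONED BY (day STRING) STORED AS PARQUET LOCATION '...', then MSCK REPAIR TABLE to register the partitions.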

