Save Spark SchemaRDD into the Hive data warehouse

We have a lot of JSON logs, and we want to build our Hive data warehouse. It's easy to load the JSON logs into a Spark SchemaRDD, and SchemaRDD has a saveAsTable method, but it only works for SchemaRDDs created from a HiveContext, not from a regular SQLContext. saveAsTable throws an exception when I call it on a SchemaRDD created from a JSON file. Is there a way to "bind" it to a HiveContext and store it in Hive? I don't see any obvious reason why that shouldn't be possible. I know there are options like saveAsParquetFile for persisting the data, but we really want to use Hive.
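For reference, roughly what fails (a sketch against the Spark 1.x API; sc is an existing SparkContext and the HDFS path is a placeholder):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // plain SQLContext, no Hive metastore
val logs = sqlContext.jsonFile("hdfs:///data/logs.json")
logs.saveAsTable("json_logs") // throws, since this SchemaRDD did not come from a HiveContext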

+4
2 answers

So you have the data in a SchemaRDD? You can register the JSON RDD as a table in the Hive context using

hc.registerRDDAsTable(rdd, "myjsontable")

"myjsontable" now exists only in the context of the hive, data has not yet been saved. then you can do something like

hc.sql("CREATE TABLE myhivejsontable AS SELECT * FROM myjsontable")

which will actually create the table in Hive. What format do you want to store it in? I'd recommend Parquet, since columnar storage is more efficient for queries. If you want to store it as JSON, you can use a Hive JSON SerDe (I wrote one: https://github.com/rcongiu/Hive-JSON-Serde ).
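Putting it together, a rough sketch of the whole flow (Spark 1.x; jsonRdd is a placeholder for your SchemaRDD, and STORED AS PARQUET requires Hive 0.13+, older Hive needs the parquet-hive SerDe spelled out instead):

// register the SchemaRDD, then CTAS into a Parquet-backed Hive table
hc.registerRDDAsTable(jsonRdd, "myjsontable")
hc.sql("CREATE TABLE myhivejsontable STORED AS PARQUET AS SELECT * FROM myjsontable")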

I also wrote a post about creating nested data (Parquet) in Spark SQL from JSON, for use with Hive: http://www.congiu.com/creating-nested-data-parquet-in-spark-sql/

+1

Another option is to load the JSON SerDe into Hive from a script and define a table over the JSON data directly; Hive can then query it in place.
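For example, a sketch of defining a Hive table over the raw JSON with that SerDe (the jar path, column list, and HDFS location are placeholders; the SerDe class name is the one from the Hive-JSON-Serde project):

// make the SerDe jar visible to Hive, then map an external table onto the JSON files
hc.sql("ADD JAR /path/to/json-serde-1.3-jar-with-dependencies.jar")
hc.sql("CREATE EXTERNAL TABLE raw_json_logs (uid STRING, ts BIGINT) " +
  "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' " +
  "LOCATION 'hdfs:///data/logs/'")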

0
