Multiple Spark Applications with HiveContext

I have two separate pyspark applications that create an instance of HiveContext instead of SQLContext. One of the two applications fails with:

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))

The other application completes successfully.

I am using Spark 1.6 from the Python API and want to use some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I had the same problem on 1.5.2 and earlier.

This is enough to reproduce:

    import time

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import HiveContext

    conf = SparkConf()
    sc = SparkContext(conf=conf)
    sq = HiveContext(sc)

    data_source = '/tmp/data.parquet'
    df = sq.read.parquet(data_source)
    time.sleep(60)

The sleep is only there to keep the script running while I start the second process.

If I run two instances of this script, the above error appears when reading the parquet file. When I replace HiveContext with SQLContext, everything works fine.

Does anyone know why this is?

hive apache-spark pyspark
1 answer

By default, Hive(Context) uses an embedded Derby database as its metastore. It is intended primarily for testing and supports only one active user at a time. If you want to support multiple running applications, you have to configure a standalone metastore. Hive currently supports PostgreSQL, MySQL, Oracle, and MS SQL Server as backends. The configuration details depend on the backend and the mode (local or remote), but in general you will need a running database server and a hive-site.xml that points Spark at it.
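As an illustration, a minimal hive-site.xml for a MySQL-backed metastore might look like the sketch below; the hostname, database name, and credentials are placeholders, not values from the question:

```xml
<configuration>
  <!-- JDBC connection to the standalone metastore database
       (placeholder host, database, and credentials) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```

The matching JDBC driver jar also has to be on the driver classpath.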

Cloudera provides a comprehensive guide that may come in handy: Configuring the Hive Metastore.

In theory it is also possible to create separate Derby metastores with the proper configuration (see the Hive Admin Manual – Local/Embedded Metastore Database) or to use Derby in server mode.
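As a sketch of the Derby server-mode option: start the Derby network server, then point hive-site.xml at it over JDBC using the client driver instead of the embedded one (host, port, and database name below are assumptions):

```xml
<configuration>
  <!-- Connect to a Derby network server instead of the embedded
       Derby instance (placeholder host, port, and database) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.ClientDriver</value>
  </property>
</configuration>
```

Because the network server handles concurrent connections, multiple applications can then share one Derby-backed metastore.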

For development, you can run the applications from different working directories. Each application will then create its own metastore_db, which avoids the problem of multiple active users. Providing a separate Hive configuration per application should work as well, but it is less convenient during development:

When not configured by hive-site.xml, the context automatically creates metastore_db in the current directory
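A minimal sketch of this workaround: launch the same script from two different working directories so each process gets its own metastore_db. The stand-in child below just prints its working directory; in practice you would run `spark-submit your_app.py` the same way (the /tmp paths are placeholders):

```python
import os
import subprocess
import sys

# Two separate working directories (placeholder paths); each PySpark
# run started from one of these would create its own metastore_db there.
workdirs = ["/tmp/app1", "/tmp/app2"]
for d in workdirs:
    os.makedirs(d, exist_ok=True)

# Stand-in for `spark-submit app.py`: each child process inherits the
# cwd= it was started with, which is where Derby would put metastore_db.
procs = [
    subprocess.Popen(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=d,
        stdout=subprocess.PIPE,
    )
    for d in workdirs
]
outputs = [p.communicate()[0].decode().strip() for p in procs]
print(outputs)  # each child reports its own working directory
```

Since each process sees a different current directory, the two embedded Derby metastores never collide.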

