Multiple Spark Applications with HiveContext

I have two separate pyspark applications that create an instance of HiveContext instead of SQLContext. One of the two applications fails with:

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))

The other application completes successfully.

I am using Spark 1.6 from the Python API and want to use some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I had the same problem on 1.5.2 and earlier.

This is enough to reproduce:

    import time

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import HiveContext

    conf = SparkConf()
    sc = SparkContext(conf=conf)
    sq = HiveContext(sc)

    data_source = '/tmp/data.parquet'
    df = sq.read.parquet(data_source)
    time.sleep(60)

The sleep is only there to keep the script running while I start the second process.

If I run two instances of this script, the above error appears when reading the parquet file. When I replace HiveContext with SQLContext, everything works fine.

Does anyone know why this is?

hive apache-spark pyspark
1 answer

By default, Hive(Context) uses an embedded Derby database as its metastore. It is intended primarily for testing and supports only one active user at a time. If you want to support multiple running applications, you have to configure a standalone metastore. Hive currently supports PostgreSQL, MySQL, Oracle, and MS SQL Server as backends. The configuration details depend on the backend and the mode (local or remote), but in general you will need a running database server and a hive-site.xml that points Spark at it.
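As an illustration, a minimal hive-site.xml for a MySQL-backed metastore might look like the sketch below; the hostname, database name, and credentials are placeholders, not values from the question:

```xml
<configuration>
  <!-- JDBC connection to the standalone metastore database
       (placeholder host, database, and credentials) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```

The matching JDBC driver jar also has to be on the driver classpath.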

Cloudera provides a comprehensive guide that may come in handy: Configuring the Hive Metastore.

In theory it is also possible to create separate Derby metastores with the proper configuration (see the Hive Admin Manual – Local/Embedded Metastore Database) or to use Derby in server mode.
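As a sketch of the Derby server-mode option: start the Derby network server, then point hive-site.xml at it over JDBC using the client driver instead of the embedded one (host, port, and database name below are assumptions):

```xml
<configuration>
  <!-- Connect to a Derby network server instead of the embedded
       Derby instance (placeholder host, port, and database) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.ClientDriver</value>
  </property>
</configuration>
```

Because the network server handles concurrent connections, multiple applications can then share one Derby-backed metastore.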

For development, you can run the applications from different working directories. Each application will then create its own metastore_db, which avoids the problem of multiple active users. Providing a separate Hive configuration per application should work as well, but it is less convenient during development:

When not configured by hive-site.xml, the context automatically creates metastore_db in the current directory
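A minimal sketch of this workaround: launch the same script from two different working directories so each process gets its own metastore_db. The stand-in child below just prints its working directory; in practice you would run `spark-submit your_app.py` the same way (the /tmp paths are placeholders):

```python
import os
import subprocess
import sys

# Two separate working directories (placeholder paths); each PySpark
# run started from one of these would create its own metastore_db there.
workdirs = ["/tmp/app1", "/tmp/app2"]
for d in workdirs:
    os.makedirs(d, exist_ok=True)

# Stand-in for `spark-submit app.py`: each child process inherits the
# cwd= it was started with, which is where Derby would put metastore_db.
procs = [
    subprocess.Popen(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=d,
        stdout=subprocess.PIPE,
    )
    for d in workdirs
]
outputs = [p.communicate()[0].decode().strip() for p in procs]
print(outputs)  # each child reports its own working directory
```

Since each process sees a different current directory, the two embedded Derby metastores never collide.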

