I have two separate PySpark applications that each instantiate a HiveContext in place of a SQLContext; one of the two applications fails with:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))
The other application completes successfully.
I am using Spark 1.6 from the Python API and want to use some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I had the same problem on 1.5.2 and earlier.
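For context, this is roughly the kind of usage I'm after (a minimal sketch with made-up data; the column names are just for illustration):

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext()
sq = HiveContext(sc)

# toy DataFrame; collect_set needs a HiveContext here, as noted above
df = sq.createDataFrame([('a', 1), ('a', 2), ('a', 1), ('b', 3)], ['k', 'v'])
df.groupBy('k').agg(F.collect_set('v').alias('vs')).show()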
This is enough to reproduce:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)

data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)
time.sleep(60)
The sleep is just there to keep the script running while I start the other process.
If I run two instances of this script at the same time, the above error appears when reading the parquet file. When I replace HiveContext with SQLContext, everything is fine.
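For comparison, this is the variant that works, identical apart from the context class:

import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sq = SQLContext(sc)  # plain SQLContext instead of HiveContext

df = sq.read.parquet('/tmp/data.parquet')
time.sleep(60)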
Does anyone know why this is?
hive apache-spark pyspark
karlson