Add a JAR to standalone PySpark

I run the pyspark program:

    $ export SPARK_HOME=
    $ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
    $ python

And the Python code:

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName("Example").setMaster("local[2]")
    sc = SparkContext(conf=conf)

How can I add JAR dependencies such as the Databricks spark-csv JAR? From the command line, I can add the package as follows:

    $ pyspark --packages com.databricks:spark-csv_2.10:1.3.0
    # or
    $ spark-submit --packages com.databricks:spark-csv_2.10:1.3.0 ...

But I do not use either of these. The program is part of a larger workflow that does not use spark-submit; I should be able to run my ./foo.py program directly and have it work.

  • I know you can set the Spark extraClassPath properties, but do you then have to copy the JAR files to each node?
  • I tried conf.set("spark.jars", "jar1,jar2"), which did not work either and ended with a py4j ClassNotFoundException (a sketch of this attempt follows below).
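
For reference, the failing attempt looked roughly like this (a sketch; the JAR paths are placeholders):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("Example").setMaster("local[2]")
    # Setting spark.jars here did not pull in the dependency; the job later
    # failed with a py4j ClassNotFoundException, as described above.
    conf.set("spark.jars", "/path/to/jar1.jar,/path/to/jar2.jar")
    sc = SparkContext(conf=conf)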
Tags: python, apache-spark, pyspark

5 answers

Any dependencies can be passed using the spark.jars.packages property (spark.jars should work as well) in $SPARK_HOME/conf/spark-defaults.conf. It should be a comma-separated list of Maven coordinates.
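
For example, a single line like the following in spark-defaults.conf would pull in the spark-csv package from the question (the coordinate is illustrative; adjust the Scala and package versions to your setup):

    # $SPARK_HOME/conf/spark-defaults.conf
    spark.jars.packages  com.databricks:spark-csv_2.10:1.3.0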

The packages and classpath properties have to be set before the JVM is started, and that happens during SparkConf initialization. This means the SparkConf.set method cannot be used here.

An alternative approach is to set the PYSPARK_SUBMIT_ARGS environment variable before the SparkConf object is initialized:

    import os
    from pyspark import SparkConf, SparkContext

    SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

    conf = SparkConf()
    sc = SparkContext(conf=conf)
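
Once the context is up with the package resolved, the CSV reader can be used through an SQLContext, for example (a sketch; data.csv is a placeholder file):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    df = sqlContext.read.format("com.databricks.spark.csv") \
        .options(header="true", inferSchema="true") \
        .load("data.csv")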

There are many approaches here (setting environment variables, adding to $SPARK_HOME/conf/spark-defaults.conf, etc.), and some answers already cover them. I wanted to add an answer for those who use Jupyter Notebooks and create a Spark session from within a notebook. Here's the solution that worked best for me (in my case, I wanted the Kafka package loaded):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('my_awesome') \
        .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0') \
        .getOrCreate()

With this line of code I did not need to do anything else (no environment variable or conf-file changes).
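
For what it's worth, once the session comes up with the Kafka package resolved, it can be exercised with something like the following (a sketch; the broker address and topic name are placeholders):

    # Structured Streaming source backed by the spark-sql-kafka package above.
    df = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "my_topic") \
        .load()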


Finally found the answer after several attempts. The answer is specific to the spark-csv JAR. Create a folder on your hard drive, say D:\Spark\spark_jars, and put the following JARs there:

  • spark-csv_2.10-1.4.0.jar (this is the version I'm using)
  • commons-csv-1.1.jar
  • univocity-parsers-1.5.1.jar

The second and third JARs are dependencies of spark-csv, so those two files need to be downloaded as well. Go to the conf directory of the Spark installation you downloaded and, in the spark-defaults.conf file, add the line:

    spark.driver.extraClassPath D:/Spark/spark_jars/*

The asterisk should include all the JARs. Now run Python and create a SparkContext and SQLContext as usual. You should now be able to use spark-csv like this:

    sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', inferschema='true') \
        .load('foobar.csv')

I ran into a similar problem for a different JAR ("MongoDB Connector for Spark", mongo-spark-connector), but the big caveat was that I installed Spark via pyspark in conda (conda install pyspark), so much of the Spark-specific advice in the other answers was not entirely applicable. For those of you who install with conda, here is the process I pieced together:

1) Find where your pyspark/jars directory is located. Mine was at this path: ~/anaconda2/pkgs/pyspark-2.3.0-py27_0/lib/python2.7/site-packages/pyspark/jars. (A programmatic way to locate it is sketched after step 3.)

2) Download the JAR file from this location into the path found in step 1.

3) Now you should be able to run something like this (code taken from the official MongoDB tutorial, using Breeford Wiley's answer above):

    from pyspark.sql import SparkSession

    my_spark = SparkSession \
        .builder \
        .appName("myApp") \
        .config("spark.mongodb.input.uri", "mongodb://127.0.0.1:27017/spark.test_pyspark_mbd_conn") \
        .config("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/spark.test_pyspark_mbd_conn") \
        .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
        .getOrCreate()
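
As an aside on step 1, a quick way to locate the pyspark/jars directory programmatically (a sketch, assuming pyspark is importable in the same environment):

    import os
    import pyspark

    # The bundled JARs sit next to the installed pyspark package.
    print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))

And with the session configured as above, reading the input collection should then work along these lines (a sketch based on the connector's DefaultSource format; not verified here):

    # Reads from the collection given by spark.mongodb.input.uri above.
    df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
    df.printSchema()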

Disclaimer:

1) I do not know if this answer/SO question is the right place to post this; please suggest a better place and I will move it.

2) If you think I got something wrong or have improvements to the process described above, please comment and I will revise.

Another answer sets up sys.path by hand:

    import os
    import sys

    spark_home = os.environ.get('SPARK_HOME', None)
    sys.path.insert(0, spark_home + "/python")
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.4-src.zip'))

Here it is.

 sys.path.insert(0, <PATH TO YOUR JAR>) 

Then...

    import pyspark
    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local[1]")
    ...
