How to download external packages with spark-shell when behind a corporate proxy?

I would like to launch spark-shell with an external package from behind a corporate proxy. Unfortunately, external packages requested via the --packages option are never downloaded.

For example, when starting with

 bin/spark-shell --packages datastax:spark-cassandra-connector:1.5.0-s_2.10 

the Cassandra connector package is not resolved (the shell gets stuck on the last line of the output below):

 Ivy Default Cache set to: /root/.ivy2/cache
 The jars for the packages stored in: /root/.ivy2/jars
 :: loading settings :: url = jar:file:/opt/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
 datastax#spark-cassandra-connector added as a dependency
 :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
         confs: [default]

After some time, the connection times out with error messages similar to this:

 :::: ERRORS
         Server access error at url https://repo1.maven.org/maven2/datastax/spark-cassandra-connector/1.5.0-s_2.10/spark-cassandra-connector-1.5.0-s_2.10.pom (java.net.ConnectException: Connection timed out)

When I deactivate the VPN with the corporate proxy, the package is resolved and downloaded immediately.
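
As a quick sanity check, the proxy itself can be tested against Maven Central from the same shell (this assumes curl is installed; <proxyHost> and <proxyPort> are the same placeholders used below):

 curl -x http://<proxyHost>:<proxyPort> -I https://repo1.maven.org/maven2/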

What I have tried so far:

Setting the proxies as environment variables:

 export http_proxy=<proxyHost>:<proxyPort>
 export https_proxy=<proxyHost>:<proxyPort>
 export JAVA_OPTS="-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort>"
 export ANT_OPTS="-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort>"

Starting spark-shell with additional Java options:

 bin/spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort>" --conf "spark.executor.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort>" --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 

Is there any other configuration option that I am missing?

7 answers

Found the correct settings:

 bin/spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort>" --packages <somePackage> 

Both the HTTP and the HTTPS proxy must be set as extra driver options. JAVA_OPTS has no effect.
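
For example, with the connector coordinates from the question and hypothetical proxy values (proxy.example.com and 8080 are placeholders for your corporate proxy):

 bin/spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080" --packages datastax:spark-cassandra-connector:1.5.0-s_2.10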


This worked for me in Spark 1.6.1:

 bin\spark-shell --driver-java-options "-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort>" --packages <package> 

Adding

 spark.driver.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort> 

to $SPARK_HOME/conf/spark-defaults.conf works for me.
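
With that line in place, spark-shell can be started without any proxy-related flags, for example (reusing the connector from the question):

 bin/spark-shell --packages datastax:spark-cassandra-connector:1.5.0-s_2.10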


If your proxy requires authentication, you can add the following to the defaults configuration file:

 spark.driver.extraJavaOptions -Dhttp.proxyHost= -Dhttp.proxyPort= -Dhttps.proxyHost= -Dhttps.proxyPort= -Dhttp.proxyUsername= -Dhttp.proxyPassword= -Dhttps.proxyUsername= -Dhttps.proxyPassword= 

Struggled with pyspark until I found this:

Adding to @Tao Huang's answer:

bin/pyspark --driver-java-options="-Dhttp.proxyUser=user -Dhttp.proxyPassword=password -Dhttps.proxyUser=user -Dhttps.proxyPassword=password -Dhttp.proxyHost=proxy -Dhttp.proxyPort=port -Dhttps.proxyHost=proxy -Dhttps.proxyPort=port" --packages [groupId:artifactId]

That is, it should be -Dhttp(s).proxyUser instead of ...proxyUsername.
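
Putting the corrected property names together with the spark-defaults.conf approach above gives something like this (proxy.example.com, 8080, myuser and mypassword are placeholder values; verify the proxyUser/proxyPassword property names against your Spark/Ivy version):

 spark.driver.extraJavaOptions -Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080 -Dhttp.proxyUser=myuser -Dhttp.proxyPassword=mypassword -Dhttps.proxyUser=myuser -Dhttps.proxyPassword=mypassword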


On Windows 7 with spark-2.0.0-bin-hadoop2.7 I set spark.driver.extraJavaOptions in %SPARK_HOME%\spark-2.0.0-bin-hadoop2.7\conf\spark-defaults.conf as:

 spark.driver.extraJavaOptions -Dhttp.proxyHost=hostname -Dhttp.proxyPort=port -Dhttps.proxyHost=host -Dhttps.proxyPort=port 

If the proxy is configured correctly at the OS level, you can use the Java property java.net.useSystemProxies:

--conf "spark.driver.extraJavaOptions=-Djava.net.useSystemProxies=true"

so the proxy host/port and the non-proxy hosts are picked up from the system configuration.
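
For example, reusing the connector coordinates from the question:

 bin/spark-shell --conf "spark.driver.extraJavaOptions=-Djava.net.useSystemProxies=true" --packages datastax:spark-cassandra-connector:1.5.0-s_2.10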

