How do I set worker/executor environment variables in Apache Spark?

My Spark program on EMR keeps failing with this error:

    Caused by: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
        at sun.security.ssl.SSLSessionImpl.getPeerCertificates(SSLSessionImpl.java:421)
        at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:128)
        at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:397)
        at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
        at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:149)
        at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
        at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:573)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:425)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:942)
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2148)
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2075)
        at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1093)
        at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:548)
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:172)
        at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
        at org.apache.hadoop.fs.s3native.$Proxy8.retrieveMetadata(Unknown Source)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)

I did some research and found that this certificate check can be disabled in low-security situations by setting the following variable:

 com.amazonaws.sdk.disableCertChecking=true 

but I can only set it with spark-submit.sh --conf, which only affects the driver, while most of the errors occur on the workers.

Is there a way to propagate it to the workers?

Many thanks.

amazon-s3 amazon-web-services distributed-computing apache-spark
4 answers

Just stumbled upon something in the Spark documentation:

spark.executorEnv.[EnvironmentVariableName]

Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.

So in your case, I would set the Spark configuration parameter spark.executorEnv.com.amazonaws.sdk.disableCertChecking to true and see if that helps.
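For example, in PySpark this could look roughly like the sketch below. The property name comes from the question; whether the AWS code actually reads it from an environment variable rather than a JVM system property is an assumption, so an extraJavaOptions fallback is included as well:

    import pyspark

    conf = pyspark.SparkConf()
    # Set the flag as an environment variable on every executor process.
    conf.set('spark.executorEnv.com.amazonaws.sdk.disableCertChecking', 'true')
    # Fallback (assumption): if the flag is only honoured as a JVM system
    # property, pass it through the executors' JVM options instead.
    conf.set('spark.executor.extraJavaOptions',
             '-Dcom.amazonaws.sdk.disableCertChecking=true')

    sc = pyspark.SparkContext(conf=conf)

The same keys can also be passed on the command line, e.g. spark-submit --conf spark.executorEnv.com.amazonaws.sdk.disableCertChecking=true.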


Adding to the existing answer:

    import pyspark

    def get_spark_context(app_name):
        # Configure the application.
        conf = pyspark.SparkConf()
        conf.set('spark.app.name', app_name)
        # Set an environment variable for the executors. This must be done
        # before the SparkContext is created, or it has no effect.
        conf.set('spark.executorEnv.SOME_ENVIRONMENT_VALUE', 'I_AM_PRESENT')
        # Initialise and return.
        sc = pyspark.SparkContext.getOrCreate(conf=conf)
        return pyspark.SQLContext(sparkContext=sc)

The environment variable SOME_ENVIRONMENT_VALUE will then be available to the executors/workers.

In your Spark application, you can read it as follows:

    import os

    some_environment_value = os.environ.get('SOME_ENVIRONMENT_VALUE')
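To verify that the value is visible on the executors rather than only on the driver, a quick check can be run as a small job. This is only a sketch; it assumes the context created by get_spark_context above is still active and that the setting propagated:

    import os

    import pyspark

    # Reuse the context created by get_spark_context().
    sc = pyspark.SparkContext.getOrCreate()

    # The lambda executes on the executors, so this reads their environment.
    print(sc.parallelize(range(2), 2)
            .map(lambda _: os.environ.get('SOME_ENVIRONMENT_VALUE'))
            .collect())
    # Expected (if the setting propagated): ['I_AM_PRESENT', 'I_AM_PRESENT']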

Based on the other answers, here is a complete example that works (PySpark 2.4.1). In this example, I force all workers to spawn only one thread per core in the Intel MKL (Math Kernel Library):

    import pyspark

    conf = pyspark.conf.SparkConf().setAll([
        ('spark.executorEnv.OMP_NUM_THREADS', '1'),
        ('spark.workerEnv.OMP_NUM_THREADS', '1'),
        ('spark.executorEnv.OPENBLAS_NUM_THREADS', '1'),
        ('spark.workerEnv.OPENBLAS_NUM_THREADS', '1'),
        ('spark.executorEnv.MKL_NUM_THREADS', '1'),
        ('spark.workerEnv.MKL_NUM_THREADS', '1'),
    ])
    spark = pyspark.sql.SparkSession.builder.config(conf=conf).getOrCreate()

    # Print the current PySpark configuration to be sure.
    print("Current PySpark settings: ", spark.sparkContext._conf.getAll())
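As an extra sanity check (not part of the original answer), the executors' environment can be inspected directly; the expected output below assumes the spark.executorEnv.* settings above took effect:

    import os

    def thread_env(_):
        # Runs on an executor, so it reports the executor's environment.
        return {k: os.environ.get(k)
                for k in ('OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS')}

    print(spark.sparkContext.parallelize([0], 1).map(thread_env).collect())
    # Expected: [{'OMP_NUM_THREADS': '1', 'OPENBLAS_NUM_THREADS': '1', 'MKL_NUM_THREADS': '1'}]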

For Spark 2.4, @Amit Kushwaha's method did not work.

I checked:

1. cluster mode

    spark-submit \
        --conf spark.executorEnv.DEBUG=1 \
        --conf spark.appMasterEnv.DEBUG=1 \
        --conf spark.yarn.appMasterEnv.DEBUG=1 \
        --conf spark.yarn.executorEnv.DEBUG=1 \
        main.py

2. client mode

    spark-submit \
        --deploy-mode=client \
        --conf spark.executorEnv.DEBUG=1 \
        --conf spark.appMasterEnv.DEBUG=1 \
        --conf spark.yarn.appMasterEnv.DEBUG=1 \
        --conf spark.yarn.executorEnv.DEBUG=1 \
        main.py

None of the above sets the DEBUG variable in the executors' operating-system environment; in other words, os.environ.get('DEBUG') cannot read it.


The only approach that worked was to read the value from spark.conf:

Submit:

 spark-submit --conf DEBUG=1 main.py 

get the variable:

 DEBUG = spark.conf.get('DEBUG') 
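Note that spark.conf.get() runs on the driver, so if the executors need the value it still has to be shipped to them, for example through a closure or a broadcast variable. A sketch (DEBUG is the hypothetical key from this answer; some Spark versions warn about and drop --conf keys that do not start with spark., in which case a spark.-prefixed key such as spark.DEBUG may be needed):

    # Read the value on the driver (the second argument is a default).
    debug = spark.conf.get('DEBUG', '0')

    # Option 1: capture it in a closure; it is serialized with the function.
    print(spark.sparkContext.parallelize(range(2), 2)
               .map(lambda _: debug)
               .collect())

    # Option 2: broadcast it once and reuse it across many tasks.
    debug_bc = spark.sparkContext.broadcast(debug)
    print(spark.sparkContext.parallelize(range(2), 2)
               .map(lambda _: debug_bc.value)
               .collect())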
