How to submit a Spark job to a remote master node in yarn-client mode?

I need to submit Spark applications / jobs to a remote Spark cluster. I am currently running spark-submit on my own machine, pointing at the IP address of the master node, in yarn-client mode. Note that my machine is not part of the cluster. I submit my job with this command:

./spark-submit --class SparkTest --deploy-mode client /home/vm/app.jar 

I have the address of the master hard-coded in my application in the form:

    val spark_master = "spark://IP:7077"

And yet I still get this error:

    16/06/06 03:04:34 INFO AppClient$ClientEndpoint: Connecting to master spark://IP:7077...
    16/06/06 03:04:34 WARN AppClient$ClientEndpoint: Failed to connect to master IP:7077
    java.io.IOException: Failed to connect to /IP:7077
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.ConnectException: Connection refused: /IP:7077

Or if I use

 ./spark-submit --class SparkTest --master yarn --deploy-mode client /home/vm/test.jar 

I get

    Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
        at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:251)
        at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:228)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:109)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Do I need to have the Hadoop configuration on my workstation? All the work will be performed remotely, and this machine is not part of the cluster. I am using Spark 1.6.1.

cluster-computing hadoop yarn apache-spark
1 answer

First of all, if you call conf.setMaster(...) in your application code, it takes the highest precedence (over the --master argument passed to spark-submit). If you want to run in yarn-client mode, do not hard-code MASTER_IP:7077 in the application code at all (a minimal sketch is shown below). Instead, you must provide the Hadoop client configuration files to your driver.
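
For illustration, here is a minimal sketch of driver code that leaves the choice of master entirely to spark-submit (the class name and app name are just placeholders taken from the question):

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkTest {
      def main(args: Array[String]): Unit = {
        // No setMaster(...) here: in yarn-client mode the master is supplied
        // by spark-submit via --master yarn, together with HADOOP_CONF_DIR.
        val conf = new SparkConf().setAppName("SparkTest")
        val sc = new SparkContext(conf)

        // ... job logic goes here ...

        sc.stop()
      }
    }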

You must set the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR to point to the directory that contains the client configurations.
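
For example, something along these lines should work from the workstation, assuming the client configuration files (core-site.xml, yarn-site.xml, hdfs-site.xml, and so on) have been copied from the cluster into a local directory such as /home/vm/hadoop-conf (a placeholder path):

    export HADOOP_CONF_DIR=/home/vm/hadoop-conf   # client configs copied from the cluster
    ./spark-submit --class SparkTest --master yarn --deploy-mode client /home/vm/app.jar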

http://spark.apache.org/docs/latest/running-on-yarn.html

Depending on which Hadoop features you use in your Spark application, some of these configuration files will be used to look up settings. If you use Hive (via HiveContext in spark-sql), it will look for hive-site.xml. hdfs-site.xml will be used to find the NameNode coordinates for reading / writing to HDFS from your job.
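
As a rough illustration (the table name and HDFS path below are placeholders), this is the kind of code, inside the main method of the sketch above, that depends on those files: hive-site.xml for the Hive metastore location, core-site.xml / hdfs-site.xml for the NameNode address:

    // hive-site.xml tells HiveContext where the Hive metastore lives
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val df = hiveContext.sql("SELECT * FROM some_table")   // placeholder table name

    // core-site.xml / hdfs-site.xml resolve the NameNode behind hdfs:// paths
    val lines = sc.textFile("hdfs:///some/input/path")     // placeholder path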

