I am writing a simple Spark application that takes an input RDD, sends it to an external script through a pipe, and writes the script's output to a file. The driver code is as follows:
val input = args(0)
val scriptPath = args(1)
val output = args(2)
val sc = getSparkContext
if (args.length == 4) {
  // For local testing only: pass an additional argument containing
  // an absolute path to a script on my local machine
  sc.addFile(args(3))
}
sc.textFile(input).pipe(Seq("python2", SparkFiles.get(scriptPath))).saveAsTextFile(output)
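The contents of test.py are not shown in the question. For context, RDD.pipe feeds each element of a partition to the script's stdin, one per line, and collects the script's stdout lines as the resulting RDD. A minimal, purely hypothetical sketch of such a stdin-to-stdout filter (the real script's logic is not given):

```python
# Hypothetical sketch of a script usable with RDD.pipe:
# read one element per line from stdin, write one result per line to stdout.
import sys

def transform(line):
    # Placeholder transformation; the actual test.py logic is not shown
    # in the question.
    return line.strip().upper()

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform(line))
```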
When I run it on my local machine, it works fine. But when I submit it to the YARN cluster with

spark-submit --master yarn --deploy-mode cluster --files /absolute/path/to/local/test.py --class somepackage.PythonLauncher path/to/driver.jar path/to/input/part-* test.py path/to/output

it fails with the following exception:
Lost task 1.0 in stage 0.0 (TID 1, rwds2.1dmp.ru): java.lang.Exception: Subprocess exited with status 2
I have tried different variations of the pipe command. For example, .pipe("cat") works as expected, but .pipe(Seq("cat", scriptPath)) also fails, with exit code 1, so it seems that Spark cannot resolve the path to the script on the cluster nodes.
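One plausible cause (an assumption on my part, not something confirmed above): files shipped with spark-submit --files are placed in each executor's working directory under their base name, so the key passed to SparkFiles.get should be the bare file name ("test.py"), not a local absolute path. A sketch of deriving that lookup key from the --files argument:

```python
# Sketch (assumption): a file distributed via
#   spark-submit --files /absolute/path/to/local/test.py
# is exposed on the executors by its base name, "test.py", so that is
# the name to pass to SparkFiles.get, not the driver-local path.
import os

def spark_files_key(files_arg):
    # Derive the name an executor would use to locate the shipped file.
    return os.path.basename(files_arg)
```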
Any suggestions?
hdfs yarn apache-spark
Alexander Tokarev