I am writing a simple Spark application that takes an input RDD, sends it to an external script through a pipe, and writes the script's output to a file. The driver code is as follows:
val input = args(0)
val scriptPath = args(1)
val output = args(2)
val sc = getSparkContext
if (args.length == 4) {
  // For local testing only: pass an additional argument containing
  // an absolute path to a script on my local machine
  sc.addFile(args(3))
}
sc.textFile(input).pipe(Seq("python2", SparkFiles.get(scriptPath))).saveAsTextFile(output)
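The contents of test.py are not shown in the question. For context, RDD.pipe feeds each element of a partition to the script's stdin, one per line, and collects the script's stdout lines as the resulting RDD. A minimal, purely hypothetical sketch of such a stdin-to-stdout filter (the real script's logic is not given):

```python
# Hypothetical sketch of a script usable with RDD.pipe:
# read one element per line from stdin, write one result per line to stdout.
import sys

def transform(line):
    # Placeholder transformation; the actual test.py logic is not shown
    # in the question.
    return line.strip().upper()

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform(line))
```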
When I run it on my local machine, it works fine. But when I submit it to the YARN cluster with

spark-submit --master yarn --deploy-mode cluster --files /absolute/path/to/local/test.py --class somepackage.PythonLauncher path/to/driver.jar path/to/input/part-* test.py path/to/output

it fails with the following exception:
Lost task 1.0 in stage 0.0 (TID 1, rwds2.1dmp.ru): java.lang.Exception: Subprocess exited with status 2
I have tried different variations of the pipe command. For example, .pipe("cat") works as expected, but .pipe(Seq("cat", scriptPath)) also fails, with exit code 1, so it seems that Spark cannot resolve the path to the script on the cluster nodes.
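One plausible cause (an assumption on my part, not something confirmed above): files shipped with spark-submit --files are placed in each executor's working directory under their base name, so the key passed to SparkFiles.get should be the bare file name ("test.py"), not a local absolute path. A sketch of deriving that lookup key from the --files argument:

```python
# Sketch (assumption): a file distributed via
#   spark-submit --files /absolute/path/to/local/test.py
# is exposed on the executors by its base name, "test.py", so that is
# the name to pass to SparkFiles.get, not the driver-local path.
import os

def spark_files_key(files_arg):
    # Derive the name an executor would use to locate the shipped file.
    return os.path.basename(files_arg)
```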
Any suggestions?
hdfs yarn apache-spark
Alexander Tokarev