What is the correct way to run a Spark application on YARN using Oozie (with Hue)?

I wrote an application in Scala that uses Spark.
The application consists of two modules: the App module, which contains classes with the business logic, and the Env module, which contains the environment and system initialization code as well as utility functions.
The entry point is in Env; after initialization, it instantiates a class from App (chosen by args, via Class.forName) and executes its logic.
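Sketched in plain Scala, that dispatch might look like the following (the Task trait and the task body are hypothetical illustrations; only the Class.forName mechanism is taken from the question):

```scala
// Hypothetical common interface between the Env and App modules.
trait Task {
  def run(): Unit
}

// In the real setup this class lives in app.jar under the app package;
// it is defined here only to keep the example self-contained.
class AggBlock1Task extends Task {
  def run(): Unit = println("AggBlock1Task running")
}

object Main {
  def main(args: Array[String]): Unit = {
    // args(0) is the fully qualified task class name,
    // e.g. "app.AggBlock1Task" in the real deployment.
    val task = Class
      .forName(args(0))
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[Task]
    task.run()
  }
}
```

The question's Arguments field ("app.AggBlock1Task") is what ends up in args(0) here.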
Modules are exported to two different JARs (namely env.jar and app.jar ).

When I run the application locally, it runs well. The next step is to deploy the application to my servers. I am using Cloudera CDH 5.4.

I used Hue to create a new Oozie workflow with a Spark task with the following options:

  • Spark Master: yarn
  • Mode: cluster
  • Application name: myApp
  • Jars / py files: lib/env.jar,lib/app.jar
  • Main class: env.Main (in the Env module)
  • Arguments: app.AggBlock1Task
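For reference, Hue translates these fields into an Oozie Spark action roughly like this (a sketch with the workflow-app boilerplate omitted; action and transition names are illustrative):

```xml
<action name="spark-task">
  <spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>yarn</master>
    <mode>cluster</mode>
    <name>myApp</name>
    <class>env.Main</class>
    <jar>lib/env.jar,lib/app.jar</jar>
    <arg>app.AggBlock1Task</arg>
  </spark>
  <ok to="end"/>
  <error to="kill"/>
</action>
```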

Then I placed 2 JARs inside the lib folder in the workflow folder ( /user/hue/oozie/workspaces/hue-oozie-1439807802.48 ).

When I start the workflow, it throws a FileNotFoundException and the application does not execute:

 java.io.FileNotFoundException: File file:/cloudera/yarn/nm/usercache/danny/appcache/application_1439823995861_0029/container_1439823995861_0029_01_000001/lib/app.jar,lib/env.jar does not exist 

However, when I leave the Master and Mode parameters empty, everything runs, but when I read spark.master programmatically it is set to local[*], not yarn. Also, while inspecting the logs, I came across this in the setup of the Oozie Spark action:

 --master null --name myApp --class env.Main --verbose lib/env.jar,lib/app.jar app.AggBlock1Task 

I assume I am not doing this correctly: leaving the Master and Mode parameters unset and running the application with spark.master set to local[*]. As I understand it, creating a SparkConf object in the application should pick up the spark.master property as specified in Oozie (yarn in this case), but that just doesn't happen.
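For completeness, the usual pattern in cluster mode is to not hardcode the master in code at all and let the launcher (spark-submit, or the Oozie Spark action on its behalf) supply it. A minimal sketch, assuming the app name is the only thing set programmatically:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Deliberately no setMaster() here: when launched via spark-submit or the
// Oozie Spark action, spark.master is injected as a property (--master yarn)
// and SparkConf picks it up. A hardcoded setMaster("local[*]") would win
// over the injected value and break cluster deployment.
val conf = new SparkConf().setAppName("myApp")
val sc = new SparkContext(conf)
```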

Is there something I'm doing wrong or missing?
Any help would be greatly appreciated!

1 answer

I managed to solve the problem by placing the two JARs in the user directory /user/danny/app/ and setting the Jar/py files parameter to ${nameNode}/user/danny/app/env.jar . At first the run threw a ClassNotFoundException , even though the JAR was in that same HDFS folder. To get around this, I went to the settings and added the following to the options list: --jars ${nameNode}/user/danny/app/app.jar . That makes the App module visible to the driver and executors, and the application runs successfully.
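In Oozie Spark action terms, the fix corresponds to something like the following sketch (boilerplate omitted; action names are illustrative): the main JAR becomes an absolute HDFS path, and the second JAR is shipped through spark-opts.

```xml
<action name="spark-task">
  <spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>yarn</master>
    <mode>cluster</mode>
    <name>myApp</name>
    <class>env.Main</class>
    <jar>${nameNode}/user/danny/app/env.jar</jar>
    <spark-opts>--jars ${nameNode}/user/danny/app/app.jar</spark-opts>
    <arg>app.AggBlock1Task</arg>
  </spark>
  <ok to="end"/>
  <error to="kill"/>
</action>
```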

