I recently started using Spark, and I want to run a Spark job from a Spring web application.
I have a situation where I run a web application on a Tomcat server using Spring Boot. My web application receives a REST request, and based on that request it should launch a job that runs a computation on a YARN cluster. Since the job may run for a long time and accesses data from HDFS, I want to submit the Spark job in yarn-cluster mode, and I don't want to keep a SparkContext alive in my web layer.

Another reason for this is that my application is multi-tenant, so each tenant should be able to run its own jobs; in yarn-cluster mode, each tenant's job can launch its own driver and run with its own SparkContext inside the cluster. In the web application JVM, I assume I cannot run multiple SparkContexts in the same JVM.
I want to submit Spark jobs in yarn-cluster mode from Java code in my web application. What is the best way to achieve this? I'm exploring different options and would appreciate your guidance.
1) I can use the spark-submit command-line tool to submit my jobs. But to launch it from my web application, I would need to use the Java ProcessBuilder API, or some library built on top of ProcessBuilder, as in the sketch below. This has two problems. First, it doesn't seem like a clean way to do it; I would prefer a programmatic way to launch my Spark applications. Second, I would lose the ability to monitor the submitted application and get its status. The only crude way to do that would be to read the output of the spark-submit shell, which again doesn't seem like a good approach.
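For concreteness, here is a rough sketch of what I mean; the spark-submit path, main class, and application JAR below are just placeholders for my setup:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class SparkSubmitLauncher {
        public static void main(String[] args) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(
                    "/opt/spark/bin/spark-submit",        // placeholder spark-submit path
                    "--master", "yarn-cluster",
                    "--class", "com.example.MySparkJob",  // placeholder main class
                    "/opt/jobs/my-spark-job.jar");        // placeholder application JAR
            pb.redirectErrorStream(true);
            Process process = pb.start();

            // The only way to track the job from here is to read spark-submit's output.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            int exitCode = process.waitFor();
            System.out.println("spark-submit exited with code " + exitCode);
        }
    }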
2) I tried using the YARN Client API to submit jobs from my Spring application. Below is the code I use to submit a Spark job via the YARN client:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.SparkConf;
    import org.apache.spark.deploy.yarn.Client;
    import org.apache.spark.deploy.yarn.ClientArguments;

    Configuration config = new Configuration();
    System.setProperty("SPARK_YARN_MODE", "true");
    SparkConf conf = new SparkConf();
    // sparkArgs holds the spark-submit-style arguments (--jar, --class, ...)
    ClientArguments cArgs = new ClientArguments(sparkArgs, conf);
    Client client = new Client(cArgs, config, conf);
    client.run();
But when I run the above code, it tries to connect only to localhost. I get this error:
    15/08/05 14:06:10 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
    15/08/05 14:06:12 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
So it does not seem to be able to connect to the remote cluster at all.
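I am guessing that a plain new Configuration() only picks up Hadoop config files from the classpath, which is why it falls back to 0.0.0.0:8032. Is something like the following the right way to point the client at the remote cluster? (The config file paths and rm-host are just placeholders for my environment.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    Configuration config = new Configuration();
    // Option 1: load the cluster's own *-site.xml files, if I can copy them locally.
    config.addResource(new Path("/etc/hadoop/conf/core-site.xml"));  // placeholder path
    config.addResource(new Path("/etc/hadoop/conf/yarn-site.xml"));  // placeholder path
    // Option 2: set the ResourceManager addresses explicitly.
    config.set("yarn.resourcemanager.address", "rm-host:8032");           // placeholder host
    config.set("yarn.resourcemanager.scheduler.address", "rm-host:8030"); // placeholder host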
Please suggest the best way to do this with the latest Spark. Later, I plan to deploy this entire application on Amazon EMR, so the approach should work there as well.
Thank you in advance