Scala Spark application profiling

I would like to profile my Scala Spark applications to identify the parts of the code that I need to optimize. I added -Xprof to --driver-java-options, but that doesn't help much because the output is far too detailed. I just want to know how long each function call in my application takes. In other related Stack Overflow questions, many people have suggested YourKit, but it is not cheap, so I would like to use something inexpensive, ideally free.

Are there any better ways to solve this problem?

+5
3 answers

I would recommend using the web UI that Spark itself provides. It exposes a lot of information and metrics about timing, stages, network usage, etc.

You can learn more about this here: https://spark.apache.org/docs/latest/monitoring.html
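The same monitoring page also documents a REST API that returns these metrics as JSON, which can be handy if you want to collect stage timings from a script. A minimal sketch, assuming the driver UI is reachable on Spark's default port 4040; the object and method names (SparkUiApi, stagesUrl, fetch) are made up for illustration:

```scala
import scala.io.Source

// Hypothetical helper, not part of Spark: builds and reads REST endpoints
// of the form described on the monitoring page.
object SparkUiApi {
  // Assumption: the driver UI listens on host:port (4040 is Spark's default).
  def stagesUrl(host: String, appId: String, port: Int = 4040): String =
    s"http://$host:$port/api/v1/applications/$appId/stages"

  // Returns the raw JSON body. This only works while the application
  // (or a history server serving it) is actually running.
  def fetch(url: String): String = {
    val src = Source.fromURL(url, "UTF-8")
    try src.mkString finally src.close()
  }
}
```

For example, SparkUiApi.fetch(SparkUiApi.stagesUrl("localhost", appId)) should return the per-stage metrics for a locally running application, where appId is whatever ID the /api/v1/applications endpoint lists for it.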

In addition, the new version of Spark (1.4.0) has a good visualizer for understanding the stages and tasks of your Spark jobs.

+8

As you said, profiling a distributed process is more complicated than profiling a single JVM process, but there are ways to achieve it.

You can use sampling as a method for profiling threads. Add a Java agent to the executors that captures stack traces, then aggregate those stack traces to find out which methods in your application take the most time.
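To make the idea concrete, here is a rough in-process sketch of sampling and aggregation using only the JVM standard library. It is not how any particular agent is implemented: the object and method names (StackSampler, aggregate, profile, hottest) are invented for illustration, and a real agent would run inside each executor JVM and ship its samples to a central store rather than count them locally.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration of sampling-based profiling.
object StackSampler {
  // Fold one batch of stack traces into a running count of top-most frames.
  def aggregate(counts: Map[String, Int],
                stacks: Iterable[Array[StackTraceElement]]): Map[String, Int] =
    stacks.foldLeft(counts) { (acc, stack) =>
      stack.headOption match {
        case Some(top) =>
          val key = s"${top.getClassName}.${top.getMethodName}"
          acc.updated(key, acc.getOrElse(key, 0) + 1)
        case None => acc // the thread had no frames when it was sampled
      }
    }

  // Sample every live thread `samples` times, pausing `intervalMs` between samples.
  def profile(samples: Int, intervalMs: Long): Map[String, Int] = {
    var counts = Map.empty[String, Int]
    for (_ <- 1 to samples) {
      val it = Thread.getAllStackTraces.values.iterator
      val batch = ArrayBuffer.empty[Array[StackTraceElement]]
      while (it.hasNext) batch += it.next()
      counts = aggregate(counts, batch)
      Thread.sleep(intervalMs)
    }
    counts
  }

  // Methods seen most often on top of a stack, hottest first.
  def hottest(counts: Map[String, Int], n: Int): Seq[(String, Int)] =
    counts.toSeq.sortBy { case (_, c) => -c }.take(n)
}
```

Calling StackSampler.hottest(StackSampler.profile(100, 10), 10) from a background thread gives a crude list of the methods most often found executing. Counting whole stacks instead of only the top frame is what turns this kind of data into a flame graph.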

For example, you can use Etsy's Java agent statsd-jvm-profiler, configure it to send stack traces to InfluxDB, and then aggregate them into flame graphs.

For more information, see this post on profiling Spark applications: https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/

+3

I recently wrote an article and a script that wraps spark-submit and generates a flame graph after running a Spark application.

Here's the article: https://www.linkedin.com/pulse/profiling-spark-applications-one-click-michael-spector

Here's the script: https://raw.githubusercontent.com/spektom/spark-flamegraph/master/spark-submit-flamegraph

Just use it instead of the usual spark-submit.

+3
