Spark low-hanging-fruit optimizations, especially the Catalyst optimizer and Spark configuration

I use Spark 2.1.1 with the Scala API, although the language matters less here. I am interested in optimizing Spark queries/pipelines as well as possible. I have read a lot of material (including a great Spark book), I am well acquainted with the Spark documentation, Jacek Laskowski's blog and other resources, and I have been working with Spark for almost two years.

However, there is so much information and there are so many concepts to be aware of, and I do not optimize often enough to know them all. Unfortunately, as soon as everything works 100 %, there are often only a few days or even hours left before the code has to be delivered. I need to identify priorities I can actually apply. I have optimized working Spark code before, but I am looking for the best overall strategy, and I also want to know the best low-hanging fruit so I can catch it. Someday I will know all the knobs there are to turn, but for now at least ten good ones will do. Some of the things I currently consider important are listed below (not really in any order, although the first four are perhaps the most important, I think):

  • Development - Reduce shuffles (exchanges) by repartitioning the dataset on the join/grouping key or by reading from bucketed tables (see the sketches after this list).
  • Strategy - Look at the Spark UI to find out which job and stage took the longest, and dig into those first.
  • Development - Filter datasets before joining wherever possible, to avoid creating high-cardinality many-to-many joins and to avoid shuffling more data than necessary during the join.
  • Configuration - Size executors and memory properly (see the sketches after this list).
  • Development - Stay away from Cartesian products and theta joins as much as possible.
  • Development - Use the built-in spark.sql.functions before writing a UDF, whenever possible (see the sketches after this list).
  • Development - Try to force a broadcast hash join if one side of the join is small enough (see the sketches after this list).
  • Strategy - Never use the RDD API instead of the Dataset/DataFrame API unless there is a specific reason (which in practice means I never use the RDD API).
  • Development - Write Dataset filters so that predicate pushdown can apply to them (prefer several simple filters over one filter with many conditions) (see the sketches after this list).
  • Strategy and development - Always keep the Spark source code open so that it is easier to find type declarations and other implementation details.
  • Something I missed ...
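To make the shuffle-reduction item concrete, here is a minimal Scala sketch. The input path, the `userId` column and the bucket count are assumptions for illustration only; the idea is that repartitioning by the join/grouping key, or writing a bucketed table, lets later joins and aggregations on that key avoid a fresh exchange.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleReductionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-reduction-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input: events keyed by userId and joined/aggregated on that key.
    val events = spark.read.parquet("/data/events")   // path is an assumption

    // Option A: repartition once by the key so later joins/aggregations on
    // userId reuse that partitioning instead of re-shuffling each time.
    val byUser = events.repartition($"userId")

    // Option B: persist as a bucketed table; subsequent reads can join on
    // userId without fully shuffling this side again.
    byUser.write
      .bucketBy(64, "userId")       // bucket count is illustrative
      .sortBy("userId")
      .mode("overwrite")
      .saveAsTable("events_bucketed")

    spark.stop()
  }
}
```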
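For the executor/memory item, a hedged configuration sketch. The numbers are illustrative only and must be sized to the actual cluster; these settings can equally be passed via `spark-submit --conf`, and the `spark.executor.*` values must be in place before the SparkContext is created.

```scala
import org.apache.spark.sql.SparkSession

object ExecutorSizingSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative numbers only, not a recommendation for any particular cluster.
    val spark = SparkSession.builder()
      .appName("executor-sizing-sketch")
      .config("spark.executor.instances", "10")
      .config("spark.executor.cores", "5")
      .config("spark.executor.memory", "8g")
      .config("spark.sql.shuffle.partitions", "200") // default; tune to data volume
      .getOrCreate()

    // ... job logic ...

    spark.stop()
  }
}
```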
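For the built-in-functions-before-UDF item, a small sketch contrasting the two. The column names and the normalization logic are made up; the point is that a UDF is opaque to Catalyst, while the equivalent expression built from `spark.sql.functions` stays inside the optimizer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit, lower, udf}

object BuiltInVsUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("builtin-vs-udf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", Some("GB")), ("Bob", None)).toDF("name", "country")

    // A UDF is a black box to Catalyst: no expression-level optimization, no pushdown.
    val normalizeUdf = udf((s: String) => Option(s).map(_.toLowerCase).getOrElse("unknown"))
    val withUdf = df.withColumn("country_norm", normalizeUdf($"country"))

    // The equivalent built-in expression remains visible to the optimizer.
    val withBuiltins = df.withColumn("country_norm", lower(coalesce($"country", lit("unknown"))))

    withUdf.explain()
    withBuiltins.explain()
    spark.stop()
  }
}
```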
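For the broadcast hash join item, a sketch with assumed table paths and an assumed join column `dimId`. Marking the small side with `broadcast()` nudges Catalyst toward a broadcast hash join instead of a shuffled sort-merge join; the automatic threshold can also be raised.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

    // Hypothetical tables: a large fact table and a small dimension table.
    val facts = spark.read.parquet("/data/facts")     // assumed path
    val smallDim = spark.read.parquet("/data/dim")    // assumed path, small enough to broadcast

    // Explicitly mark the small side for broadcast.
    val joined = facts.join(broadcast(smallDim), Seq("dimId"))

    // Alternatively, raise the threshold for automatic broadcasting (bytes).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

    joined.explain()   // look for "BroadcastHashJoin" in the physical plan
    spark.stop()
  }
}
```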
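For the predicate pushdown item, a sketch with an assumed orders table. Filters expressed as Column expressions can be pushed down to the data source (visible as `PushedFilters` in the physical plan), whereas typed filters on Scala lambdas cannot.

```scala
import org.apache.spark.sql.SparkSession

object PushdownFriendlyFiltersSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown-filters-sketch").getOrCreate()
    import spark.implicits._

    val orders = spark.read.parquet("/data/orders")   // assumed path and schema

    // Simple column-based filters are Catalyst expressions and can be pushed
    // down to the Parquet reader.
    val pushed = orders
      .filter($"status" === "OPEN")
      .filter($"amount" > 100)

    // A typed filter on a lambda, e.g. .filter(o => o.status == "OPEN"), is a
    // black box to the optimizer and cannot be pushed down.

    pushed.explain()   // check the physical plan for PushedFilters: [...]
    spark.stop()
  }
}
```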

The improvements I am most interested in are those that become obvious when looking at a query plan or the DAG visualization, as well as the truisms that once made Spark users/developers go "Aha!" and that you might want to share. Disclaimer: the ten things above are not my "top ten"; for example, using Spark's built-in library functions instead of UDFs is not hugely important (certainly not top-ten material), but I wanted to give some examples of what a good answer might look like.
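Since reading the query plan is the common thread here, a tiny sketch of how I inspect it: `explain(true)` prints the parsed, analyzed, optimized logical plans and the physical plan, the same information shown in the SQL tab of the Spark UI. The toy dataset is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ReadQueryPlanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-query-plan-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

    // Print all plan stages, from parsed logical plan down to the physical plan.
    df.filter($"id" > 1).groupBy($"label").count().explain(true)

    spark.stop()
  }
}
```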

scala apache-spark apache-spark-sql spark-dataframe

No one has answered this question yet.
