Spark low-hanging-fruit optimizations, especially the Catalyst optimizer and Spark configuration

I use Spark 2.1.1 with the Scala API, although the language matters less here. I am interested in optimizing Spark queries/pipelines as well as possible. I have read a lot of material (including a great Spark book), I am well acquainted with the Spark documentation, Jacek Laskowski's blog and other resources, and I have been working with Spark for almost two years.

However, there is so much information and there are so many concepts to be aware of, and I do not optimize often enough to know them all. Unfortunately, as soon as everything works 100 %, there are often only a few days or even hours left before the code has to be delivered. I need to identify priorities I can actually apply. I have optimized working Spark code before, but I am looking for the best overall strategy, and I also want to know the best low-hanging fruit so I can catch it. Someday I will know all the knobs there are to turn, but for now at least ten good ones will do. Some of the things I currently consider important are listed below (not really in any order, although the first four are perhaps the most important, I think):

  • Development - Reduce shuffles (exchanges) by repartitioning the dataset on the join/grouping key or by reading from bucketed tables (see the sketches after this list).
  • Strategy - Look at the Spark UI to find out which job and stage took the longest, and dig into those first.
  • Development - Filter datasets before joining wherever possible, to avoid creating high-cardinality many-to-many joins and to avoid shuffling more data than necessary during the join.
  • Configuration - Size executors and memory properly (see the sketches after this list).
  • Development - Stay away from Cartesian products and theta joins as much as possible.
  • Development - Use the built-in spark.sql.functions before writing a UDF, whenever possible (see the sketches after this list).
  • Development - Try to force a broadcast hash join if one side of the join is small enough (see the sketches after this list).
  • Strategy - Never use the RDD API instead of the Dataset/DataFrame API unless there is a specific reason (which in practice means I never use the RDD API).
  • Development - Write Dataset filters so that predicate pushdown can apply to them (prefer several simple filters over one filter with many conditions) (see the sketches after this list).
  • Strategy and development - Always keep the Spark source code open so that it is easier to find type declarations and other implementation details.
  • Something I missed ...
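To make the shuffle-reduction item concrete, here is a minimal Scala sketch. The input path, the `userId` column and the bucket count are assumptions for illustration only; the idea is that repartitioning by the join/grouping key, or writing a bucketed table, lets later joins and aggregations on that key avoid a fresh exchange.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleReductionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-reduction-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input: events keyed by userId and joined/aggregated on that key.
    val events = spark.read.parquet("/data/events")   // path is an assumption

    // Option A: repartition once by the key so later joins/aggregations on
    // userId reuse that partitioning instead of re-shuffling each time.
    val byUser = events.repartition($"userId")

    // Option B: persist as a bucketed table; subsequent reads can join on
    // userId without fully shuffling this side again.
    byUser.write
      .bucketBy(64, "userId")       // bucket count is illustrative
      .sortBy("userId")
      .mode("overwrite")
      .saveAsTable("events_bucketed")

    spark.stop()
  }
}
```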
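For the executor/memory item, a hedged configuration sketch. The numbers are illustrative only and must be sized to the actual cluster; these settings can equally be passed via `spark-submit --conf`, and the `spark.executor.*` values must be in place before the SparkContext is created.

```scala
import org.apache.spark.sql.SparkSession

object ExecutorSizingSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative numbers only, not a recommendation for any particular cluster.
    val spark = SparkSession.builder()
      .appName("executor-sizing-sketch")
      .config("spark.executor.instances", "10")
      .config("spark.executor.cores", "5")
      .config("spark.executor.memory", "8g")
      .config("spark.sql.shuffle.partitions", "200") // default; tune to data volume
      .getOrCreate()

    // ... job logic ...

    spark.stop()
  }
}
```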
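For the built-in-functions-before-UDF item, a small sketch contrasting the two. The column names and the normalization logic are made up; the point is that a UDF is opaque to Catalyst, while the equivalent expression built from `spark.sql.functions` stays inside the optimizer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit, lower, udf}

object BuiltInVsUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("builtin-vs-udf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", Some("GB")), ("Bob", None)).toDF("name", "country")

    // A UDF is a black box to Catalyst: no expression-level optimization, no pushdown.
    val normalizeUdf = udf((s: String) => Option(s).map(_.toLowerCase).getOrElse("unknown"))
    val withUdf = df.withColumn("country_norm", normalizeUdf($"country"))

    // The equivalent built-in expression remains visible to the optimizer.
    val withBuiltins = df.withColumn("country_norm", lower(coalesce($"country", lit("unknown"))))

    withUdf.explain()
    withBuiltins.explain()
    spark.stop()
  }
}
```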
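For the broadcast hash join item, a sketch with assumed table paths and an assumed join column `dimId`. Marking the small side with `broadcast()` nudges Catalyst toward a broadcast hash join instead of a shuffled sort-merge join; the automatic threshold can also be raised.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

    // Hypothetical tables: a large fact table and a small dimension table.
    val facts = spark.read.parquet("/data/facts")     // assumed path
    val smallDim = spark.read.parquet("/data/dim")    // assumed path, small enough to broadcast

    // Explicitly mark the small side for broadcast.
    val joined = facts.join(broadcast(smallDim), Seq("dimId"))

    // Alternatively, raise the threshold for automatic broadcasting (bytes).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

    joined.explain()   // look for "BroadcastHashJoin" in the physical plan
    spark.stop()
  }
}
```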
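For the predicate pushdown item, a sketch with an assumed orders table. Filters expressed as Column expressions can be pushed down to the data source (visible as `PushedFilters` in the physical plan), whereas typed filters on Scala lambdas cannot.

```scala
import org.apache.spark.sql.SparkSession

object PushdownFriendlyFiltersSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown-filters-sketch").getOrCreate()
    import spark.implicits._

    val orders = spark.read.parquet("/data/orders")   // assumed path and schema

    // Simple column-based filters are Catalyst expressions and can be pushed
    // down to the Parquet reader.
    val pushed = orders
      .filter($"status" === "OPEN")
      .filter($"amount" > 100)

    // A typed filter on a lambda, e.g. .filter(o => o.status == "OPEN"), is a
    // black box to the optimizer and cannot be pushed down.

    pushed.explain()   // check the physical plan for PushedFilters: [...]
    spark.stop()
  }
}
```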

The improvements I am most interested in are those that become obvious when looking at a query plan or the DAG visualization, as well as the truisms that once made Spark users/developers go "Aha!" and that you might want to share. Disclaimer: the ten things above are not my "top ten"; for example, using Spark's built-in library functions instead of UDFs is not hugely important (certainly not top-ten material), but I wanted to give some examples of what a good answer might look like.
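Since reading the query plan is the common thread here, a tiny sketch of how I inspect it: `explain(true)` prints the parsed, analyzed, optimized logical plans and the physical plan, the same information shown in the SQL tab of the Spark UI. The toy dataset is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ReadQueryPlanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-query-plan-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

    // Print all plan stages, from parsed logical plan down to the physical plan.
    df.filter($"id" > 1).groupBy($"label").count().explain(true)

    spark.stop()
  }
}
```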

scala apache-spark apache-spark-sql spark-dataframe

No one has answered this question yet.
