I use Spark 2.1.1 with the Scala API, although the language matters less here. I am interested in optimizing Spark queries/pipelines well. I have read a lot of material (including the great Exploring Spark book), I am well acquainted with the Spark site, Jacek Laskowski's blog and others, and I have been working with Spark for almost two years.
However, there is so much information and there are so many concepts to be aware of that I don't optimize often enough to know them all. Unfortunately, as soon as everything works 100%, there are often only a few days or even hours left before the code has to be delivered, so I need to know which priorities to apply. I can tune a working Spark job, but I am looking for the best overall strategy, and I also want to know the best low-hanging fruit so I can grab it first. Someday I will know every knob there is to turn, but for now even ten good ones would do. Some of the things I currently consider important are (not really in any order, although the first four are probably the most important, I think):
- Development - Reduce shuffles (exchanges) by repartitioning the dataset or by bucketing the table (see the first sketch after this list).
- Strategy - Look at the Spark UI to find out which job and stage took the longest, and dig into those first.
- Development - Filter before joining wherever possible, to avoid creating high-cardinality many-to-many joins and to avoid shuffling more data than necessary into the join (sketch below).
- Configuration - Size executors and their memory properly.
- Development - Stay away from Cartesian products and theta joins as much as possible.
- Development - Use the Spark built-in library functions before writing a UDF, whenever possible (sketch below).
- Development - Try to force a broadcast hash join if one side of the join is small enough (sketch below).
- Strategy - Never use the RDD API instead of the Dataset / DataFrame API unless there is a specific reason (which in practice means I never use the RDD API).
- Development - Write Dataset filters so that predicate pushdown can kick in (write several simpler filters instead of one filter with multiple conditions) (sketch below).
- Strategy and development - Always keep the Spark source code handy so it is easier to look up type signatures and other implementation details.
- Something I missed ...
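For the shuffle-reduction point, here is a minimal sketch of what I mean by bucketing and repartitioning. The paths, table and column names are made up, and an existing `SparkSession` is assumed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()

// Write a large table bucketed (and sorted) by the join key, so later joins
// and aggregations on that key can reuse the bucketing instead of shuffling.
spark.read.parquet("/data/orders")            // hypothetical input path
  .write
  .bucketBy(200, "customer_id")               // 200 buckets on the join key
  .sortBy("customer_id")
  .saveAsTable("orders_bucketed")

// Alternatively, repartition explicitly by the key once, up front,
// so several downstream wide operations share a single exchange.
val orders = spark.table("orders_bucketed")
val byCustomer = orders.repartition(orders("customer_id"))
```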
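For the filter-before-join point, a minimal sketch (hypothetical paths and columns, same `spark` session assumed):

```scala
import org.apache.spark.sql.functions.col

val events = spark.read.parquet("/data/events")  // hypothetical large table
val users  = spark.read.parquet("/data/users")   // hypothetical smaller table

// Filter each side BEFORE the join instead of filtering the joined result,
// so less data is shuffled and the join itself stays smaller.
val activeUsers  = users.filter(col("active") === true)
val recentEvents = events.filter(col("event_date") >= "2017-01-01")

val joined = recentEvents.join(activeUsers, Seq("user_id"))
```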
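For the built-in-functions-over-UDF point, a sketch of the difference: a UDF is a black box to the Catalyst optimizer, while a built-in function can be optimized and code-generated. The DataFrame and column names are hypothetical:

```scala
import org.apache.spark.sql.functions.{col, udf, upper}

val people = spark.read.parquet("/data/people")  // hypothetical table with a "name" column

// Avoid: a UDF the optimizer cannot look into.
val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
val viaUdf = people.withColumn("name_upper", upperUdf(col("name")))

// Prefer: the equivalent built-in function.
val viaBuiltin = people.withColumn("name_upper", upper(col("name")))
```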
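For forcing a broadcast hash join, a minimal sketch (hypothetical tables; the threshold value is just an example):

```scala
import org.apache.spark.sql.functions.broadcast

val facts = spark.read.parquet("/data/facts")       // large table
val dims  = spark.read.parquet("/data/dimensions")  // small lookup table

// The broadcast() hint ships `dims` to every executor, turning the join into
// a broadcast hash join and avoiding a shuffle of the large `facts` table.
val enriched = facts.join(broadcast(dims), Seq("dim_id"))

// The size threshold for automatic broadcasting can also be raised:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)  // 100 MB
```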
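And for the predicate-pushdown point, the way I check it is by reading the physical plan; a minimal sketch (hypothetical columns), where the Parquet scan node should list the conditions under PushedFilters:

```scala
import org.apache.spark.sql.functions.col

// Two simple filters rather than one complex expression; look for
// "PushedFilters" on the scan in the explain() output to see what was pushed down.
val filtered = spark.read.parquet("/data/events")
  .filter(col("country") === "DE")
  .filter(col("year") === 2017)

filtered.explain(true)
```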
The improvements I find most interesting are the ones that jump out when looking at a query plan or the DAG visualization, as well as the tips that made Spark users/developers go "Aha!" and that you might want to share. Disclaimer: these ten things are not "my top ten"; for example, using Spark library functions instead of UDFs is probably not that important (certainly not top-ten material), but I wanted to give some examples of what a good answer might look like.
big_mike_boiii