Short answer: yes, but only for linear dependencies.
Long answer: compared to the Spark SQL / DataFrame query optimizer? Almost nonexistent.
The Spark core API does not rewrite the DAG execution plan, even when doing so would be clearly beneficial. Here is an example. Consider the DAG:
    A > B \
           > D
    A > C /
where D is collected and A is not persisted (persisting is an expensive operation, and if you don't know whether D will ever be collected, you can't decide when to unpersist A). Ideally, an optimizer should convert this DAG into the linear and much cheaper A > Tuple2(B, C) > D. So let's test it:
    val acc = sc.accumulator(0)
    val O = sc.parallelize(1 to 100)
    val A = O.map { v =>
      acc += 1 // counts how many times A's map function actually runs
      v * 2
    }
    val B = A.map(_ * 2)
    val C = A.map(_ * 3)
    val D = B.zip(C).map(v => v._1 + v._2)
    D.collect()
    assert(acc.value == 100) // holds only if A is evaluated exactly once
Result?
    200 did not equal 100
So the unoptimized DAG was executed: A was evaluated twice, once through B's lineage and once through C's, which is why the accumulator reached 200.
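Since the core API won't do this rewrite for you, the fused plan has to be written by hand. A minimal sketch (the name D2 is made up for illustration; same sc and accumulator trick as above) of the linear A > Tuple2(B, C) > D version:

    val acc = sc.accumulator(0)
    val O = sc.parallelize(1 to 100)
    val A = O.map { v =>
      acc += 1
      v * 2
    }
    // Fuse B and C into a single pass over A: each element of A becomes the
    // pair (B-value, C-value), then the pair is reduced to its sum, so the
    // lineage is the linear O > A > Tuple2(B, C) > D2.
    val D2 = A.map(v => (v * 2, v * 3)).map(t => t._1 + t._2)
    D2.collect()
    assert(acc.value == 100) // passes: A's map function ran once per element

This also illustrates the short answer above: linear dependencies are pipelined into a single stage, so nothing gets recomputed. Alternatively, persisting A (A.cache()) before the zip should keep the count at 100 as well, at the cost of holding A in memory.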
Moreover, such a feature (or anything close to it, e.g. a broadcast join / shuffle join optimizer for RDDs) has never been proposed. Probably because most Spark developers prefer more direct control over execution, or because such optimization would have very limited impact compared to what the SQL query optimizer can do.
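For contrast, a sketch of the kind of rewrite the Spark SQL optimizer does perform (assuming a Spark 2.x+ SparkSession named spark; the DataFrames big and small are made up for illustration):

    import org.apache.spark.sql.functions.broadcast

    val big = spark.range(0, 1000000).toDF("id")
    val small = spark.range(0, 100).toDF("id")

    // Catalyst picks a broadcast hash join on its own when one side is below
    // spark.sql.autoBroadcastJoinThreshold; the hint merely forces the choice.
    val joined = big.join(broadcast(small), "id")
    joined.explain() // physical plan shows BroadcastHashJoin, not SortMergeJoin

Nothing comparable exists for RDD joins: with the core API you pick the join strategy yourself, e.g. by shipping the small side as a broadcast variable.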
tribbloid