Does `rdd.map(x => f(g(x)))` have better performance than `rdd.map(g).map(f)`?

In Spark, there are two ways to write a chain of RDD transformations.

One is to compose the functions into a single map, keeping it as short as possible:

rdd.map(x => h(f(g(x)))) 

The other is to chain the maps, which is more readable, for example:

 rdd.map(g).map(f).map(h)... 

Personally, I prefer the latter, which I find easier to understand. But some colleagues worry about performance; they think it behaves the same as:

 list.map(g).map(f).map(h) 

and assume that intermediate RDDs get materialized at each step of the chain, so they always use the first form.

Is that true? Is there a performance cost to chaining? I think of an RDD as something like a Stream, so I don't expect a significant difference in performance.

1 answer

Both forms compile down to practically the same pipelined code.

With the first form it is obvious what will happen, while the chained form produces the following (simplified):

 MapPartitionsRDD(
   MapPartitionsRDD(
     MapPartitionsRDD(rdd, iter.map(g)),
     iter.map(f)),
   iter.map(h))

Simplifying further for visualization:

 map(map(map(rdd,g),f),h) 

which at execution time boils down to:

 h(f(g(rddItem))) 

Sound familiar? It is all one pipelined computation per element, brought to you by the joys of lazy evaluation.
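The same per-element fusion can be demonstrated with plain Scala Iterators, no Spark required, since `MapPartitionsRDD` applies exactly this kind of `iter.map` inside each partition. A minimal sketch (the object name `IteratorPipeline` is just for illustration):

```scala
// Chained Iterator.map calls pipeline the same way Spark does inside a
// partition: nothing runs until the iterator is consumed, and then
// g, f, h fire per element with no intermediate collections.
object IteratorPipeline {
  def g(x: Int) = { println(s"g$x"); x }
  def f(x: Int) = { println(s"f$x"); x }
  def h(x: Int) = { println(s"h$x"); x }

  def main(args: Array[String]): Unit = {
    val it = Iterator(1, 2, 3).map(g).map(f).map(h)
    // No output has been printed yet: the maps are lazy.
    it.foreach(_ => ()) // prints g1 f1 h1 g2 f2 h2 g3 f3 h3
  }
}
```

Note that the output interleaves per element (g1 f1 h1, then g2 f2 h2, ...) rather than running each map over the whole collection, which is precisely what distinguishes an Iterator (or RDD) chain from `list.map(g).map(f).map(h)`.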

This can be seen in the example:

 def f(x: Int) = { println(s"f$x"); x }
 def g(x: Int) = { println(s"g$x"); x }
 def h(x: Int) = { println(s"h$x"); x }

 val rdd = sc.makeRDD(1 to 3, 1)

 rdd.map(x => h(f(g(x)))).collect()
 // prints: g1 f1 h1 g2 f2 h2 g3 f3 h3

 rdd.map(g).map(f).map(h).collect()
 // prints: g1 f1 h1 g2 f2 h2 g3 f3 h3

(The `collect()` action is needed to trigger execution, since `map` alone is lazy.)
