Spark reducer concept

Question

I come from a Hadoop background and have limited knowledge of Spark. Based on what I have learned so far, Spark does not have mapper and reducer nodes; instead it has driver and worker nodes. The workers are similar to mappers, and the driver is (somehow) similar to a reducer. Since there is only one driver program, there will be only one reducer. If so, how can simple programs, such as a word count over a very large data set, get done in Spark? The driver could simply run out of memory.

+5

apache-spark

Hz Jul 31 '15 at 16:58

1 answer

Justin Pihony · Accepted Answer · 2015-07-31T17:36:45+0000

The driver is more of a controller of the work, and it only pulls data back if the operator calls for it. If the operator you are working with returns an RDD / DataFrame / Unit, the data remains distributed. If it returns a native type, it will indeed pull all of the data back.

Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the shuffle points by the stage splits, either in the UI or via toDebugString (where each indentation level is a shuffle).
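To make the shuffle point easier to spot, here is a minimal sketch that builds a word count lineage on a tiny in-memory data set and prints its toDebugString. It assumes Spark is on the classpath and runs in local mode; the object name LineageSketch and the sample lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Builds the word count lineage and returns its debug string.
  // No job runs here: toDebugString only describes the plan.
  def lineage(sc: SparkContext): String = {
    val counts = sc.parallelize(Seq("a b a", "b c"), 2) // hypothetical sample data
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.toDebugString
  }

  def main(args: Array[String]): Unit = {
    // Local mode with two threads is an assumption made for this sketch.
    val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(lineage(sc))
    finally sc.stop()
  }
}
```

The printed lineage should list a ShuffledRDD created by reduceByKey, with its parent RDDs indented beneath it; that indentation step is the stage boundary.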

All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise, it is a mapper.

Finally, to relate this to your word count example:

    sc.textFile(path)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

In the above example, everything up to reduceByKey is done in one stage, since the data loading (textFile), splitting (flatMap), and mapping (map) can all be performed independently of the rest of the data. No shuffle is needed until reduceByKey is called, since it needs to combine all of the data to perform the operation. HOWEVER, this operation has to be associative for a reason: each node performs the operation defined in reduceByKey locally, and only the partial results are merged afterwards. This reduces both memory and network overhead.
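The local-combine-then-merge idea can be sketched with plain Scala collections, without Spark. This is only a simulation under my own assumptions, not Spark's actual implementation: each inner Seq stands in for one node's partition, and the names LocalCombineSketch, combine, and numReducers are hypothetical.

```scala
object LocalCombineSketch {
  // Reduce all values that share a key with f. The same function is applied
  // on both sides of the shuffle, which is why f must be associative.
  def combine(pairs: Seq[(String, Int)], f: (Int, Int) => Int): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2).reduce(f) }

  def wordCount(partitions: Seq[Seq[String]], numReducers: Int): Map[String, Int] = {
    val f: (Int, Int) => Int = _ + _

    // Map side: every "node" counts the words of its own partition only.
    val partials: Seq[Map[String, Int]] = partitions.map { lines =>
      combine(lines.flatMap(_.split(" ")).map((_, 1)), f)
    }

    // Shuffle: route each partial count to a reducer chosen by hashing the key.
    val byReducer: Map[Int, Seq[(String, Int)]] =
      partials.flatten.groupBy { case (word, _) =>
        java.lang.Math.floorMod(word.hashCode, numReducers)
      }

    // Reduce side: each reducer merges the partial counts for the keys it owns.
    byReducer.values.flatMap(pairs => combine(pairs, f)).toMap
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq("a b a"), Seq("b c", "a"))
    println(wordCount(partitions, numReducers = 2))
  }
}
```

Only one partial count per word and partition crosses the simulated shuffle, rather than one pair per word occurrence, which is where the memory and network savings come from.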

NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data does NOT get pulled back to the driver; it merely moves to the nodes that own the same keys, so that the final value for each key can be merged.
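As a small illustration of how keys are routed, the sketch below asks a HashPartitioner which partition a few words would land in. It assumes Spark is on the classpath (no SparkContext is needed for this); the object name PartitionerSketch, the sample words, and the partition count of 4 are arbitrary choices for the example.

```scala
import org.apache.spark.HashPartitioner

object PartitionerSketch {
  // Returns the index of the partition that a key is routed to.
  def partitionFor(key: String, numPartitions: Int): Int =
    new HashPartitioner(numPartitions).getPartition(key)

  def main(args: Array[String]): Unit = {
    val words = Seq("spark", "hadoop", "spark", "shuffle") // hypothetical keys
    // Equal keys always map to the same partition, so their partial
    // counts end up on the same node and can be merged there.
    words.foreach(w => println(s"$w -> partition ${partitionFor(w, 4)}"))
  }
}
```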

Now, if you use an action such as reduce or, worse yet, collect, you will NOT get an RDD back, which means the data is pulled back to the driver and you will need room for it.
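To contrast the two cases, here is a hedged sketch showing which calls keep the data distributed and which return values to the driver. It assumes Spark on the classpath in local mode; ActionSketch, wordCounts, and the sample lines are hypothetical names and data for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ActionSketch {
  // Transformations only: the result is still a distributed RDD.
  def wordCounts(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq("a b a", "b c"), 2) // hypothetical sample data
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("action-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val counts = wordCounts(sc)

      // Actions that return something small to the driver.
      val distinctWords: Long = counts.count()          // a single number
      val totalWords: Int = counts.map(_._2).reduce(_ + _) // a single number
      val sample = counts.take(2)                       // at most two pairs

      println(s"distinct=$distinctWords total=$totalWords sample=${sample.toList}")

      // counts.collect() would return every (word, count) pair to the driver,
      // so the driver would need enough memory to hold the whole result.
    } finally sc.stop()
  }
}
```

For a result that is too large for the driver, an action such as saveAsTextFile writes the partitions out from the workers instead of returning them.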

Here is my fuller explanation of reduceByKey if you want more. Or how this breaks down in something like combineByKey.