What Spark operations are processed in parallel?

I am trying to wrap my head around the whole concept of Spark, and I think I have only a very rudimentary understanding of the platform. From what I understand, Spark has the concept of an RDD, which is a collection of data held in memory, so processing is faster. You transform RDDs using methods like map and flatMap. Because transformations are lazy, they are not processed until you invoke an action on the final RDD. What I don't understand is: when you perform an action, do the transformations run in parallel? Can you assign work to the workers so it runs in parallel?

For example, let's say I have a text file that I load into an RDD:

lines = sc.textFile("file.txt")   // load the text file into an RDD
lines.map(SomeFunction())
lines.count()

What is really going on here? Does SomeFunction() process one section (partition) of the RDD? Where does the parallelism come in?

+4
2 answers

lines is simply the name of the RDD data structure, which lives in the driver and is a partitioned list of lines. The partitions are managed on each of your worker nodes when they are needed.

When the action count is called, Spark works backwards through the chain of transformations needed to produce that result: each partition of the file is read, SomeFunction is serialized and sent over the network to the workers, and it is executed against every line in the partition. If you have many workers, several partitions can be processed at the same time, with SomeFunction mapped over one partition per worker/core.

Each worker then sends the count for the partition it processed back to the driver, and the driver adds the partial counts together to produce the final result.

Note that in your example the result of the map is never used: count is called on the original lines RDD, so SomeFunction does not actually contribute to the count at all.
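To make that concrete, here is a minimal sketch in Scala, as you would type it into spark-shell (where sc already exists); the file name "input.txt" and the upper-casing function are placeholders, not from the question:

val lines = sc.textFile("input.txt")   // RDD[String], split into partitions
val upper = lines.map(_.toUpperCase)   // lazy: nothing runs yet
val n = upper.count()                  // action: one task per partition applies the map,
                                       // and the driver sums the per-partition counts
println(s"$n lines processed")

Here the result of map is assigned to a new RDD (upper) and the action is called on that RDD, so the mapping function actually executes on the workers.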

+3

An RDD is a distributed collection of data. It is split into partitions, and each partition can live on a different node of the cluster.

When you write transformations like map, Spark does not execute anything yet; it only records what has to be done. The work starts only when you call an action.

map is a transformation. Suppose your data is split into three partitions A1, A2 and A3, and Spark has placed them on three nodes N1, N2 and N3, one partition per node. Then map(SomeFunction()) means that N1 applies SomeFunction to every element of A1, N2 to every element of A2, and N3 to every element of A3, all at the same time.
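A small sketch of that partitioning in Scala (spark-shell); the numbers and the partition count of 3 are made up purely for illustration:

val rdd = sc.parallelize(1 to 9, numSlices = 3)               // three partitions, like A1, A2, A3
println(rdd.getNumPartitions)                                 // 3
rdd.glom().collect().foreach(p => println(p.mkString(",")))   // which elements sit in which partition
val doubled = rdd.map(_ * 2)                                  // one map task per partition
println(doubled.count())                                      // runs the three tasks in parallel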

count, "N1, , ", node. , collect . , , , RDD node ( ..).
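A sketch of how those actions differ (the data and the output path "out-dir" are placeholders):

val words = sc.parallelize(Seq("a", "b", "c", "d"), 2)
val n = words.count()             // each partition is counted where it lives; only partial counts reach the driver
val all = words.collect()         // every partition's contents are shipped back to the driver
words.saveAsTextFile("out-dir")   // written out by the executors; nothing large returns to the driver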

So the parallelism comes from the partitions: each transformation becomes one task per partition, and those tasks run concurrently across the workers and across the cores of each worker. You do not assign work to particular workers yourself; Spark's scheduler distributes the tasks (preferring nodes that already hold the data) and gathers the results when an action requires them.
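If you want to watch this happen, here is a minimal self-contained Spark application in Scala; the local[4] master, the file name and the partition count are all assumptions chosen for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object ParallelCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("parallel-count").setMaster("local[4]")  // 4 worker threads
    val sc = new SparkContext(conf)
    val lines = sc.textFile("input.txt", minPartitions = 8)  // 8 partitions
    println(lines.map(_.length).count())                     // map and count execute as parallel tasks
    sc.stop()
  }
}

With local[4] and 8 partitions you get 8 tasks, executed at most 4 at a time; the Spark UI (port 4040 while the application runs) shows the tasks as they execute.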

That is a simplified picture, but once you think of a Spark program in terms of partitions and tasks, the rest of the API is much easier to reason about.

+1
