I'm trying to wrap my head around the whole concept of Spark. I think I have a very rudimentary understanding of the Spark platform. From what I understand, Spark has the concept of an RDD, which is a collection of "stuff" in memory, so processing is faster. You transform RDDs using methods like map and flatMap. Because transformations are lazy, they are not processed until you invoke an action on the final RDD. What I don't understand is: when you perform an action, do the transformations run in parallel? Can you assign workers to carry out the action in parallel?
For example, let's say I have a text file that I load into an RDD:
    lines = // load RDD
    lines.map(SomeFunction())
    lines.count()
What exactly is going on? Does SomeFunction() process a partition of the RDD? What is the parallel aspect?
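To make this concrete, here is a runnable version of that snippet, as a minimal sketch assuming a spark-shell session (where the SparkContext sc is predefined); the file name input.txt and someFunction are made-up placeholders:

    val lines = sc.textFile("input.txt")          // load a text file as an RDD[String]
    def someFunction(s: String): Int = s.length   // hypothetical per-line function

    val mapped = lines.map(someFunction)          // a transformation: lazy, nothing runs yet
    println(mapped.count())                       // an action: this triggers the actual job

(Note that map returns a new RDD, so its result has to be kept, as above, for count to see the mapped data; calling lines.count() after a discarded lines.map(...) would count the original lines.)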
lines is simply the name of the RDD data structure, which resides in the driver and represents a partitioned list of strings. The partitions are managed on each of your worker nodes when they are needed.
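As a small illustration (again assuming a spark-shell session with sc predefined), the driver-side handle knows about its partitioning without pulling any data back to the driver:

    val lines = sc.parallelize(Seq("first line", "second line", "third line"), 2)
    println(lines.getNumPartitions)    // 2: the handle knows how the data is partitioned
    println(lines.partitions.length)   // the same information via the partitions array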
When you invoke the action count, Spark goes back through the transformations on a per-partition basis: it asks each worker node to apply SomeFunction to every line in its portion of the data (a partition) and to count the results. Each worker sends its count back to the driver, which sums them into the final answer. That is the parallel aspect: every worker runs SomeFunction over its own partitions at the same time, so SomeFunction executes in parallel, partition by partition.
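A sketch of that mechanism (spark-shell; the data and partition count are made up): each partition is counted independently, and summing the per-partition counts gives the same answer as count itself:

    val rdd = sc.parallelize(1 to 100, 4)                            // 4 partitions
    val perPartition = rdd.mapPartitions(it => Iterator(it.size.toLong))
    println(perPartition.collect().sum)                              // 100
    println(rdd.count())                                             // 100, computed much the same way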
An RDD is, by default, a partitioned and distributed collection: the data is split into partitions, and those partitions are spread across the nodes of the cluster. When you load a file into an RDD, Spark decides how to split it, and each worker node ends up holding one or more of the resulting partitions.
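You can see that split directly with glom, which turns each partition into an array (a small spark-shell sketch; in local mode the "nodes" are really just threads):

    val rdd = sc.parallelize('a' to 'i', 3)                     // ask for 3 partitions
    rdd.glom().collect().foreach(p => println(p.mkString(" ")))
    // prints one line per partition, e.g. "a b c", "d e f", "g h i"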
map is also a transformation. Say your file is split into partitions A1, A2 and A3, and Spark has worker nodes N1, N2 and N3, each holding one of the partitions. Calling map(someFunction()) means that N1 runs someFunction on every element of A1, while the other nodes do the same with their own partitions, all at the same time.
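As a small sketch of that behaviour (spark-shell again; the partition ids stand in for A1, A2 and A3), mapPartitionsWithIndex exposes which partition each element is processed in:

    val data = sc.parallelize(1 to 6, 3)
    val tagged = data.mapPartitionsWithIndex { (pid, it) =>
      it.map(x => s"partition $pid handled element $x")
    }
    tagged.collect().foreach(println)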
count, "N1, , ", node. , collect . , , , RDD node ( ..).
collect
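For instance (spark-shell; the numbers here are small enough that even collect would be safe), count only moves one number per partition to the driver, and take(n) is the usual safe way to peek at a few elements:

    val rdd = sc.parallelize(1 to 1000000, 8)
    println(rdd.count())                  // cheap: only per-partition counts travel to the driver
    println(rdd.take(5).mkString(", "))   // safe peek: only 5 elements come back
    // rdd.collect()                      // would ship all 1,000,000 elements to the driver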
Because transformations are lazy, none of this work happens when you write map: Spark merely records what should be done. It builds up a plan of the transformations (the lineage, a DAG of operations) and only executes it, partition by partition, when an action such as count or collect actually demands a result.
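You can watch this laziness in a spark-shell sketch: defining the mapped RDD below returns immediately, toDebugString prints the lineage Spark has recorded, and only the final count launches a job:

    val doubled = sc.parallelize(1 to 10000, 4).map(_ * 2)   // returns instantly: just a plan
    println(doubled.toDebugString)                           // the recorded lineage
    println(doubled.count())                                 // only now does any work happen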
In short, this is where the parallelism in Spark comes from: the partitions. Each worker processes its own partitions independently and in parallel, and the driver coordinates the job and combines the results.