Is my understanding of parallel operations in Spark correct?

I am new to Spark and am trying to understand its concepts using Python. When developing Spark applications in Python, I am a little confused about how my data actually gets processed in parallel.

1. Everyone says that I do not need to worry about which nodes, or how many nodes, will be involved in processing the data encapsulated in my RDD variables. Based on my best understanding, I believe that what the Spark cluster does with the code below:

a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
c = b.collect()

can be described as the following steps:

(1) The variable a will be an RDD containing the contents of the text file
(2) Different fragments (partitions) of RDD a will be broadcast to different nodes in the cluster, and the filter will be applied to each fragment on its node
(3) When the collect action is called, the results will be returned from the different nodes to the master and saved as a local variable, c.
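For reference, here is a rough sketch I used to check how the data is split up (sc is my SparkContext and the path is just a placeholder):

a = sc.textFile("data.txt")          # placeholder path
print(a.getNumPartitions())          # how many partitions Spark created for the file
print(a.glom().map(len).collect())   # how many lines each partition holds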

Is my description correct? If not, what is the actual procedure? And if I am right, what is the point of the parallelize method? Does the following code achieve the same thing as the code above?

a = sc.textFile(filename).collect()
b = sc.parallelize(a).filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
c = b.collect()

2. For the following code, will the SQL query be processed in parallel, with the table divided into many partitions?

a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
parts = b.map(lambda x: x.split("\t"))
records = parts.map(lambda x: Row(r0=str(x[0]), r1=x[1], r2=x[2]))
rTable = sqlContext.createDataFrame(records)
rTable.registerTempTable("rTable")
result = sqlContext.sql("select substr(r0,1,2), case when r1=1 then r1*100 else r1*10 end, r2 from rTable").collect()
1 answer

Your description of the first step is correct, but the second and third steps need some corrections.

Second step:

According to the Spark documentation:

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64 MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value.

If you put your file in HDFS and pass its path to textFile, the partitions of RDD a are created based on the HDFS blocks. So in this case the degree of parallelism depends on the number of HDFS blocks. Moreover, the data has already been partitioned and distributed to the cluster machines by HDFS.
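For example (a sketch only; the HDFS path is a placeholder):

a = sc.textFile("hdfs:///data/big_file.txt")   # placeholder HDFS path
print(a.getNumPartitions())                    # roughly one partition per HDFS block

a2 = sc.textFile("hdfs:///data/big_file.txt", minPartitions=32)
print(a2.getNumPartitions())                   # you can ask for more partitions than blocks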

If you point textFile at a path on the local file system and do not specify minPartitions, the default parallelism is used (which depends on the number of cores in your cluster). In this case you need to copy the file to every worker, or place it on shared storage accessible to all workers.
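A sketch of the local-file case (the path is a placeholder and the file must exist at that path on every worker):

print(sc.defaultParallelism)              # depends on the cores available to the cluster
b = sc.textFile("file:///tmp/data.txt")   # placeholder local path, readable on all workers
print(b.getNumPartitions())               # number of partitions chosen for this file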

In both cases, Spark avoids transferring data over the network at this point and instead tries to use the blocks that already exist on each machine. So your second step is not quite right.

Third step:

According to the Spark documentation:

collect(): Array[T]

Return an array that contains all of the elements in this RDD.

At this point, your RDD b is collected from the nodes and brought back into your driver program (the driver node), where it is stored as the local variable c.
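To illustrate the difference (a sketch, not from your code): collect brings every element back to the driver, so for large RDDs it is often safer to pull only a sample with take:

c = b.collect()       # a Python list holding every element of b in the driver's memory
sample = b.take(10)   # at most 10 elements, without materializing the whole RDD
print(sample)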
