I am new to Spark and trying to understand the concept of Spark with Python. When using Python to develop applications for Spark, I am a little confused about how my data actually gets processed in parallel.
1. Everyone says that I don't need to worry about what a node is or how many nodes will be involved in processing the data encapsulated in my RDD variables. So, based on my best understanding, I believe that what the Spark cluster does with the code below:
a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
c = b.collect()
can be described by the following steps:
(1) The variable a will be saved as an RDD containing the expected contents of the text file
(2) Different fragments of RDD a will be distributed to different nodes in the cluster, and the filter will be applied to each fragment on its own node
(3) when the collect action is called, the results will be returned from the different nodes to the master and saved as a local variable, c (I sketch a small partition check right after these steps)
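For reference, here is a minimal sketch of how I imagine I could check steps (2) and (3) myself; I am assuming a SparkContext named sc and the same tab-separated file as above, so all names come from my own snippet and nothing else:

# Minimal sketch, assuming `sc` and `filename` from the snippet above
a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
# Each RDD is split into partitions; I assume each partition is processed
# on whichever executor it is assigned to
print(a.getNumPartitions())   # number of partitions of the input RDD
print(b.getNumPartitions())   # filter() keeps the same partitioning
# collect() brings the remaining rows back to the driver as a plain Python list
c = b.collect()
print(type(c), len(c))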
Is my description correct? If not, what is the actual procedure? And if I am right, what is the point of the parallelize method? Does the following code achieve the same thing as the code above?
a = sc.textFile(filename).collect()
b = sc.parallelize(a).filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
c = b.collect()
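To make my parallelize question concrete, this is the rough comparison I have in mind (again assuming sc and filename as above; the 8 slices are just an arbitrary number for illustration, not something from my real job):

# Rough comparison sketch, assuming `sc` and `filename` as above
rdd_from_file = sc.textFile(filename)          # data stays distributed from the start
local_list = rdd_from_file.collect()           # everything is pulled to the driver first
rdd_from_list = sc.parallelize(local_list, numSlices=8)  # then re-distributed into 8 partitions
print(rdd_from_file.getNumPartitions())
print(rdd_from_list.getNumPartitions())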
2. For the following code, will the SQL query be processed in parallel, with the table divided into many partitions?
from pyspark.sql import Row

a = sc.textFile(filename)
b = a.filter(lambda x: len(x) > 0 and x.split("\t").count("9999-12-31") == 0)
parts = b.map(lambda x: x.split("\t"))
records = parts.map(lambda x: Row(r0=str(x[0]), r1=x[1], r2=x[2]))
rTable = sqlContext.createDataFrame(records)
rTable.registerTempTable("rTable")
result = sqlContext.sql("select substr(r0,1,2), case when r1=1 then r1*100 else r1*10 end, r2 from rTable").collect()
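If it helps, this is how I would try to check whether the SQL part really runs in parallel; I am only using the rTable temp table registered in my snippet above, so nothing here refers to a real dataset:

# Sketch of how I would inspect the SQL step, assuming the `rTable` temp table above
query = "select substr(r0,1,2), case when r1=1 then r1*100 else r1*10 end, r2 from rTable"
result_df = sqlContext.sql(query)        # this is a DataFrame, still distributed
result_df.explain()                      # shows the physical plan Spark will run
print(result_df.rdd.getNumPartitions())  # how many partitions the result is split into
result = result_df.collect()             # only now is everything pulled to the driver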