RDD List Multiprocessing

I am trying to process a list of RDDs in parallel using Python's multiprocessing module, as follows:

from pyspark.context import SparkContext
from multiprocessing import Pool
def square(rdd_list):
    def _square(i):
        return i*i
    return rdd_list.map(_square)

sc = SparkContext('local', 'Data_Split')
data = sc.parallelize([1,2,3,4,5,6])

dataCollection = [data, data, data]

p = Pool(processes=2)
result = p.map(square, dataCollection)
print(result[0].collect())

I expect a list of RDDs as output, each containing the squares of the elements of data.

But when I run the code, I get the following error:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

My questions:

1) Why is the code not working properly? How can I fix this?

2) Does p.map() actually process the RDDs in parallel?


You cannot pass RDDs to other processes: an RDD only has meaning on the driver, and it cannot be serialized/deserialized for use inside worker processes, which is exactly what Pool.map tries to do. Spark itself already parallelizes the work on an RDD across its partitions, so multiprocessing on the driver is unnecessary.

