PySpark: lambdas in a list comprehension all use the last value of i

I am trying to use the following code, which adds a number to each row of an RDD and returns a list of RDDs using PySpark.

    from pyspark.context import SparkContext

    file = "file:///home/sree/code/scrap/sample.txt"
    sc = SparkContext('local', 'TestApp')
    data = sc.textFile(file)

    splits = [data.map(lambda p: int(p) + i) for i in range(4)]

    print splits[0].collect()
    print splits[1].collect()
    print splits[2].collect()

The contents of the input file (sample.txt):

    1
    2
    3

I expected this output (adding 0, 1, and 2 to the RDD, respectively):

    [1, 2, 3]
    [2, 3, 4]
    [3, 4, 5]

whereas the actual result was:

    [4, 5, 6]
    [4, 5, 6]
    [4, 5, 6]

which means that the list comprehension only used the value 3 for the variable i, regardless of range(4).

Why is this behavior happening?

2 answers

This is due to Python's late binding and is not (Py)Spark specific. i is looked up when lambda p: int(p) + i is used, not when it is defined. Typically that means when the lambda is called, but in this particular context it is when it is serialized to be sent to the workers.
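The same effect can be reproduced in plain Python without Spark; a minimal sketch (the names here are illustrative, not from the question):

    # Plain Python, no Spark: every lambda closes over the same variable i,
    # which is 3 by the time any of them is actually called.
    adders = [lambda p: int(p) + i for i in range(4)]
    print([f("1") for f in adders])  # [4, 4, 4, 4], not [1, 2, 3, 4]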

You can do, for example, something like this:

    def f(i):
        def _f(x):
            try:
                return int(x) + i
            except:
                pass
        return _f

    data = sc.parallelize(["1", "2", "3"])
    splits = [data.map(f(i)) for i in range(4)]
    [rdd.collect() for rdd in splits]
    ## [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
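A common alternative, not part of the answer above but a widely used Python idiom, is to freeze the current value of i with a default argument; a minimal sketch assuming the data RDD from the question:

    # i=i evaluates the default at definition time, so each lambda
    # keeps its own copy of i instead of looking it up later.
    splits = [data.map(lambda p, i=i: int(p) + i) for i in range(4)]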

This is because the lambdas refer to i by reference! It has nothing to do with Spark at all.

You can try the following:

    a = [(lambda y: (lambda x: y + int(x)))(i) for i in range(4)]
    splits = [data.map(a[x]) for x in range(4)]

or, in one line:

    splits = [data.map([(lambda y: (lambda x: y + int(x)))(i) for i in range(4)][x]) for x in range(4)]
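As a quick check (reusing data and the sample file from the question), each resulting RDD should now apply its own offset, because the outer lambda is applied to i immediately and captures its value at that iteration:

    for rdd in splits:
        print(rdd.collect())
    # [1, 2, 3]
    # [2, 3, 4]
    # [3, 4, 5]
    # [4, 5, 6]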
